Machine learning with the “diabetes” data set in RClassification with KNN, logistic regression, and decision treesWilliam ButlerBlockedUnblockFollowFollowingJan 17Inspired by Susan Li’s article on applying basic machine learning techniques in Python, I decided to implement the same techniques in R.

In addition, I hope to expand somewhat the explanations for why each method is useful and how they compare to one another.

All of the analyses below use the Pima Indians diabetes data set, which can be accessed within R by:install.

packages("mlbench")library(mlbench)data(PimaIndiansDiabetes)# Some of the exact variable names may vary from my subsequent codeOverall, this data set consists of 768 observations of 9 variables: 8 variables which will be used as model predictors (number of times pregnant, plasma glucose concentration, diastolic blood pressure (mm Hg), triceps skin fold thickness (in mm), 2-hr serum insulin measure, body mass index, a diabetes pedigree function, and age) and 1 outcome variable (whether or not the patient has diabetes).

k-nearest neighborsWe’ll begin by applying the k-nearest neighbors method of classifying patients by their similarity to other patients.

For this method (and all subsequent methods), we’ll start by separating the data set into “training” and “test” sets.

We’ll build our model based on the relationship between the predictors and the outcome on the training set, then use the model’s specifications to predict the outcome on the test set.

We can then compare our predicted outcomes to the test set’s actual diabetes status to give us a measure of model accuracy.

For my exercises, I’ll use the sample.

split function from the caTools package.

An illustration of how the number of neighbors affects the class of a test case.

Using the 3 nearest neighbors (solid line) results in a different class than using the 5 nearest neighbors (dashed line).

By Antti Ajanki AnAj — https://commons.

wikimedia.

org/w/index.

php?curid=2170282For k-nearest neighbors, we compute the outcome for each test case by comparing that case to the “nearest neighbors” in the training set.

The assigned outcome depends on how many of these neighbors you decide to look at; the majority class of the three closest neighbors may be different than the majority class of the five closest neighbors (see figure at left).

To ensure we use a number for k that gives better model performance, I performed a two-part cross-validation.

First, I varied the possible values for k from 2 to 10; second, I repeated the splitting of the data into training and test sets 100 times to ensure a robust estimate of model performance for each k.

I used the knn function within the class package and computed model accuracy on the test set for each fold.

all_test_accuracies_knn <- matrix(nrow=100,ncol=9)for (split_number in c(1:100)){ train_ind <- sample.

split(diabetes$Pregnancies,SplitRatio = 0.

8) test_ind <- !train_ind neighbors <- c(2:10) accuracies <- matrix(nrow=1, ncol=9) for (n_neighbors in neighbors){ knn_fit <- knn(diabetes[train_ind,],diabetes[test_ind,],diabetes$Outcome[train_ind],k=n_neighbors) cm <- table(Actual = diabetes$Outcome[test_ind],Predicted = knn_fit) accuracy <- sum(diag(cm))/sum(test_ind) accuracies[n_neighbors-1] <- accuracy } all_test_accuracies_knn[split_number,] <- accuracies}KNN model performance accuracy for varying values of k.

Black line indicates mean of all 100 folds for each value of k; grey ribbon indicates standard deviation.

From this analysis, we can see that k-nearest neighbors performs better for somewhat larger values of k, with performance reaching a maximum of about 73% classification accuracy.

Though there is still some variance depending on the exact data split, using 9 or 10 neighbors seems to yield fairly stable model estimates on the test set.

logistic regressionNext, we’ll apply another of the basic workhorses of the machine learning toolset: regression.

For this data set, where we’re predicting a binary outcome (diabetes diagnosis), we’re using logistic regression rather than linear regression (to predict a continuous variable).

Again, I’ll cross-validate the logistic regression model by repeatedly splitting the data into different training and test sets.

all_test_accuracies_logistic <- matrix(nrow=100,ncol=1)for (split_number in c(1:100)){ train_ind <- sample.

split(diabetes$Pregnancies,SplitRatio = 0.

8) test_ind <- !train_ind logit_fit <- glm(Outcome ~ .

, data=diabetes[train_ind,], family="binomial") p <- predict(logit_fit,diabetes[test_ind,],family="binomial") probs <- exp(p)/(1+exp(p)) test_outcomes <- probs>0.

5 cm <- table(Actual = diabetes$Outcome[test_ind],Predicted = test_outcomes) accuracy <- sum(diag(cm))/sum(test_ind) all_test_accuracies_logistic[split_number] <- accuracy}Histogram of model accuracy for each of the 100 folds of logistic regression.

Mean (red) +- standard deviations (blue) for the KNN approach with k=9 is also shown.

Across all folds, we achieve a mean model accuracy of 77%, with performance ranging from 67–84% depending on the exact training-test split.

Logistic regression appears to be somewhat more accurate on this data set than k-nearest neighbors, even with the optimal choices of k (compare the filled distribution to the vertical lines showing the approximate optimal distribution for KNN).

decision treeFollowing the same logic as used for choosing logistic regression over linear regression, we’ll be building a classification tree rather than a regression tree.

Decision trees construct “nodes” at which data is separated, eventually terminating in “leaves” which give you the model’s assigned class.

Here again, I implemented 100 folds of training-test splits, and then assigned the value of each predictor to an output matrix to compare variable importance across folds.

all_dtree_importance <- matrix(nrow=8,ncol=100)bucketsize <- 10for (split_number in c(1:100)){ train_ind <- sample.

split(diabetes$Pregnancies,SplitRatio = 0.

8) test_ind <- !train_ind tree <- rpart(as.

factor(Outcome) ~ .

, data = diabetes[train_ind,],minbucket=bucketsize, model=TRUE)importance <- t(tree$variable.

importance) importance <- importance/sum(importance) all_dtree_importance[1,split_number] <- importance[,"Glucose"] all_dtree_importance[2,split_number] <- importance[,"BMI"] all_dtree_importance[3,split_number] <- importance[,"Age"] all_dtree_importance[4,split_number] <- importance[,"Insulin"] all_dtree_importance[5,split_number] <- importance[,"DiabetesPedigreeFunction"] all_dtree_importance[6,split_number] <- importance[,"Pregnancies"] all_dtree_importance[7,split_number] <- importance[,"BloodPressure"] all_dtree_importance[8,split_number] <- importance[,"SkinThickness"]}An example tree for this data set is shown below:One of the classification trees for the diabetes data set.

At each leaf, the top number and leaf color indicates the assigned class (blue: 0, green: 1).

The overall importance of blood glucose levels, BMI, and age are all readily apparent, one of the advantages of classification trees over other methods.

Mean importance of each predictor (% of model) +- standard deviation, across 100 splits of the data.

Across decision trees, blood glucose was consistently the most important variable in the tree (>40% of the model), with BMI and age coming in next at 10–15% and the remaining variables contributing <10%.

Overall, the accuracy of the decision tree model was around 74–75%.

random forestRandom forests, in contrast to the simple decision tree used above, aggregate multiple decorrelated decision trees in order to yield a prediction.

When building each set of decision trees, at each split only a sample of predictors is chosen as split candidates, rather than allowing the tree to choose from all possible predictors.

Therefore, each individual tree is less likely to select identical splits to the other trees.

In the case of the randomForest function that I use here, the default number of predictors that can be selected from at each split is floor(ncol(x)/3), or 2.

for (split_number in c(1:100)){ train_ind <- sample.

split(diabetes$Pregnancies,SplitRatio = 0.

8) test_ind <- !train_ind rf <- randomForest(as.

factor(Outcome) ~ .

, data = diabetes[train_ind,],ntree=100) train_accuracy <- sum(diag(rf$confusion))/sum(train_ind) cm <- table(predict(rf,diabetes[test_ind,]),diabetes$Outcome[test_ind]) test_accuracy <- sum(diag(cm))/sum(test_ind) all_train_accuracies_rf[split_number] <- train_accuracy all_test_accuracies_rf[split_number] <- test_accuracy importance <- rf$importance/sum(rf$importance) all_importances_rf[split_number,1] <- importance["Glucose",] all_importances_rf[split_number,2] <- importance["BMI",] all_importances_rf[split_number,3] <- importance["Age",] all_importances_rf[split_number,4] <- importance["Insulin",] all_importances_rf[split_number,5] <- importance["DiabetesPedigreeFunction",] all_importances_rf[split_number,6] <- importance["Pregnancies",] all_importances_rf[split_number,7] <- importance["BloodPressure",] all_importances_rf[split_number,8] <- importance["SkinThickness",]}In contrast to the simple tree classification method above, in the random forest model the importance of each variable is more evenly distributed.

While glucose is still the most importance factor in the model, in the random forest it accounts for only ~25% of the model.

This is intuitive, given that one of the ways in which random forests improve upon decision trees is by decorrelating the bagged trees from one another by making it less likely that every tree will use the same (strongest) predictor variable (in this case, glucose levels) as the first split in the tree.

Overall, this method also offers only a slight (but perhaps still meaningful) improvement over cross-validated decision trees.

While the simple decision trees had a mean accuracy of 74% and ranged from 65–82%, the random forest yielded models with a mean accuracy of 76%, ranging from 68–82%.

Therefore, as expected, the random forest method yielded somewhat more robust, invariant results.

gradient boostingFinally, we’ll fit another ensemble learning method to the data: gradient boosting.

Generally, gradient boosting refers to iteratively fitting a model to the residual of the previous model, thereby improving the overall model fit.

Gradient boosting “boosts” the decision tree model of classification by consecutively fitting gradually more complex trees to the data, and by using the residuals of the previous tree as a guide to the subsequent tree.

The gradient here refers to solving the problem of minimization through gradient descent, that is, finding the gradient at your current value and following the gradient in a decreasing direction.

install.

packages("gbm")library(gbm)all_gb_accuracies <- matrix(nrow=100)all_gb_relative_inf <- matrix(nrow=100,ncol=8)for (split_number in c(1:100)){ train_ind <- sample.

split(diabetes$Pregnancies,SplitRatio = 0.

8) test_ind <- !train_ind gb <- gbm(Outcome ~ .

, data = diabetes[train_ind,], distribution = "bernoulli") vals <- predict.

gbm(gb, diabetes[test_ind,],n.

trees=100) probs <- exp(vals)/(1+exp(vals)) class1 <- probs>0.

5 cm <- table(class1,diabetes$Outcome[test_ind]) gb_accuracy <- sum(diag(cm))/sum(test_ind) all_gb_accuracies[split_number] <- gb_accuracy s <- summary.

gbm(gb,plotit = FALSE) all_gb_relative_inf[split_number,1] <- s$rel.

inf[s$var=="Glucose"] all_gb_relative_inf[split_number,2] <- s$rel.

inf[s$var=="BMI"] all_gb_relative_inf[split_number,3] <- s$rel.

inf[s$var=="Age"] all_gb_relative_inf[split_number,4] <- s$rel.

inf[s$var=="Insulin"] all_gb_relative_inf[split_number,5] <- s$rel.

inf[s$var=="DiabetesPedigreeFunction"] all_gb_relative_inf[split_number,6] <- s$rel.

inf[s$var=="Pregnancies"] all_gb_relative_inf[split_number,7] <- s$rel.

inf[s$var=="BloodPressure"] all_gb_relative_inf[split_number,8] <- s$rel.

inf[s$var=="SkinThickness"]}Once again, we see that glucose levels are the overwhelming primary factor in determining the diagnosis of diabetes.

And in this case, the least important variables (insulin, blood pressure, and skin thickness) are minimized even more greatly than in previous models.

Overall, the gradient boosted model performed slightly better than the random forest, with a mean classification accuracy of 76% ranging from 68–83%, an overall increase of ~0.

6% relative to random forest.

.