Linear Discriminant Analysis (LDA) 101, using RPeter NistrupBlockedUnblockFollowFollowingJan 31Decision boundaries, separations, classification and more.
Let’s dive into LDA!Marcin Ryczek — A man feeding swans in the snow (Aesthetically fitting to the subject)This is really a follow-up article to my last one on Principal Component Analysis, so take a look at that if you feel like it: https://towardsdatascience.
com/principal-component-analysis-pca-101-using-r-361f4c53a9ffIf not just keep reading, we’ll tackle a case without PCA first and then follow up with LDA on PCA-’tranformed’ data afterwards.
As a teaser:AUC for PCA transformed data is higher than the raw 30-dimensional data!“Performing dimensionality-reduction with PCA prior to constructing your LDA model will net you (slightly) better results!”SetupFor this article we’ll be using the Breast Cancer Wisconsin data set from the UCI Machine learning repo as our data.
Go ahead and load it for yourself if you want to follow along:wdbc <- read.
csv", header = F)features <- c("radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension")names(wdbc) <- c("id", "diagnosis", paste0(features,"_mean"), paste0(features,"_se"), paste0(features,"_worst"))The code above will simply load the data and name all 32 variables.
The ID, diagnosis and ten distinct (30) features.
From UCI:“The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
”Why LDA?Let’s remind ourselves what the ‘point’ of our data is, we’re trying to describe what qualities in a tumor contributes to whether or not it’s malignant.
In other words: “If the tumor is – for instance – of a certain size, texture and concavity, there’s a high risk of it being malignant.
”This is really the basic concept of ‘classification’ which is widely used in a wide variety of Data Science fields, especially Machine Learning.
Now, even if you haven’t read my article about Principal Component Analysis I’m sure you can appreciate the simplicity of this plot:2D PCA-plot showing clustering of “Benign” and “Malignant” tumors across 30 features.
What we’re seeing here is a “clear” separation between the two categories of ‘Malignant’ and ‘Benign’ on a plot of just ~63% of variance in a 30 dimensional dataset.
Simply using the two dimension in the plot above we could probably get some pretty good estimates but higher-dimensional data is difficult to grasp (but also accounts for more variance), thankfully that’s what LDA is for, it’ll try to find the ‘cutoff’ or ‘discision boundry’ at which we’re most successful in our classification, so now we know why, let’s get a better idea of how:Consider only two dimension with two distinct clusters.
LDA will project these clusters down to one dimension.
Imagine it creating separate probability density functions for each class / cluster, then we try to maximize the difference between these (effectively by minimizing the area of ‘overlap’ between them):From Sebastian Raschka: https://sebastianraschka.
htmlIn the example above we have a perfect separation of the blue and green cluster along the x-axis.
This means that if future points of data behave according to the proposed probability density functions, then we should be able to perfectly classify them as either blue or green.
Alright enough of this, let’s get into R and try it out!LDA on raw data (All 30 dimensions)Alright on with the show, let’s start by defining our data:wdbc.
data <- as.
data) <- wdbc$idwdbc_raw <- cbind(wdbc.
numeric(wdbc$diagnosis)-1)colnames(wdbc_raw) <- "diagnosis"What this does is it simply removes ID as a variable and defines our data as a matrix instead of a dataframe while still retaining the ID but in the column-names instead.
Now we need to define a train- / test-split so that we have some data we can test our model on:smp_size_raw <- floor(0.
75 * nrow(wdbc_raw))train_ind_raw <- sample(nrow(wdbc_raw), size = smp_size_raw)train_raw.
df <- as.
df <- as.
frame(wdbc_raw[-train_ind_raw, ])This will make a 75/25 split of our data using the sample() function in R which is highly convenient.
We then converts our matrices to dataframes.
Now that our data is ready, we can use the lda() function i R to make our analysis which is functionally identical to the lm() and glm() functions:f <- paste(names(train_raw.
df), "~", paste(names(train_raw.
df)[-31], collapse=" + "))wdbc_raw.
lda <- lda(as.
formula(paste(f)), data = train_raw.
df)This is a little lifehack to paste all the variable names instead of writing them all manually.
You can call on the object ‘wdbc_raw.
lda’ if you want to see the coefficients and group means of your FDA if you like, but it’s quite a mouthful so I wont post the output in this article.
Now let’s make some predictions on our testing-data:wdbc_raw.
predict <- predict(wdbc_raw.
lda, newdata = test_raw.
df)If you want to check the predictions simply call ‘wdbc_raw.
predict$class’EvaluationThis is the exciting part, now we can see how well our model performed!### CONSTRUCTING ROC AUC PLOT:# Get the posteriors as a dataframe.
posteriors <- as.
predict$posterior)# Evaluate the modelpred <- prediction(wdbc_raw.
perf = performance(pred, measure = "tpr", x.
measure = "fpr")auc.
train <- performance(pred, measure = "auc")auc.
train <- auc.
perf)abline(a=0, b= 1)text(x = .
25, y = .
65 ,paste("AUC = ", round(auc.
train[],3), sep = ""))And here we go, a beautiful ROC plot!.Here I’ve simply plotted the points of interest and added a legend to explain it.
Now the point I’ve plotted as the “optimal” cut-off is simply the point in our curve with lowest euclidean distance to the point (0,1) which signals 100% True Positive Rate and 0% False Positive Rate, which means we have a perfect separation / prediction.
So what does this mean?.This means that depending on how we want our model to “behave” we can use different cut-offs.
Do we want 100% true positive rate at the cost of getting some false positives?.Or do we want 0% false positives at the cost of a love true positive rate?.Is it worse to get diagnosed with a malignant (cancerous) tumor if it’s actually benign or is worse to get told you’re healthy if it’s actually malignant?Our “optimal” point has a TRP of 96.
15% and a FPR of 3.
3% which seems decent but do we really want to tell 3.
3% of healthy people that they have cancer and 3.
85% of sick people that they’re healthy?Please keep in mind that your results will most definitely differ from mine since the sample method to do train- / test-splits are random.
Let’s take a look on LDA on PCA transformed data and see if we get some better results.
LDA on dimensionality reduced data (6 dimensions)Please follow my article on PCA if you want to follow along:Principal Component Analysis (PCA) 101, using RImproving predictability and classification one dimension at a time!.“Visualize” 30 dimensions using a 2D-plot!towardsdatascience.
comRight we have our PCA with 6 components, lets create a new dataset consisting of these as well as our response:wdbc.
pcst <- wdbc.
pcst <- cbind(wdbc.
pcst) <- "diagnosis"We’ll be using the EXACT same methods to make our train- / test-splits so let’s skip ahead to the LDA and prediction:wdbc.
lda <- lda(diagnosis ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data = train.
predict <- predict(wdbc.
lda, newdata = test.
df)Now we can simply create our ROC plot in the same manner as before and see what kind of results we get:Right off the bat we’re getting some better results but this could still be pure luck.
We have to run some simulations and compare the two!ComparisonNow depending on your “luck” you might see that the PCA transformed LDA performs slightly better in terms of AUC compared to the raw LDA.
However, this might just be a random occurance.
So let’s do a “quick” T-test on the means of a 100.
000 simulations of the PCA transformed LDA and raw LDA:t.
numeric(AUC_pca))AUC_raw and AUC_pca is simply arrays with the resulting AUC score from each iteration I ran.
I wont bore you with the simulation part since it’s a big chunk of ugly code so just trust me on this!.Also look at the df-count in the test results below:A very low p-value, this means that there’s a statistical difference between the two!.So even though their means only differ by 0.
000137 through 100.
000 trails it’s a statistically significant difference.
In other words:“Performing dimensionality-reduction with PCA prior to constructing your LDA model will net you (slightly) better results!”Because every article needs a fancy plot:That’s it!Follow me on Twitter to stay up to date:Peter Nistrup (@PeterNistrup) | TwitterThe latest Tweets from Peter Nistrup (@PeterNistrup).
Machine Learning, Statistics and Data Analytics.
Follow me on…twitter.
comThanks for reading!.