Model Selection, Tuning and Evaluation in K-Nearest NeighborsHarrison HardinBlockedUnblockFollowFollowingJun 26It’s a beautiful day in the neighborhoodThe core of the Data Science lifecycle is model building.
Although relatively unsophisticated, a model called K-nearest neighbors, or KNN when acronymified, is a solid way to demonstrate the basics of the model making process …from selection, to hyperparameter optimization and finally evaluation of accuracy and precision (however, take the importance of accuracy with a grain of salt).
If Occam’s Razor has taught us anything, it’s that simplicity is not a bad thing, so let’s dive into looking at an algorithm that is often a component in building other models in Machine Learning which are quite sophisticated.
Model SelectionFor this walkthrough, I’ve already selected the model as KNN.
In choosing a model, do what the best scientists always do first: look to the data.
Do you have a little or a lot of it?The best model selections will have a good trade off between bias and variance, and ideally minimize both.
Since I’ve already mentioned the term Machine Learning, it is important to distinguish that I am not referring to algorithmic bias here, but instead statistical bias.
Algorithmic bias is a hugely important and active topic of discussion, but it is outside the scope of my blog post.
Statistical bias involves the difference between expected and true values of a parameter, and variance refers to how much a model’s predictions fluctuate after multiple repetitions of the learning process on a subset of the data being studied called the training set.
This brings us into a discussion about goodness of fit.
Before getting into the specifics of how KNN works, if we examine the following image the black line is an example of a model with good fit in the abstract sense of identifying blue dots from red dots.
Although not pictured, an underfit model would be a line that runs perfectly diagonally from the top left to the bottom right (underfitting meaning high bias).
The green line is arguably perfect, it classifies all blue dots together and the same for red dots.
However it represents the definition of overfitting, and would not be able to generalize well to new data (more on this in Section III).
In essence, bias and variance taken together will make up the amount of error in a given model that it is possible to reduce:Reducible Error = Squared Bias + VarianceNo matter how perfectly a model is able to minimize bias and variance, there is a certain degree of inescapable error or noise that will always be present in our entropic universe.
This is known as irreducible error.
At its core, the K-nearest neighbors algorithm is really pretty intuitive.
How would this point in space that is equally distant from the two apples and three oranges be classified?.Citrus tips the scale, and now we have four oranges.
Model TuningOften in modeling, both parameter and hyperparameter tuning are called for, but with KNN we need only concern ourselves with the second.
What distinguishes them is whether they come before (hyperparameter) or after (parameter) a model has been fit.
KNN is a relatively simple classification tool, yet it’s also highly effective much of the time.
It gets bandied about that in about a third of all grouping cases, it’s the most effective categorizer.
A third!.This model may be small, but so too is it mighty.
There are different iterations of the algorithm in various programming languages, but I’ll be discussing scikit-learn’s here, which is in Python.
The base algorithm uses Euclidean distance to find the nearest K (with K being our hyperparameter) training set vectors, or “neighbors,” for each row in the test set.
Majority vote decides what the classification will be, and if there happens to be a tie the decision goes to the neighbor that happened to be listed first in the training data.
For me, it’s the best algorithm for visualizing what a hyperparameter actually does.
Pop quiz!.At each of the values of K in the graph, the question mark in the middle would be classified as a blue circle.
But what about when K keeps increasing?.Leave your answer in the comments below for a chance at winning the most valuable prize of all, knowledge!The image below raises an interesting question.
Clearly, when K is equal to 11, the puzzle piece is a smiley face.
But what about when k is four?.According to the documentation, if two groups of neighbors have identical distances but different labels, the result will depend on the ordering of the training data.
What is the optimal K for solving how this piece of the puzzle should be classified?Be aware that in different iterations of the algorithm, this can vary.
For example, in R’s KNN classification system, ties are broken at random.
The famous (read: infamous) Iris data set, a classic in the Statistical canon, is an apt demonstration of how this algorithmic tool can be put to use.
A scatterplot visualization of the four features being analyzed should make it clear that KNN would be able to easily distinguish the three separate species in 3 out of 4 of the comparisons.
One feature matchup is a little more mixed, at least between two of the species being compared, but even in the case of comparing sepal length to sepal width, KNN would be able to distinguish the three species from one another to varying degrees of success, depending on what we set K as.
It is also readily available, should you wish to sandbox what is discussed here and even level-up to ensembling a few methods together.
You could be winning Kaggle competitions before you know it!So now that we have our model picked out and have a better idea of what the hyperparameter means in the context of nearest neighbors, how do we optimize K?.Learning curves.
Seaborn Visualization of a KNN learning curvePictured above we see a learning curve which comes from an excellent Kaggle page that looks at KNN in the context of diabetes prevalence amongst Pima Indians.
We can see clearly that the maxima of the Test Score, 11, denotes the optimal value for the algorithm, K = 11.
This circles back to our discussion about bias and variance.
Too much of door number one (bias), and you would see a learning curve indicative of a simplistic model: low test and low training scores which are also very similar in value.
An excess of door number two (variance) and there will be a noticeable difference between training and test score.
A good balance of both denotes a model which is both accurate and precise.
A model could even be described by a fourth possibility: neither accurate nor precise.
This would be the case in a high bias, high variance situation.
Answer for yourself: how does the learning curve rate here?III.
Model EvaluationIn science, evaluating a model basically means seeing how well it handles new data.
This was hit upon in a general sense when we looked at goodness of fit (remember our red and blue dots).
To evaluate K-nearest neighbors in the context of Machine Learning models at large, we need to weigh some of its advantages and disadvantages.
KNN is used as the building block for many other more advanced algorithms, a phenomenon I briefly touched on earlier that’s known as ensembling, or blending.
You can also fine-tune the algorithm with a few variations common to other areas of ML.
What distinguishes unsupervised from supervised K-nearest neighbors is to what end it is being used — the former being for clustering and the latter classification.
If we are talking about unsupervised KNN, you can switch between a brute force approach, ball tree, KD tree, or even leave it up to the algorithm itself to determine the best way to cluster (auto).
I’d reguard this customizability as a point in favor of KNN, as it allows you the flexibility to handle both small and large data sets.
The division typically occurs on a number of samples close to about 30: below this number, roughly speaking, a brute force strategy is better.
Above it, the tree-based approaches are the way to go.
Too far above it, though, and the time complexity involved is no longer tenable.
That is to say, as N (the number of samples) goes up, the calculation time of KNN increases exponentially, quickly diverging from what can be considered useful.
One of the other notable cons here has to do with the structure of the data you want to work with, and is known as the curse of dimensionality, although this ailment can be addressed by using scikit-learn’s Neighborhood Components Analysis, or NCA for short.
This is also a supervised (learned) distance metric algorithm aimed at improving the accuracy of KNN’s classifications when compared to using the default metric, Euclidean distance.
It is derived from a broader algorithmic strategy to deal with dimensionality issues called a Principal Components Analysis, or PCA.
When NCA is used in conjunction with the K-neighbors classifier, it is elegant, simple and powerful; no complications from additional parameters requiring fine-tuning.
There is also the tremendous added benefit of being able to handle multi-class problems without any increase in model size.
Further refinements, such as the Nearest Centroid classifier (yet another supervised application of this model), only drive home the point that this family of algorithms is foundational in ML.
We are pushing into increasingly advanced territory, and I would refer you to sk-learn’s documentation linked below to delve deeper.
Works ReferencedStep by Step Diabetes Classification-KNN-detailed, Shruti Iyyer.
Model evaluation, model selection, and algorithm selection in machine learning, Sebastian Raschka.
A “short” introduction to model selection, David Schönleber.
A Complete Guide to K-Nearest-Neighbors with Applications in Python and R, Kevin Zakka.
Scikit-Learn’s K-nearest neighbors algorithmshttp://www.