Rudimentary MLRudimentary k-Nearest NeighborsThe most intuitive ML modelJovan MedfordBlockedUnblockFollowFollowingMay 20Sometimes in data science you find yourself picking up a missile launcher in order to kill a mere ant.
In this case though, we are going to be discussing what is probably the most intuitive machine learning algorithm.
Do not be fooled however, in spite of its simplicity, the kNN algorithm could still be extremely effective in some instances.
Some quick notes:Firstly, kNN is a non-parametric algorithm — this means that it doesn’t assume that your data follows any particular distribution.
Another cool thing is that it can perform classification and regression tasks, both of which we’ll be touching on in this post.
IntuitionFirst we are going to be using a simple representation of a data set to illustrate how we would use kNN to perform a Classification task.
Let Magenta be type A and Black be type B.
These points would make up the given data.
Let Blue be the data point that we wish to classify based on our given data.
To do this, we are going to choose a value k which would be the number of points that are “most similar” to our new data point.
In this 2 dimensional example, we would choose our similarity measure to be the euclidean distance.
So if we let k=2 for example, then we would be looking for the 2 points that are most similar to our new data point.
If we then put that in terms of distance then, we would be looking for the 2 closest points.
The red line touches precisely 2 points.
After this we calculate the proportion of data points per type in the set of our k nearest points.
We then choose the type with the highest proportion and assign our new data point to this type.
In our case we have that the 2 nearest points were both black.
Prop of black = 2/2Prop magenta = 0/2So we would classify our new data point as being of Type B.
On the other hand if we had chosen k to be 8 for instance we would have:Prop of black = 3/8Prop magenta = 5/8In that case we would have assigned our new data point to Type A.
Regression SettingIn this case we are largely following through on the same idea.
For instance, if we were to predict the price of a house using kNN we may have a set up such as the following:We want to find the k most similar points, and use the average of the response values for these points as the predicted value for the new data point in question.
If we let k be 6 and we wanted to predict the price of a 2-bathroom house using this data set, then the most similar data points would first be the other 2-bathroom houses and then the 1 and 3-bathroom houses.
Indeed, the corresponding predicted price value is the average of the prices of the magenta dots.
The resulting prediction is the blue dot.
The Choice of kAs we have seen, the choice of K really does matter.
Note how the value of K changes the performance of the model without any changes in data.
Any such variable is called a hyperparameter.
Indeed hyperparameter optimization is worthy of a whole post in itself.
For the purpose of this post however, we’ll only be mentioning the grid search algorithm.
It is as follows:Choose a set of hyperparameter valuesTrain a model with each of these valuesEvaluate the performance of each model and choose the value that resulted in the least amount of error.
Here’s a generic picture of what that process may look like over an increasing choice of k.
Generic Graph Showing How Test Error(Magenta) and Training Errors(Blue) Change over 1/KFinal ThoughtsFirstly, kNN suffers from the curse of dimensionality, so as the number of input variables grow kNN may start to struggle increasingly.
It is worth considering dimensionality reduction techniques to combat this problem.
Additionally, kNN needs data that is properly scaled.
In other words, you may have to normalize your data as a difference in units between measurements can seriously damage the effectiveness of your model.
One other thing to think about is the fact that parametric methods like linear regression will tend to outperform non-parametric methods like kNN as long as the true relationship in the data actually follows the assumptions of that parametric model.
Funny enough, parametric methods may even perform better than non parametric methods on some data that don’t follow all the assumptions — so don’t rule them out!All in all, kNN is relatively simple to understand and implement therefore making it a great method to use as a benchmark for other techniques.
I hope you enjoyed the read and found it useful in some way, whether it was in learning something new or just refreshing what was already in there.
I myself am a final year Mathematical Finance Undergrad so I’m every bit of a learner as I am a teacher.
That being said I’m most definitely open to any suggestions, corrections or any general feedback that comes to mind as you peruse through the post.
Feel free to connect with me on LinkedIn —Jovan Medford – University of Waterloo – Canada | LinkedInView Jovan Medford's profile on LinkedIn, the world's largest professional community.
Jovan's education is listed on…www.
comOr you can also follow me on Twitter if you prefer —Jovan Medford (@JovanMedford) | TwitterThe latest Tweets from Jovan Medford (@JovanMedford).
A Little Something Special????.| Writer on Math, Finance and Tech…twitter.
com.. More details