Different models are best for different problems, and often the best model is some clever combination of these models.
Model Voting

For ease of discussion, I will talk about binary classification, where we are trying to predict the answer to a yes-or-no question, e.g. "Will person X survive on the Titanic?"
One common method is hard voting: each model gets one vote for 'yes' or 'no', and the option with the most votes is the prediction.
While this is an elegant option, models have different strengths and weaknesses: if two models are 51% sure of 'yes' and one model is 99% sure of 'no', it does not really make sense to predict 'yes' just because 'yes' received two of the three votes.
This brings us to soft voting, where each model gets extra/fewer votes based on its level of certainty.
A model with 50% certainty gets the fewest votes, and a model close to 0% or 100% gets the most.
People also give extra weight to models that perform well in general, and it is advisable to penalise models whose predictions are highly correlated with those of other models.
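As a concrete illustration, hard and soft voting (including per-model weights) can be sketched with scikit-learn's VotingClassifier. The dataset and the three base models here are my own choices for demonstration, not from any particular Titanic setup:

```python
# A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier.
# The synthetic dataset and model choices are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
]

# Hard voting: each model casts one vote for a class label.
hard = VotingClassifier(estimators, voting="hard").fit(X, y)

# Soft voting: predicted probabilities are averaged, so a very confident
# model can outweigh two models that are barely over 50%. The optional
# `weights` argument gives extra say to models that perform well in general.
soft = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1]).fit(X, y)

print(hard.predict(X[:5]))
print(soft.predict_proba(X[:5]).round(2))
```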
Potential Problems

The k-nearest neighbours (KNN) algorithm compares a data point to its k most similar data points, and uses these 'neighbours' to make a prediction for the point in question by voting.
‘k’ is a number specified by the user.
If k is set to 1, then the model mimics the outcome of the nearest neighbour.
When using scikit-learn's predict_proba method, this sets every predicted probability to 100% or 0%, giving your KNN model the maximum number of votes in the soft voting scheme.
What is the problem here? It is that a model prediction of 100% does not mean there is a 100% chance of 'yes'.
In fact, for such a simple model, the true probability is likely to be much lower than 100%.
It makes sense to use the amount of certainty we ‘actually’ get from the model rather than how certain the model thinks it is.
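The effect is easy to reproduce. In this sketch (synthetic, noisily labelled data of my own choosing), a KNN fit with n_neighbors=1 never returns anything except 0% or 100% from predict_proba, no matter how noisy the labels are:

```python
# Demonstrates the over-confidence problem: with n_neighbors=1, the single
# nearest neighbour decides everything, so predict_proba can only return
# 0.0 or 1.0 even though the true chance of 'yes' is rarely that extreme.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Labels depend on the first feature plus a lot of noise, so no point
# is genuinely a 100% certain 'yes' or 'no'.
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
proba = knn.predict_proba(X)

# Every predicted probability is exactly 0 or 1.
print(np.unique(proba))
```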
Proposed Solution

For the KNN algorithm, if k is small, there are at most k+1 distinct predicted probabilities.
If your dataset is big enough, you can treat each of these individually and see the likelihood of ‘yes’.
For example, with k=5 the only possible predicted probabilities are 0, 0.2, 0.4, 0.6, 0.8 and 1. We find the 'actual probability' with a Bayesian-style estimation: given the information we have (the model's output), how likely is the target to be 'yes'?
So first we would look at all the data points for which KNN predicted 0, see the proportion of 'yes' among those points, and then do the same for each of the other values.
This ‘actual’ probability is the real degree of confidence we have from our model, and is a sensible weight to use when soft voting.
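This per-value calibration can be sketched as follows. The data is synthetic and the name calibration_map is my own; the important detail is that the empirical 'yes' rates are measured on a held-out set rather than the training data:

```python
# Sketch: estimate the 'actual' probability of 'yes' for each of the k+1
# distinct scores a KNN with k=5 can output, using a held-out set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + rng.normal(scale=1.5, size=2000) > 0).astype(int)

X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Each score is one of 0, 0.2, 0.4, 0.6, 0.8, 1.0.
scores = knn.predict_proba(X_cal)[:, 1]

# For each distinct score, the empirical rate of 'yes' among points that
# received that score is our calibrated ('actual') probability.
calibration_map = {}
for s in np.unique(scores):
    mask = scores == s
    calibration_map[s] = y_cal[mask].mean()

for s, p in sorted(calibration_map.items()):
    print(f"model says {s:.1f} -> actual rate of 'yes' ~ {p:.2f}")
```

At prediction time, a model output of, say, 1.0 would be replaced with calibration_map[1.0] before the soft vote, so the KNN only gets as much say as its track record earns it.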
While this works well for KNN, what do we do for something like logistic regression, where each data point has a different predicted probability? I would suggest using buckets, e.g. 0–10% predicted, 10–20% predicted, and so on, and then looking at the likelihood of 'yes' in each bucket.
The smaller the buckets, the more precise our predictions will be; however, we should be careful not to have too few points in a bucket, as the estimates within that bucket then become unreliable.
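The bucketing idea looks something like this sketch (a form of histogram binning; the data, the ten-bucket choice and the 30-point cutoff are my own illustrative assumptions):

```python
# Sketch of bucketing for a model with continuous scores: group held-out
# predictions into 10% bins and use each bin's empirical 'yes' rate as
# the calibrated probability.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 3))
y = (X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=2.0, size=5000) > 0).astype(int)

X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_cal)[:, 1]

edges = np.linspace(0, 1, 11)            # 0-10%, 10-20%, ... buckets
bins = np.clip(np.digitize(scores, edges) - 1, 0, 9)

for b in range(10):
    mask = bins == b
    if mask.sum() < 30:                  # too few points -> unreliable estimate
        continue
    print(f"{edges[b]:.0%}-{edges[b + 1]:.0%}: {mask.sum():4d} points, "
          f"actual 'yes' rate {y_cal[mask].mean():.2f}")
```

For what it is worth, scikit-learn ships related machinery in CalibratedClassifierCV (Platt scaling and isotonic regression), which solves the same "how confident is the model really?" problem without a fixed bucket width.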
In some situations, such as with decision trees, it is probably a good idea to apply some Laplace smoothing to the predicted probabilities before bucketing them, to get a better idea of the model’s confidence.
However, I will not say more about this topic, as it is complicated enough to require its own blog.
Conclusion

With bucketing, making sure that there are enough points in each bucket, and correcting for models that are highly correlated with each other, I feel this is a sensible approach to soft voting.
If this has been suggested before, or you see a major flaw with it, please reach out to me.