I would argue the validation loss is the most important.
Validation loss is how we decide “model A is better than model B”.
This is our lodestar, and it will guide every modeling decision we make.
Training loss is a tool, a tactical necessity to lower the validation loss.
So we’d better be damned right in choosing the right goal, right?

Classification Losses: 3 Common Questions

Let’s stick to binary classification for now, just to have a motivated discussion.
You’ve decided you’re trying to recommend Medium posts to people.
You’re only going to show the user one recommendation, and they’ll either accept it or ignore you and move on to the next website.
For that, you’re building a classifier:

Classifier(person, article) -> click probability

What should you optimize for?

Log Loss

The king of classification.
This is the loss we usually optimize when training a classification model.
As a performance metric, log-loss is a measure of how well calibrated you are in predicting probabilities of a class.
In our example, the metric measures how good we are in predicting the click likelihood.
If you said that something has a 0% chance of happening and it actually did happen then you’re doing a terrible job in estimating probabilities — the log loss will be infinity.
Intuition: a measure of how good you are at predicting probabilities.
Edge case: the loss goes to infinity if the model predicted probability 0.0 and the label was 1 (or vice versa).
Making sense of the number: in binary prediction with a 50–50 prior and a clueless classifier, you should see loss = -ln(0.5) ≈ 0.69.
With N classes and a flat prior, you should see loss = -ln(1/N) = ln(N).
Log loss is notoriously hard to get an intuition for.
A useful trick for binary classification is taking e^(-loss), where loss is the average over your samples.
The number you get is approximately the probability you typically assign to the right class (precisely, the geometric mean of those probabilities).
Definition: loss=-sum(log(p_i) * y_i) where p_i is your predicted probability for a certain class i, and y_i is the label for that class.
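To make those numbers concrete, here is a minimal sketch of binary log loss in plain Python. The helper name `log_loss`, the averaging over samples, the `eps` clipping and the toy labels are all my own assumptions for illustration, not part of any particular library.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss.

    Clipping by eps is an assumption made here so that a prediction of
    exactly 0 or 1 yields a very large loss instead of a math error.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        # -sum(log(p_i) * y_i) over the two classes: click and no click
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy data, made up for illustration.
labels = [1, 0, 1, 0]
clueless = [0.5, 0.5, 0.5, 0.5]
confident = [0.9, 0.1, 0.8, 0.2]

clueless_loss = log_loss(labels, clueless)    # equals ln(2)
confident_loss = log_loss(labels, confident)

# exp(-loss) is the geometric mean of the probability given to the true class.
typical_prob = math.exp(-confident_loss)

print(round(clueless_loss, 4), round(confident_loss, 4), round(typical_prob, 4))
```

The clueless classifier lands on ln(2), and exponentiating the negated loss of the confident one gives back a number you can read as a probability.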
Is this a good measure for a recommendation engine?
Not really.
We don’t really care how likely the user is to click on our returned result.
We want to put the likeliest article at the top.
Predicting the click probability is a related problem, but it’s not the same problem.
As an example, maybe we have the signal that a certain user is a click-fiend, and across the board she’s 10x likelier to click anything compared to other users.
This information is useless for returning the best result for that user — increasing our prediction 10x for all recommended items doesn’t change their order, but lowers our log-loss, since it impacts our click probability by a lot.
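The click-fiend argument can be checked with a small sketch. Everything here is hypothetical: the three click probabilities, the 10x multiplier and the helper names are made up, and I compute the expected log loss analytically rather than sampling clicks.

```python
import math

def expected_log_loss(true_p, pred_p):
    """Expected binary log loss when a click truly happens with probability true_p."""
    return -(true_p * math.log(pred_p) + (1 - true_p) * math.log(1 - pred_p))

def ranking(scores):
    """Indices of the candidates, best score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical click probabilities of an average user for three articles.
average_user = [0.01, 0.02, 0.05]
# Assumed click-fiend: 10x likelier to click on everything.
fiend_truth = [10 * p for p in average_user]

def mean_loss(preds):
    """Average expected log loss on the click-fiend for a list of predictions."""
    pairs = zip(fiend_truth, preds)
    return sum(expected_log_loss(t, p) for t, p in pairs) / len(preds)

loss_ignoring_signal = mean_loss(average_user)  # model unaware of the 10x
loss_using_signal = mean_loss(fiend_truth)      # model that scales by 10x
same_order = ranking(average_user) == ranking(fiend_truth)

print(same_order, loss_using_signal < loss_ignoring_signal)
```

Both models recommend the same article first, yet the one that knows about the click-fiend scores a noticeably better log loss.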
When is this a useful metric?
When you care about the probability of an event happening, as opposed to when you’re ordering recommendations.
As an example, if I’m trying to predict the chance of rain, log-loss would be a very useful metric because it quantifies how good a job I’m doing in predicting the probability that it rains.
Accuracy

This one is quite intuitive.
We threshold our results in the way we intend to use our model (in our case, rather than applying a threshold, we’ll take the top-scoring Medium article amongst all candidates), and ask whether that top result got clicked.
A more accurate name for the metric in this context would be top-1 click-through rate, as this sets the stage for us revisiting our product and recommending the top K results instead of just 1.
Intuition: this is a direct measure of “how often you made the right guess”.
Making sense of the number: Accuracy of 99% might sound like amazing performance, until you consider that the flat prediction of “the user isn’t going to click this article” is a prediction that’s already correct 99.9% of the time.
A baseline model for accuracy is the appearance frequency of the most common class.
In a 50–50 binary problem, this comes out to 50%.
But in guessing whether there will be a hurricane today, 99.99% is the absolute worst you can do, so this number must always be compared against a baseline.
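Here is a rough sketch of that baseline comparison. The label counts (1 click in 1000 impressions) and the helper names are invented for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels: 1 click in 1000 impressions.
labels = [1] + [0] * 999
always_no_click = [0] * 1000

baseline = majority_baseline(labels)
flat_accuracy = accuracy(labels, always_no_click)

print(baseline, flat_accuracy)
```

A model only deserves credit for the accuracy it earns above `baseline`; the flat "nobody clicks" prediction already sits exactly at it.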
Is this a good measure for a recommendation engine?
Yeah, this loss measures the exact use case: whether the top recommendation got picked.
But notice the weakness of this metric.
This metric is quantized, which means even if the model gets better at its job, the accuracy might not move at all.
Let’s say our model improved at ranking article recommendations on Medium, but the chance of being exactly right and guessing the one article the user ended up clicking out of millions of candidates is still very close to 0.
In that case, accuracy as a metric wouldn’t move at all even as we improved our model and made better recommendations.
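A toy sketch of that quantization, with five hypothetical candidates instead of millions. The scores, the clicked index and the helper names are all made up.

```python
def rank_of_clicked(scores, clicked_index):
    """1-based position of the clicked article when sorted by descending score."""
    clicked_score = scores[clicked_index]
    return 1 + sum(s > clicked_score for s in scores)

def top1_hit(scores, clicked_index):
    """True only if the clicked article was the single top recommendation."""
    return rank_of_clicked(scores, clicked_index) == 1

# Hypothetical scores for 5 candidate articles; the user clicked article 3.
clicked = 3
old_model = [0.9, 0.8, 0.7, 0.1, 0.6]  # clicked article ranked last
new_model = [0.9, 0.2, 0.3, 0.8, 0.1]  # clicked article ranked second

old_rank = rank_of_clicked(old_model, clicked)
new_rank = rank_of_clicked(new_model, clicked)

print(old_rank, new_rank, top1_hit(old_model, clicked), top1_hit(new_model, clicked))
```

The new model moves the clicked article from last place to second, but top-1 accuracy records a miss for both.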
Which brings us to the next loss definition.
AUC

It boggles my mind why people define AUC in a way that’s actively hostile for any human to understand.
But just for completeness we’ll start with the dry definition.
AUC stands for Area Under the Curve: the integral of the curve you get by plotting the true-positive rate against the false-positive rate.
Here’s a typical visualization:
[Figure: true-positive rate vs. false-positive rate curve. Caption: If you ever want to confuse someone, go for the classic AUC definition.]
I believe the above definition is useless, because it gives you no understanding of what a specific AUC value means.
So let’s try another definition, which is mathematically equivalent but which I find much more relatable:
In binary classification, an AUC of 0.9 means that given a random negative sample and a random positive sample, 90% of the time your classifier would predict a higher score for the positive sample than for the negative sample.
Intuition: a measure of how good you are at ordering positive samples above negative samples.
Making sense of the number: AUC is the likelihood that your classifier will give a higher prediction to a random positive sample than to a random negative sample.
0.5 is as bad as it gets!
Definition: here
Is this a good measure for a recommendation engine?
Yeah, but let’s think about what this is missing.
We said at the outset that we only show the top article to the user.
So really, this doesn’t capture the way our model is going to get used in this context.
On the other hand, unlike top-1 accuracy, this metric is sensitive to minor improvements in the model, and isn’t plagued by the same pathology as accuracy metrics.
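The pairwise reading of AUC from earlier in this section translates almost word for word into code. This is a brute-force sketch for intuition only: the function name, the toy labels and scores, and the choice to count ties as half a win are my assumptions, and the double loop is far too slow for real datasets.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, an assumed convention for this sketch.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data, made up for illustration: 2 clicks, 3 non-clicks.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]

auc = pairwise_auc(labels, scores)                   # 5 of 6 pairs ordered correctly
constant_auc = pairwise_auc(labels, [0.5] * 5)       # same score for everything

print(round(auc, 4), constant_auc)
```

Fixing one more mis-ordered pair nudges the number up, which is exactly the sensitivity that top-1 accuracy lacks.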
And many more

For the specific case of recommendation problems, there’s a vast list of metrics that are specifically designed for that case: NDCG, GMAP, MRR and the list goes on.
But the purpose of this article isn’t to go deep into the specifics of recommendation engine metrics, but rather to discuss the most common metrics that are useful across the board — and hopefully give a bit of intuition in how to approach the problem of what we measure.
And lastly, a personal observation about measuring progress and outcomes.
As scientists we hate to move the goalposts.
It’s much cleaner to have a single test set, with a single metric, and progressively get better at it.
That rarely happens.
Reality is more interesting than that.
Expect to change your test set, redefine your validation metric, exclude outliers and add new observations to your test set.
Expect to move your goalposts until they actually reflect what you want to accomplish.