How does this metric cope with that problem?

Now, we offer some tips on how to tell good metrics from bad ones.
These are just our own thoughts and ideas, based on our past experiences with measuring models.
You should thoroughly evaluate your own situation to ensure your metrics are fit for purpose.

Metrics meant for non-technical people need to be simple.
They cannot be built, even in part, from obscure math equations or statistics.
Even MSE (mean squared error) is not easy to interpret.
I instead find that MAE (mean absolute error), sometimes normalized to percentages, allows my client to really visualize the performance of the model in their head.
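As a quick illustration (with made-up numbers), here is why MAE, especially expressed as a percentage, is easier to picture than MSE:

```python
import numpy as np

# Hypothetical predictions and true values, e.g. monthly sales forecasts.
y_true = np.array([100.0, 250.0, 80.0, 400.0])
y_pred = np.array([110.0, 230.0, 90.0, 380.0])

mse = np.mean((y_true - y_pred) ** 2)                    # squared units: hard to picture
mae = np.mean(np.abs(y_true - y_pred))                   # same units as the target
mape = np.mean(np.abs(y_true - y_pred) / y_true) * 100   # a percentage

print(f"MSE:  {mse:.1f}")    # 250.0 -- "squared sales"? meaningless to a client
print(f"MAE:  {mae:.1f}")    # 15.0  -- "off by 15 units on average"
print(f"MAPE: {mape:.1f}%")  # 8.9%  -- "off by about 9% on average"
```

The MAE and MAPE numbers map directly onto a sentence a client can repeat back to you; the MSE number does not.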
Plain vanilla accuracy is better than obscure statistics like F1 score or AUC for non-technical people.

They need to tell a story.
They need to allow the person to visualize in their mind people interacting with the model, both when it succeeds and makes mistakes.
This means the metric needs to be framed from the perspective of the end-user and how they will experience the model, rather than describing some internal property of the system.
They need to be in units people understand.
Measuring a self-driving car's image-recognition accuracy on a per-frame basis just doesn't make sense for a non-technical audience. The metric is better constructed in minutes or hours, e.g. the number of minutes or hours driven without a mistake.
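A back-of-the-envelope sketch of this conversion, with an assumed frame rate and an assumed per-frame accuracy:

```python
# Translating a per-frame error rate into "minutes between mistakes".
# The frame rate and accuracy below are illustrative assumptions.
frames_per_second = 30
per_frame_accuracy = 0.9999

errors_per_second = frames_per_second * (1 - per_frame_accuracy)
minutes_between_errors = (1 / errors_per_second) / 60

print(f"{per_frame_accuracy:.2%} per-frame accuracy "
      f"= roughly one mistake every {minutes_between_errors:.1f} minutes")
```

Note how the reframing changes the story: 99.99% sounds nearly perfect, but at 30 frames per second it means a mistake every five or six minutes.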
They generally should be more plain vanilla: less normalized, and with class imbalances included.
The only normalizing you should do is putting a number in terms of percentages.
Non-technical people don't like numbers between 0 and 1, but percentages are easily understood.
Metrics that are meant to be used for internal R&D need to be stable and consistent.

If your metric could change over time without any change in code, then it's not useful for R&D.
For example, your training dataset might be growing rapidly over time. At the very least, you need to lock in your training, testing, and validation sets before you proceed.
You can update them regularly when needed, but you should go for at least a month or two with numbers that are comparable to each other.
The longer the better.
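One minimal way to lock in a split is to freeze it to disk the first time it is computed. The file name, seed, and 70/20/10 proportions below are illustrative, not prescriptive:

```python
import json
import pathlib
import random

def lock_splits(sample_ids, path="splits.json", seed=42):
    """Create a 70/20/10 train/test/validation split once and freeze it on disk.
    Subsequent calls reload the frozen split, so metrics stay comparable."""
    path = pathlib.Path(path)
    if path.exists():
        return json.loads(path.read_text())
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle via a fixed seed
    n = len(ids)
    splits = {
        "train": ids[: int(0.7 * n)],
        "test": ids[int(0.7 * n): int(0.9 * n)],
        "validation": ids[int(0.9 * n):],
    }
    path.write_text(json.dumps(splits))
    return splits

splits = lock_splits(range(100), path="splits_demo.json")
print(len(splits["train"]), len(splits["test"]), len(splits["validation"]))  # 70 20 10
```

Checking the splits file into version control (or versioning it alongside the dataset) makes the "lock-in" explicit and auditable.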
Your metric needs to be relatively invariant to various properties of the samples in your dataset.
It should be normalized across things like the size or complexity of a sample.
Ideally, your metric is invariant to any known class imbalances in your dataset.

Your metric needs to be relatively “centered”, i.e. it should not be really close to perfect, and it should change in a meaningful way as you make improvements.
If you’re already hitting close to 99.
7% but you’re job is still to make improvements on those 0.
3% errors, then you might be better off redefining your metric to be “harder”, so that improvements in those 0.
3% errors are magnified and more noticeable.
Instead of measuring a top-5 classification result, move down to top-1 measurement.
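A small sketch of that hardening step, with toy scores and made-up labels:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best classes per row
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

# Toy scores for 4 samples over 5 classes (assumed numbers).
scores = np.array([
    [0.1, 0.2, 0.5, 0.1, 0.1],   # true class 2 -> top-1 hit
    [0.3, 0.4, 0.1, 0.1, 0.1],   # true class 0 -> top-2 hit, top-1 miss
    [0.2, 0.2, 0.2, 0.3, 0.1],   # true class 4 -> a miss even at top-3
    [0.1, 0.6, 0.1, 0.1, 0.1],   # true class 1 -> top-1 hit
])
labels = [2, 0, 4, 1]

print(top_k_accuracy(scores, labels, k=5))  # 1.0 -- everything passes; useless signal
print(top_k_accuracy(scores, labels, k=1))  # 0.5 -- now improvements are visible
```

The saturated metric (top-5 here) hides all remaining errors; the harder one leaves room for progress to show up.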
Conversely, a metric that stays near 0 even when you're making progress is also not very useful.
The granularity has to be right for the stage of research you're at.
The metric should in general lean more towards the “hard” side.
A very high value on this metric should give you total confidence that your model is really good. It should not leave any uncertainty or open questions.

The metric should provide a “clear answer”: when you get the result on the metric, it shouldn't leave you uncertain about whether your change was actually better or worse.
If you’re researching a specific sub-system or problem with your model, using a purpose built metric for that sub-system or problem is going to give you a more clear answer then your general purpose end-to-end metric.
If your results have a noise level larger than your expected improvement, you need to increase your cross-validation runs.
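A rough rule of thumb for how many runs that takes, assuming run-to-run noise is roughly independent (the z value and example numbers are illustrative):

```python
import math

def runs_needed(noise_std, expected_improvement, confidence_z=2.0):
    """Rough number of cross-validation runs needed so the standard error of the
    mean score falls below the improvement you hope to detect.
    Standard error = noise_std / sqrt(n); solve noise_std / sqrt(n) <= gap / z."""
    return math.ceil((confidence_z * noise_std / expected_improvement) ** 2)

# Example: scores bounce around by ~0.8 points run-to-run,
# and we hope to detect a 0.5-point improvement.
print(runs_needed(noise_std=0.8, expected_improvement=0.5))  # 11 runs
```

The quadratic relationship is the painful part: halving the improvement you want to detect quadruples the number of runs you need.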
It’s much more acceptable to use obscure and constructed metrics.
It’s also more acceptable to have multiple metrics, metrics on specific subsets of the system, and to use graphs and charts instead of single metrics.
Metrics used to measure the relative performance between two models are different from those measuring absolute performance.

When measuring relative performance, we have much more freedom to choose different metrics.
Most metrics we can think of have a well-formed definition of “better” or “worse”, and we have reasonable expectations these correlate with end-user results.
When trying to measure absolute performance (e.g. for communication to managers, external stakeholders, or just for our own evaluation prior to product launch), we need to be much more picky.
The metric needs to be very closely aligned to the “end-to-end” or “final-result” behavior of the system.
Much like the metrics for non-technical people, these metrics need to be constructed from the perspective of users.
They need to be done on the latest possible data, rather than on a training/testing set locked in months ago.
They need to reflect the current expected production performance of this model, if it was deployed to production today.
Metrics used for programmatic purposes, such as hyperparameter optimization or early stopping, are more tolerant of noise but much less tolerant of poor metric construction.
Bayesian optimizers are by definition probabilistic and resistant to noise. These algorithms can extract more knowledge from having 5 times as many runs with 5x noisier results than they can from 5 times fewer runs with 5x less noise.
The problem comes if your metric is poorly constructed: now your Bayesian optimizer may be making your model more and more overfit to the metric itself, while appearing to make it better.
This generally occurs when you are close to the peak performance of the model, where a poorly defined metric prevents the model from hitting the correct peak.
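The noise-tolerance trade-off can be sketched as follows; the metric shape, noise level, and repeat counts here are all invented for illustration:

```python
import random
import statistics

def noisy_metric(params, seed):
    """Stand-in for one training run's score. The underlying quality curve
    and the noise model are hypothetical."""
    rng = random.Random(seed)
    true_quality = 1.0 - (params - 0.3) ** 2   # assumed peak at params = 0.3
    return true_quality + rng.gauss(0, 0.05)   # plus run-to-run noise

def objective(params, n_repeats=5):
    """What we hand to the optimizer: the mean of several cheap, noisy runs.
    Averaging trades wall-clock time for a lower-noise signal."""
    return statistics.mean(noisy_metric(params, seed) for seed in range(n_repeats))
```

With enough repeats, the averaged objective reliably ranks a good parameter setting above a bad one even though any single run is noisy; but no amount of averaging rescues you if `noisy_metric` itself measures the wrong thing.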
Metrics for all purposes need to capture the dimensions along which your algorithm is meant to generalize.

For systems that are meant to predict future behaviour based on past data, it's a common error to measure an algorithm's performance at generalizing across different samples broadly, without considering how those samples are constructed, e.g. doing a 70/20/10 split across the entire dataset. But often what we really care about is measuring its performance across the time dimension, e.g. predicting how a user behaves in October based on their data from September and before.
Doing a standard 70/20/10 split gives your metric an unfair advantage, since the algorithm was trained on data from the future and measured on the past.
Your algorithm does not actually have any future data to learn from in the real operating environment, resulting in a form of leakage.
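A minimal sketch of splitting along the time dimension instead of randomly; the record format and cutoff date are assumptions for illustration:

```python
import datetime

def time_split(records, cutoff):
    """Split by time instead of randomly: train strictly on data before `cutoff`,
    evaluate on data at or after it, mirroring real deployment where the model
    never sees the future. Each record is assumed to be a (date, payload) pair."""
    train = [r for r in records if r[0] < cutoff]
    holdout = [r for r in records if r[0] >= cutoff]
    return train, holdout

records = [
    (datetime.date(2023, 9, 1), "september activity"),
    (datetime.date(2023, 9, 15), "september activity"),
    (datetime.date(2023, 10, 2), "october activity"),
]
train, holdout = time_split(records, cutoff=datetime.date(2023, 10, 1))
print(len(train), len(holdout))  # 2 1
```

A random 70/20/10 split over the same records would mix October data into training, producing exactly the leakage described above.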
Again, these are just some tips and things we have noticed about picking good metrics.
There is no one answer to the question of how to measure a model.
It’s very easy to tweak your measurements just slightly and then get the outcome you desired.
If you’re model isn’t performing well enough to meet expectations, you can just tweak you’re measurement of it.
But this isn’t a sound scientific practice — its just self-delusion.
Poor use of metrics leads to false expectations and false hopes.
Select a metric and stick with it

When I was first deepening my understanding of data science and artificial intelligence, I worked at a startup called Sensibill.
Our basic problem was to extract structured data out of images of receipts.
But none of us had any data extraction expertise — we were just a bunch of full stack developers who had come together to build the solution.
Both the algorithms and the dataset they operated on were being cobbled together as we went.
One of my first experiences with the problem of measuring accuracy comes from those days.
Over the 2 ½ years that I spent developing that system, we kept introducing more and more metrics.
The managers would get confused when a metric indicated great performance but their subjective evaluation of real-world performance was not as good.
The response would be to introduce a new metric.
This would repeat every 3–6 months.
We would create specific subsets of our dataset used for testing (a bunch of them).
We would create different validation routines on results and measure our accuracy against them.
We would have different metrics in different environments (production vs staging vs development).
We had rolling average numbers and monthly numbers, recent-100 numbers, and numbers computed on the fly on testing sets.
We would have some metrics that reported excellent results, and other ones that seemed to indicate that the model was near total shit.
To me, it became metric salad, and none of us really knew how well the system behaved for our end users.
We also didn’t have any clear idea of what was making the system better or worse.
Looking back on those years with more experience, I came to understand that the problem wasn’t that we weren’t trying to measure the accuracy.
We certainly devoted a lot of effort to measurement of the system.
Nor was it that we didn’t know how to measure our system — we certainly had a lot of expertise on that, and some of the metrics we did have I would now consider the “correct” ones that we should have trusted (although I’d make some improvements).
The reason it wasn’t working is that we were fundamentally conflating the different purposes for the different metrics.
We knew the questions we wanted answered.
But we weren’t carefully selecting which of the metrics available to us would give us those answers.
Additionally, our metrics weren’t really fit for purpose and always had some sort of flaw versus what we cared about.
This problem led to a breakdown in communication between different teams and departments, as people's expectations did not align with what they were seeing in the application.
And this wasn’t always in the negative direction — certain users experienced unusually good results from the application — however this only resulted because they uploaded short, easy receipts.
It was a biased sample.
This mismatch between expectation and reality then led to more effort being spent developing ever more metrics.
If you have carefully designed your metric, and considered all the problems associated with it; if you are confident it will actually answer the question you want to answer, then stick with it.
Work to improve your model using it.
Communicate it to people as appropriate.
When you add new metrics, consider it thoughtfully and don’t let yourself become overwhelmed by metric salad.
Have a few specific, thoughtful metrics that are designed for the requirements you have.
Try to upgrade, improve, or replace existing metrics, rather than creating additional ones.
Create temporary metrics to solve specific problems, and then discard them once the problem is solved, or at the very least, confine them to automated tests.
Keep your eyes on the prize.
Conclusion

In this article, we have discussed a process for measuring your model better by going through a thoughtful exercise on what your measurement needs are.
The exercise should not take more than a couple of hours, but exhausting that brain energy and going through the process thoroughly may lead you to better performing models.
The process starts by considering the reason for measuring.
You may in fact find that you have multiple distinct purposes.
At the very least, everyone will experience a dichotomy between the external communication purposes of metrics, and the internal R&D purpose of metrics.
We then proceed to brainstorm a bunch of candidate metrics.
The goal should be to not just presuppose that some commonly used method of measurement is correct for us.
We should really try to consider different methods of measurement, and potential modifications that we can make that will serve the purpose of the metric better.
Only after brainstorming multiple candidate options do we evaluate the pros and cons of each for our purpose.
Finally, after going through this process and actually selecting a metric (or multiple metrics) that serve the purposes that you have, you should stick with them.
Having a set of metrics that can serve as a reliable benchmark over time is good for everyone — stakeholders, managers, and researchers all benefit.
Thank you for reading through this entire series of articles on measuring accuracy.
Originally published at www.