Why Measuring Accuracy Is Hard (and important!) Part 2

What if I use this split instead?

The naive way might be just to shuffle your dataset randomly and split it 3 ways.

But when you train your model, you find that every time you do this, you get different results, even with no changes to your model.

Frustrated, you might believe it’s because the dataset is very unbalanced from one run to the next.

You may decide to use the labels in your dataset to construct training, testing, and validation sets that have even proportions of each of your target labels.

Effectively, you divide your dataset into a bunch of sub-datasets by label, randomly split each of those 3 ways, and then recombine the pieces into the final training, testing, and validation sets.

Now your dataset is class balanced but still randomized.
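As a rough illustration of the per-label splitting described above, here is a minimal sketch in plain Python. The function name `stratified_three_way_split` and the 60/20/20 fractions are my own assumptions for the example, not something prescribed by this article.

```python
import random
from collections import defaultdict

def stratified_three_way_split(samples, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (sample, label) pairs into train/validation/test, label by label."""
    rng = random.Random(seed)

    # Divide the dataset into one sub-dataset per label.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append((sample, label))

    train, validation, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)  # random split within each label
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train.extend(group[:n_train])
        validation.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # whatever is left over

    # Recombine: shuffle each set so it is not ordered by label.
    for part in (train, validation, test):
        rng.shuffle(part)
    return train, validation, test
```

Changing `seed` still moves individual samples between the sets, which is exactly the remaining source of noise discussed next.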

This reduces the spread, but you still get problems: there is still noise in your measurement.

You conclude that there are actually harder and easier samples in your dataset for each label, and these are still getting shifted around between the sets.

But what are you to do?

You might resort to k-fold cross-validation, where, say, you partition your dataset into 5 equal folds, hold out each fold in turn as the 20% testing set, and take the average over the 5 resulting 80/20 splits.

This ensures every sample shows up in the testing set exactly once.

But this method can be costly for many deep learning models, where GPU power is scarce and you are hungry to run as many experiments as possible.
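The 5-way averaging described above could be sketched roughly like this. The `train_and_score` callback is a hypothetical stand-in for whatever trains your model on the training indices and returns its accuracy on the testing indices.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validated_accuracy(n_samples, train_and_score, k=5, seed=0):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k, seed)]
    return mean(scores), scores
```

Note that `train_and_score` is called k times, which is where the extra GPU cost comes from: every experiment now costs 5 training runs instead of 1.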

Maybe you just lock in a fixed testing and training set and only measure against that.

But this can be difficult in dynamic real-world environments where our datasets are continuously evolving and getting larger, effectively requiring us to maintain a database which tags entries as either training, testing or validation.

The extra infrastructure required to maintain a good, stable, consistent, noise-free measure of the accuracy of the model can grow quickly depending on how much you want to reduce the noise.
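One lightweight alternative to a tagging database, which is my own suggestion rather than something from the projects described here, is to derive the split from a hash of a stable sample ID. This sketch assumes every sample has such an ID; the function name and the 10%/10% defaults are hypothetical.

```python
import hashlib

def assign_split(sample_id, test_pct=10, validation_pct=10):
    """Deterministically map a stable sample ID to 'train', 'validation' or 'test'."""
    digest = hashlib.sha256(str(sample_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"
```

Because the assignment depends only on the ID, a sample never migrates between sets as the dataset grows. The trade-off is that the split is only approximately 80/10/10, and it is not balanced by label.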

Noise in the dataset

Another big source of problems that I’ve seen is noisy datasets.

Noisy and error-filled datasets are the bane of my existence.

Having dealt with third-party data outsourcing firms, I’ve seen firsthand how difficult it is to get high-quality, cleanly labelled data.

The problem is inherent in the task — labelling data sucks.

No matter how many fancy tools we develop to assist in labelling, it still remains a mundane task of data entry.

It’s error-prone by nature.

People take many steps to try and solve this problem:

Maybe we gamify the process, trying to encourage users to “have fun” while providing labelled data.

We might be able to get our users to provide labelled data for us as part of their work, when an AI makes “suggestions” that they can then correct and modify. The user is already encouraged to be careful because of the value of the work to them.

We might use techniques like double verification (sending the same item to multiple people to generate a consensus), or random sampling and verification by a manager.

Some people even use AI analysis techniques which try to identify outliers in the dataset.

No matter how we do it though, the datasets we are working with are often noisy.
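The double verification idea mentioned above can be sketched as a simple majority vote. This is a minimal illustration under my own assumptions: the function name and the `min_agreement` parameter are hypothetical, and real annotation pipelines often use more elaborate agreement rules.

```python
from collections import Counter

def consensus_label(votes, min_agreement=2):
    """Return the majority label, or None when annotators do not agree enough."""
    if not votes:
        return None
    (label, count), *rest = Counter(votes).most_common()
    tied = bool(rest) and rest[0][1] == count
    if tied or count < min_agreement:
        return None  # no consensus: send the item back for another review
    return label
```

Items that come back as `None` are the ones worth routing to a manager or a third annotator.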

On the machine learning side, as long as the dataset isn’t too noisy, we can often still teach the algorithm to learn the pattern.

Deep learning is especially good at generalizing despite a lot of noise.

But on the measurement and validation side, noise in the dataset gives us a few inherent problems:

We can’t know if our accuracy metric improved because we are overfitting on some poorly labelled samples, or because the model successfully generalized. This means a larger increase in accuracy is needed to confirm an improvement.

There is no way to achieve perfect accuracy, even if our model has perfectly generalized the data.

The best possible accuracy is some unknowable, immeasurable number.

If your dataset is really noisy (10% or more of the labels wrong), this can make you really spin your tires when you’re sitting at 89.7%, trying to figure out more ways to improve your model.

Your model could actually be 100% accurate on real world data, and you would have no way to know it.
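To make that ceiling concrete, here is a small simulation under simplifying assumptions of my own: binary labels, exactly 10% of the annotations flipped at random, and a hypothetical model that always predicts the true label.

```python
import random

def measured_accuracy_of_perfect_model(n_samples=10_000, noise_rate=0.10, seed=0):
    """Score a model that is always right against annotations that are not."""
    rng = random.Random(seed)
    true_labels = [rng.randint(0, 1) for _ in range(n_samples)]

    # Corrupt a fixed fraction of the annotations by flipping them.
    noisy_labels = list(true_labels)
    for i in rng.sample(range(n_samples), round(noise_rate * n_samples)):
        noisy_labels[i] = 1 - noisy_labels[i]

    predictions = true_labels  # hypothetical model that is 100% correct
    correct = sum(p == y for p, y in zip(predictions, noisy_labels))
    return correct / n_samples

print(measured_accuracy_of_perfect_model())  # 0.9
```

The perfect model scores 90% because the metric can only ever agree with the labels that happen to be right. In a real dataset you do not know the noise rate, which is why the ceiling is unknowable.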

Consistent Errors in the dataset

Probably even worse than having noise or random mistakes in your dataset are the consistent errors that are made by your annotation team.

If there are errors made in consistent ways, then the machine learning algorithm may attempt to learn those patterns of errors as part of its overall learning.

I have been burned by this on several real world projects.

On a recent project we executed with a large Canadian bank, we had the provided data annotated by a third party.

We did not perform sufficient oversight and testing.

The samples we received were good, but we failed to check the full dataset.

We went ahead and trained our model, and eventually hit very high accuracies.

But when presenting the algorithm to the client, they quickly pointed out that the final results appeared to be filled with errors.

When diving deep, we realized that not only had much of the dataset been annotated incorrectly, it was consistently annotated incorrectly in the same ways.

Therefore, it was still possible for us to achieve a high degree of accuracy by every measurement we had available, and for the model to still be wrong.

The model had indeed learned the patterns: it learned precisely the patterns of errors which the people annotating the dataset had made.

It was clear that we could not just check the ‘provided samples’ that were emailed to us during the process, as these might have been prepared and scrutinized by managers who correctly understood the requirements.

We needed to analyze and validate the final samples that were put together by all of the various low-paid data entry people at this company, who may or may not have received sufficient instruction to annotate the data correctly.

In my opinion, this is one of the hardest and most frustrating aspects of doing real world AI projects.

You can have great accuracy using well-validated techniques of measuring accuracy, and still ultimately be wrong.

Multiple Metrics in your pipeline

Another problem that sometimes comes up when measuring accuracy is that there is simply too much to measure.

You might have a dozen different metrics representing different stages of your system.

This issue came up for me when building a text extraction engine recently.

The model took as input a word and its surrounding context, and produced a classification for that word.

However, our pipeline consisted of two stages.

The first machine learning algorithm just decided between null or extract, and the second stage made the final classification.

In practice we landed on a four-layered system, but the two-layered system will suffice for explanation.

The setup looks like this: the word and its context go into stage #1, which decides between null and extract, and anything marked extract goes on to stage #2 for the final classification.

So now we have two different machine learning algorithms with two independent measurements of accuracy.

We also have several different failure cases:

False Positives at #1 (extract when it should be null)

False Negatives at #1 (null when it should be extract)

False Positives at #2 (a category when it should be null)

False Negatives at #2 (a null when it should be a category)

Additionally, we have the added complexity that our second stage algorithm is able to partially compensate for errors in the first stage.

For example, a False Positive at the first stage might still be correctly predicted as a null in the second stage, and still result in the correct final output.

Therefore, the accuracy measurements of these two stages can interact in complex, non-linear ways.

Worse accuracy on False Positives in layer #1 does not necessarily result in lower accuracy in the final output.

In fact, worse accuracy on layer #1 could mean precisely the opposite.

We had to tilt layer #1 towards more false positives in order to get this approach to work and produce a better final accuracy.

By default, it would create far more False Negatives than False Positives, reducing the accuracy versus just a single layered system.

But tilting layer #1 towards more false positives (reducing its accuracy) meant it was only filtering out entries it was highly confident were actually nulls (True Negatives), and everything else could be passed to layer #2 for a final classification.

This resulted in an improvement in the final accuracy of the model.

Thus, hurting layer #1 in that specific way made the overall model more accurate.

The filtering layer (layer #1) helped the model deal with the unbalanced classes and high degree of nulls in the output, but only when measured and used in the right way.
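The interaction described above can be reproduced on a toy example. Everything in this sketch is made up for illustration: the ten items, their layer #1 scores, the category names and the two thresholds are hypothetical, and layer #2 is represented simply by the prediction it would make if the item reached it.

```python
# Toy, hand-made data: (true_label, layer1_extract_score, layer2_prediction).
ITEMS = [
    ("null",   0.02, "null"),
    ("null",   0.05, "null"),
    ("null",   0.10, "null"),
    ("null",   0.35, "null"),    # layer #2 would still say null
    ("null",   0.40, "null"),
    ("null",   0.45, "null"),
    ("date",   0.35, "date"),    # real extraction with a low layer #1 score
    ("amount", 0.45, "amount"),
    ("date",   0.90, "date"),
    ("amount", 0.80, "amount"),
]

def evaluate(items, threshold):
    """Return (layer #1 accuracy, end-to-end accuracy) for a given threshold."""
    layer1_correct = 0
    final_correct = 0
    for true_label, score, layer2_prediction in items:
        passed = score >= threshold          # layer #1 says "extract"
        should_pass = true_label != "null"
        layer1_correct += passed == should_pass
        # Items filtered out by layer #1 are emitted as null.
        final = layer2_prediction if passed else "null"
        final_correct += final == true_label
    return layer1_correct / len(items), final_correct / len(items)

print(evaluate(ITEMS, threshold=0.5))  # (0.8, 0.8)
print(evaluate(ITEMS, threshold=0.3))  # (0.7, 1.0)
```

Lowering the threshold adds three false positives at layer #1, so its own accuracy drops from 80% to 70%. Layer #2 corrects all three and also recovers the two extractions that were previously filtered out, so the end-to-end accuracy rises from 80% to 100%.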

By itself, there is nothing wrong with having multiple stages of processing, and even multiple measurements of accuracy at these various stages of processing.

However, too much of a good thing can create new problems:

It creates confusion when some of your metrics improve, but other ones get worse. What has actually happened?

It’s harder to communicate to stakeholders what their expectations of the model should be.

It’s harder to understand what the final behavior of your model is.

In our experience, when you are in this situation, there is only one solution: focus all R&D on improving your end-to-end or final-result metric, that is, the accuracy exiting the final stage of processing from the system.

We don’t mean that you shouldn’t measure the intermediate accuracies — they can be incredibly useful to understand how the various processing stages are interacting and to come up with ideas on what might work.

But when it comes to gauging whether an improvement is actually an improvement, only the final-result accuracy matters.

You have multiple metrics with multiple levels of granularity

This problem is similar to the problem of multiple metrics in your pipeline, but it manifests itself for a different reason.

Let’s consider the problem of extracting data from receipts, where I saw this issue most saliently.

The basic problem is to take an image of a receipt and convert it into fully structured data.

The process involved an OCR engine, then letter by letter classification, then post-processing.

If you are paying attention, you’ll notice that this setup also has multiple metrics across the pipeline.

The OCR engine, letter by letter classification, and post-processing could all be making mistakes and could yield differing measurements of accuracy at each stage.

They also had different measurements depending on whether their prior input was assumed to be accurate. For example, we can measure the letter by letter accuracy assuming OCR accuracy was perfect (by inputting ground-truth OCR data), and we can measure the letter by letter accuracy on the raw OCR output, including mistakes.

However, I want to concentrate on another wrench that was thrown in the measurement of accuracy here in the middle section of the system, the letter by letter classification.

The most obvious way of measuring the accuracy would seem to be to literally measure the fraction of letter classifications that are correct.

But the problem with this approach is that most of the characters are to be classified as “null”, or don’t extract.

Under this metric, it was relatively easy for us to hit accuracies of 97% and 98%.

But as we quickly learned, those supposed 2% or 3% of errors mattered a lot.

Many of the receipts we were analyzing weren’t short Starbucks receipts; they were behemoths.

In these receipts, almost 95% of the characters were to be categorized as null.

However, the remaining 5% of characters were the ones we cared about.

According to our traditional metric, these receipts would get a 95% accuracy even if our model did nothing but predict null.

The measurement was naturally unstable from one receipt to the next, and didn’t give us adequate insight into what was actually happening.
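The null-dominated receipt above is easy to reproduce. In this sketch the label names are hypothetical, and the second number (accuracy restricted to the characters whose true label is not null) is an extra measurement I am adding for contrast, not one of the metrics used on the project.

```python
def character_metrics(true_labels, predicted_labels, null_label="null"):
    """Return (overall accuracy, accuracy on the non-null characters only)."""
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    non_null = [(t, p) for t, p in pairs if t != null_label]
    non_null_accuracy = (sum(t == p for t, p in non_null) / len(non_null)
                         if non_null else None)
    return accuracy, non_null_accuracy

# A 100-character receipt where only 5 characters should be extracted.
truth = ["null"] * 95 + ["total"] * 5
lazy_model = ["null"] * 100  # a model that does nothing but predict null

print(character_metrics(truth, lazy_model))  # (0.95, 0.0)
```

The headline number says 95%, while the model gets none of the characters we actually care about.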

So the problem spawned a variety of alternative measurements of accuracy, meant to give us more insight into the behavior of the system and the types of errors it was making:

Word level accuracy: % of words with no mistakes

Line level accuracy: % of lines with no mistakes

Section level accuracy: % of sections (receipts were grouped into header, items, taxes, total, footer) with no mistakes

Receipt level accuracy: % of whole receipts with no mistakes

It was pretty profound to find out that our model with a character level accuracy of 98% had a receipt level accuracy of 67%.

Receipts had at a minimum 200 characters, so we would expect on average at least 4 errors per receipt at the character level.

You might suppose, then, that receipt level accuracy should be close to 0%, since on average receipts should have had at least 4 character level errors, and a single error makes the whole receipt a mistake by the receipt level metric.

But most of our receipts went off without a hitch, perfectly classifying every character.

Other receipts were complete clusterfucks, with as much as 20% to 30% misclassified.
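A quick calculation shows why the naive expectation is so far off. The first function assumes, purely for the sake of argument, that every character fails independently. The second computes both metrics from per-receipt error counts; the three receipts in the example are made-up numbers chosen to land near the figures above, not real project data.

```python
def independent_receipt_accuracy(char_accuracy, chars_per_receipt):
    """Chance of a flawless receipt if every character failed independently."""
    return char_accuracy ** chars_per_receipt

def accuracy_at_two_levels(errors_per_receipt, chars_per_receipt):
    """Return (character level accuracy, receipt level accuracy)."""
    n_receipts = len(errors_per_receipt)
    total_chars = n_receipts * chars_per_receipt
    char_level = 1 - sum(errors_per_receipt) / total_chars
    receipt_level = sum(e == 0 for e in errors_per_receipt) / n_receipts
    return char_level, receipt_level

# Evenly spread errors: almost no receipt would survive.
print(f"{independent_receipt_accuracy(0.98, 200):.1%}")  # 1.8%

# Clustered errors: two clean receipts and one bad one.
char_level, receipt_level = accuracy_at_two_levels([0, 0, 12], chars_per_receipt=200)
print(f"{char_level:.0%} of characters, {receipt_level:.0%} of receipts")
```

The same 98% character level accuracy is compatible with roughly 2% or roughly 67% receipt level accuracy, depending entirely on whether the errors are spread out or clustered in a few bad receipts.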

When it’s possible, having metrics with different levels of granularity is helpful to understand how your model is failing.

But they can also add confusion and noise to your measurements, making you spin in circles when some metrics improve but others become worse.

More on what to do in this situation will come in article #4.

You only have ground-truth data for an intermediate step in the system

This issue shows up surprisingly often; in fact, it showed up for us in the receipt example above as well.

You have a machine learning system where you have annotated the most important and difficult part of the learning that needs to take place.

But your algorithm has some sort of post-processing that needs to take place before it gets to the final result.

And this post-processing is based on some simple heuristics and assumptions that are specific to the problem you are experiencing.

A salient and simple example of just such a heuristic was when I was doing data extraction on spreadsheets.

Each cell in the spreadsheets had to be classified into one of a few categories.

The cells would then be grouped together into a set of outputs.

The setup was roughly as follows:

The system needed to be able to handle a variety of different spreadsheet formats that had been hand prepared by different people, with arbitrary decisions for formatting and structure.

One reasonable way to construct this system is to get all of the cells classified.

Then you create some predictive features, and train an algorithm to predict the classification for each cell.

But to get the final result, you must make one more important processing step — you must group together the cells.

In our example here, the method would seem obvious: any cells on the same row should be grouped together into the same output.

The nature of 99% of spreadsheets is that people have one entry per line.

You might, then, consider it safe to have only the classified cells as ground-truth data, without any ground-truth data for the final output of your algorithm.

You optimize your model, measure your accuracy and get 99%.

Hooray!

But wait, there’s a problem.

Inside your dataset, there were some spreadsheets that put more than one entry on the same line (yes, this actually happened to me).

Now it becomes clear that having ground-truth data only at the cell level is a liability.

A key assumption in our post-processing, which is that there is only one output per line, has been violated.

The accuracy we measured from our system was excellent.

Our code worked flawlessly as designed, and all of our unit tests passed.

But the system as a whole was still producing invalid results, and we didn’t know it initially because we only had ground-truth data on an intermediate step in the system, and not on the final result.

This problem can manifest itself in any situation where you have additional post-processing steps after your machine learning algorithm finishes, but don’t have any way to measure the accuracy of that final output.
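Here is a minimal sketch of how this failure slips past a cell level metric. The cell format, the category names and the side-by-side layout are all hypothetical; the layout is just one way the "one output per line" assumption can break, and may not be the one I ran into.

```python
from collections import defaultdict

def group_cells_by_row(cells):
    """Post-processing heuristic: one output record per spreadsheet row.

    `cells` is a list of (row, column, category, value) tuples.
    """
    rows = defaultdict(dict)
    for row, _column, category, value in sorted(cells):
        if category != "null":
            # A second value of the same category silently overwrites the first.
            rows[row][category] = value
    return [rows[r] for r in sorted(rows)]

def record_accuracy(predicted_records, true_records):
    """Fraction of ground-truth records reproduced exactly in the final output."""
    matched = sum(record in predicted_records for record in true_records)
    return matched / len(true_records)

# Two entries side by side on the same row, with every cell classified correctly.
cells = [(0, 0, "name", "Alice"), (0, 1, "amount", "10"),
         (0, 2, "name", "Bob"), (0, 3, "amount", "20")]
true_records = [{"name": "Alice", "amount": "10"}, {"name": "Bob", "amount": "20"}]

predicted = group_cells_by_row(cells)
print(predicted)                                 # [{'name': 'Bob', 'amount': '20'}]
print(record_accuracy(predicted, true_records))  # 0.5
```

Cell level accuracy is 100% here, yet half of the real records are lost. Only a measurement against ground truth for the final records, like `record_accuracy`, can see that.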

Although this example is simple and easy to understand, there are much more complicated and nuanced situations that can occur in the stage between your machine learning algorithm’s output and your final result.

Always cough up the time and cash for real ground truth data to measure accuracy against, or you will get burned.

Conclusion

Measuring accuracy is hard.

Many of the best aspects of machine learning systems, like their ability to generalize to new, never-before-seen examples, also present a massive headache for us to measure.

Many of the problems with measuring accuracy arise from the beneficial properties of machine learning:

The problem of training data being different from real world data comes directly from how good machine learning models are at generalizing, so we keep pushing the boundaries of how much we want them to do so, right to their limit.

The problem of caring about different failure cases comes directly from the fact that our models are capable of operating in complex environments with intelligent behaviour, so our expectations of them keep getting more nuanced and specific.

The problem of noise in the measurement comes directly from the fact that our models are imprecise and approximate, but this is also what allows them to generalize.

The problem of multiple pipeline metrics comes from the predictably successful behaviour of machine learning algorithms, allowing us to join them together into increasingly complex sequences and pipelines.

The problem of multiple metrics of granularity comes from the ability of our models to operate on complex, highly multidimensional problems.

The problem of only having ground truth for an intermediate step comes from the flexibility of machine learning to be applied to specific, isolated problems and then be pieced into larger, more traditional computing systems.

Measuring accuracy is hard largely because machine learning algorithms are so good at what they do — being intelligent.

We rarely talk about measuring the accuracy of people, because we acknowledge that it’s very hard to define in advance what correct behaviour is.

Increasingly, machine learning models are going to be subject to the same constraint — accuracy will become even more muddied and more difficult to measure.

Increasingly, we are applying machine learning in environments where accuracy is hard to even define, let alone measure — think about algorithms designed to create music.

How do you measure accuracy on that?

And yet, we have now created algorithms that are pretty good at generating enjoyable music.

In Part 3 of this series of blog articles, I will be discussing some of these more interesting, more nuanced problems in measuring accuracy.

What do you do if there is no off-the-shelf metric to measure accuracy for your problem?

What do you do when you can’t measure the outcome you actually care about?

What if you don’t have any ground truth data?

What if you can’t even define what accuracy is in the context of your problem?

In Part 4 of this series of blog articles, I will try to present a framework for measuring accuracy better: how to measure, when to measure, and what to measure.

Most importantly, I will show that how you measure accuracy will depend on what you want to do with that accuracy number.

Whether it’s for early stopping or threshold calibration, for Bayesian optimization of a model’s parameters, for making iterative improvements to a model, for understanding the model’s failure cases, for communicating the performance to managers, or for communicating the performance to customers, there are different ways of measuring for different situations.

There is no one-size-fits-all solution.

Originally published at www.



