The man in a suit and a straw hat

That seems abnormal.

This is a common situation in anomaly detection, arising in use cases like predictive maintenance.

In data scientists’ wording, it is sometimes not a single feature that signals an anomaly, but a combination of multiple features.

We should not look at the picture piece by piece, but step back to see it in its entirety.

There are different ways to detect an anomaly.

The most common approach is to build a profile of normal behavior, and then classify everything that deviates from it as abnormal.

Here, we will look into a rather different kind of model, one that isolates anomalies directly instead of profiling normal data — Isolation Forest.

The word “isolation” suggests that we can find anomalies through how easily a data point can be separated, or isolated, from the rest.

Intuitively, the more abnormal a data point is, the more isolated it is from the rest.

In real-world applications, most datasets have multiple dimensions.

For example, for predictive maintenance, relevant features can be the machine’s vibration, sound, and so on.

To simplify, we consider one-dimensional data here, shown as a few blue circles.

A human can easily see that the most isolated point, the one to the far right, is an anomaly.

But how does a machine go about finding it? Let’s use needles as an illustration.

Suppose someone is flying above, randomly dropping needles that land standing upright on the line.

And there is an equal chance for a needle to land at any position between the leftmost and the rightmost sample.

If the needle lands between 3 and 4, it isolates 4.

If the needle lands between 1 and 2, it isolates 1.

If the needle lands between 2 and 3, it does not isolate any, and thus we continue throwing more needles until at least one sample is isolated.

We repeat the process recursively until all samples are isolated.

For each sample, we count the number of needles used until its isolation.

And this number is what we call an anomaly score.

Since the landing position of each needle is uniformly distributed, the more distant a sample is from the rest, the more likely it is to be isolated early, and hence the fewer needles it tends to need.
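The needle experiment can be simulated directly. The sketch below is an illustrative implementation in plain Python, with made-up sample values: it throws needles uniformly over the data range and records, for each sample, how many needles had been thrown by the time that sample first sat alone between two cuts.

```python
import bisect
import random

def needle_scores(samples, seed=0):
    """Throw needles uniformly over [min, max] of the samples. A sample's
    anomaly score is the number of needles thrown by the time it first
    ends up alone in an interval between neighboring cuts (or between a
    cut and a data boundary). Fewer needles means more anomalous."""
    rng = random.Random(seed)
    lo, hi = min(samples), max(samples)
    cuts = []                              # needle positions so far, kept sorted
    scores = {x: None for x in samples}
    needles = 0
    while any(v is None for v in scores.values()):
        needles += 1
        bisect.insort(cuts, rng.uniform(lo, hi))
        for x in samples:
            if scores[x] is None:
                i = bisect.bisect_left(cuts, x)
                a = cuts[i - 1] if i > 0 else lo
                b = cuts[i] if i < len(cuts) else hi
                # isolated once it is the only sample in its interval
                if sum(1 for y in samples if a <= y <= b) == 1:
                    scores[x] = needles
    return scores

# The distant point (10.0) tends to be isolated after very few needles
print(needle_scores([1.0, 2.0, 3.0, 10.0], seed=42))
```

Averaged over many random seeds, the isolated point on the right receives a clearly smaller score than the points in the cluster.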

Based on the anomaly scores from the existing dataset, we could define a threshold, such that any future data with an anomaly score below this threshold will be classified as an anomaly.
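As a toy illustration of that rule (the sample names, scores, and threshold below are made up), classification is just a comparison against the threshold:

```python
# Hypothetical anomaly scores from the needle experiment:
# fewer needles to isolation means more anomalous.
scores = {"sample_1": 12, "sample_2": 9, "sample_3": 11, "sample_4": 2}
threshold = 4  # chosen by inspecting the score distribution

anomalies = [name for name, score in scores.items() if score < threshold]
print(anomalies)  # only sample_4 falls below the threshold
```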

By the end of the experiment, we would have split all samples apart, each with an assigned anomaly score.

As shown in a “tree” below, those with smaller anomaly scores are closer to the root of the tree, since they are isolated at an early stage.

Those with larger scores are further apart from the root, as they are isolated later.

Therefore, such a tree characterizes the isolation status of all samples, showing roughly how abnormal a specific sample is compared to the others.

Now, before going to the “forest”, let’s pause for a moment on an essential concept in machine learning — “variance”.

You might have come across the notion of Bias-Variance tradeoff.

But we will skip “bias” for now.

What might be the problem with using this simple model in practice? Real-world data we have not yet observed could follow different patterns.

In other words, the anomaly detection threshold derived from the sample data might not work out well for the population data.

That’s when we have a problem of high variance, or as we often put it, the model overfits to the training data.

One solution is to build multiple trees instead of relying on a single one.

That’s where the term “forest” comes in.

In the context of the needle experiment, we would repeat the same procedure many times.

Since there is randomness in each experiment, averaging the anomaly scores from the different trees gives a final model that is better at capturing anomalies in the population.
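For one-dimensional data, one “tree” of the needle experiment amounts to recursively cutting at a random position and keeping the side that contains the sample; averaging the resulting isolation depths over many trees gives the forest’s score. The sketch below is a simplified, illustrative version of that idea, not the exact Isolation Forest algorithm:

```python
import random

def isolation_depth(x, data, rng):
    """Depth at which x becomes isolated by recursive uniform random cuts."""
    depth = 0
    while len(data) > 1:
        cut = rng.uniform(min(data), max(data))
        data = [y for y in data if (y < cut) == (x < cut)]  # keep x's side
        depth += 1
    return depth

def forest_depth(x, data, n_trees=200, seed=0):
    """Average isolation depth of x over many independent random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(x, data, rng) for _ in range(n_trees)) / n_trees

points = [1.0, 1.1, 1.2, 1.3, 5.0]
# The outlier at 5.0 is isolated at a much shallower average depth
print(forest_depth(5.0, points), forest_depth(1.1, points))
```

Averaging over trees is what reduces the variance: any single tree may isolate a normal point early by chance, but such accidents wash out across the forest.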

With an isolation forest and a dataset containing only a few anomalies, we can derive detection rules that let us say something about new sensor values, new financial transactions, or perhaps about the man in a suit and a straw hat.
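In practice, a ready-made implementation such as scikit-learn’s `IsolationForest` handles multi-dimensional data and the score aggregation for us. A minimal sketch, assuming scikit-learn and NumPy are installed and using synthetic data (twenty readings around 1.0 plus one outlier at 5.0):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: twenty normal readings near 1.0, plus one outlier at 5.0
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 0.1, size=(20, 1)), [[5.0]]])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)

scores = model.score_samples(X)   # lower score = more anomalous
labels = model.predict(X)         # -1 flags anomalies, 1 flags normal points
print(int(scores.argmin()))       # the outlier gets the lowest score
```

New sensor values or transactions can then be checked the same way, by passing them to `model.predict`.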