# Outlier Detection with Extended Isolation Forest

Eryk Lewinson · Mar 10

Half a year after my first article on anomaly detection, one of its readers brought to my attention a recent improvement to the Isolation Forest algorithm, namely the Extended Isolation Forest (EIF), which addresses major drawbacks of the original method.

In this article I give a quick reminder of the original IF algorithm, describe its potential problem, and show how EIF handles it.

At the end, I present a Python example using both algorithms and compare their performance.

## 1. Isolation Forest

The Isolation Forest algorithm utilises the fact that anomalous observations are few and significantly different from ‘normal’ observations.

The forest is built on the basis of decision trees, each of the trees having access to a sub-sample of the training data.

In order to create a branch in the tree, first a random feature is selected.

Afterwards, a random split value (between the min and max value) is chosen for that feature.

If the given observation has a lower value of this feature than the split value, it follows the left branch; otherwise, the right one.

This process continues until a single point is isolated or a specified maximum depth is reached.
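To make the procedure concrete, here is a minimal, illustrative sketch of the isolation step (my own simplified code, not the library implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def isolation_depth(x, X, depth=0, max_depth=8):
    """Return the depth at which observation x is isolated
    from the sub-sample X by random axis-aligned splits."""
    if len(X) <= 1 or depth >= max_depth:
        return depth
    feature = rng.integers(X.shape[1])           # pick a random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                 # cannot split further
        return depth
    split = rng.uniform(lo, hi)                  # random split value in [min, max]
    side = X[:, feature] < split
    branch = X[side] if x[feature] < split else X[~side]
    return isolation_depth(x, branch, depth + 1, max_depth)

# A cluster of 'normal' points plus one far-away outlier:
X = rng.normal(0, 1, size=(256, 2))
outlier = np.array([8.0, 8.0])
normal = np.array([0.0, 0.0])

# Average depth over many random trees -- the outlier isolates sooner.
d_out = np.mean([isolation_depth(outlier, X) for _ in range(100)])
d_norm = np.mean([isolation_depth(normal, X) for _ in range(100)])
print(d_out < d_norm)  # → True
```

The outlier lies outside the bulk of the data, so a random split tends to separate it almost immediately, while a central point survives many splits.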

Partitioning algorithm. Source: [1]

In principle, outliers are less frequent than regular observations and are different from them in terms of values (they lie further away from the regular observations in the feature space).

That is why, by using such random partitioning, they should be identified closer to the root of the tree (shorter average path length, i.e., the number of edges an observation must pass in the tree going from the root to the terminal node), with fewer splits necessary.

The anomaly score is created on the basis of all trees in the forest and the depth the point reaches in these trees.
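The score follows the formula from the original Isolation Forest paper, s(x, n) = 2^(−E[h(x)] / c(n)), where E[h(x)] is the average depth over all trees and c(n) normalises it by the expected path length in a sample of size n. A small sketch:

```python
import numpy as np

def c(n):
    """Average path length of an unsuccessful BST search, used to
    normalise depths (from the original Isolation Forest paper)."""
    if n <= 1:
        return 0.0
    H = np.log(n - 1) + 0.5772156649  # harmonic number approximation
    return 2 * H - 2 * (n - 1) / n

def anomaly_score(mean_depth, sample_size):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); scores close to 1 flag
    anomalies, scores around 0.5 and below indicate 'normal' points."""
    return 2 ** (-mean_depth / c(sample_size))

# Shallow average depth -> score close to 1 (likely anomaly):
print(round(anomaly_score(2.0, 256), 2))   # → 0.87
# Deep average depth -> score well below 0.5 (likely normal):
print(round(anomaly_score(12.0, 256), 2))  # → 0.44
```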

## 2. Isolation Forest’s Problem

I believe the best way to understand the issue is to see it in an example.

Motivation for EIF. Source: [1]

In the left picture we can see data sampled from a multivariate normal distribution.

Intuitively, we would assume that the anomaly score assigned to the observations would increase radially from the central point of the distribution [0, 0].

However, this is clearly not the case, as seen in the right image.

What is more, there are also rectangular artifacts of lower score, such as the vertical one between 0 and 1 on the x-axis.

Let’s move on to the second example.

Here we see two blobs centred at points [0, 10] and [10, 0].

By inspecting the right figure we see not only the artifacts that were present before, but also two ghost clusters (approximately at [0, 0] and [10, 10]).

Motivation for EIF. Source: [1]

The reason for this peculiar behaviour originates from the fact that the decision boundaries of the Isolation Forest are either vertical or horizontal (a random value of a random feature), as seen in the picture below, where the authors plot the branch cuts generated by IF during the training phase.

We see that the branches tend to cluster where the majority of the points are located.

But as the lines can only be parallel to the axes, there are regions that contain many branch cuts yet only a few or single observations, which results in improper anomaly scores for some of the observations.

An example might be points around [3, 0] (many branch cuts) and [3, 3] (few cuts).

Branch cuts generated during training of IF. Source: [1]

## 3. Extended Isolation Forest

Analysis of the Isolation Forest’s drawback leads to the conclusion that the problem is caused by the exclusively horizontal/vertical branch cuts.

Extended Isolation Forest addresses that issue by approaching the problem a bit differently.

Instead of selecting a random feature and then a random value within the range of the data, it selects:

- a random slope for the branch cut
- a random intercept chosen from the range of available values in the training data

These are the terms (slope/intercept) you most likely recall from simple linear regression (y = ax + b).
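A single extended branch cut can be sketched like this (an illustrative implementation: the slope is a random direction vector and the intercept a random point in the data’s bounding box):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cut(X):
    """One EIF-style branch cut: a random slope (direction) and a
    random intercept drawn from the range of the training data."""
    dims = X.shape[1]
    normal = rng.normal(size=dims)       # random slope of the cut
    normal /= np.linalg.norm(normal)
    # Random intercept point inside the bounding box of the data:
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
    # Points on one side of the (hyper)plane go left, the rest right:
    left_side = (X - intercept) @ normal < 0
    return X[left_side], X[~left_side]

X = rng.normal(0, 1, size=(100, 2))
left, right = random_cut(X)
print(len(left) + len(right))  # → 100 (every point lands on one side)
```

Because the cut is defined by a normal vector rather than a single feature, the same code works unchanged in any number of dimensions.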

Let’s look at a visual example! From the image below we can see the main difference from the original IF algorithm: cuts that are not parallel to the axes.

EIF partitioning algorithm. Source: [1]

Extended Isolation Forest generalizes well into higher dimensions, where instead of straight lines we are dealing with hyperplanes.

For a deeper dive into N-dimensional generalization, please refer to [1] for a very approachable explanation.

Let’s wrap up the theoretical explanation by looking at the difference in the anomaly score maps generated by IF/EIF.

In the images below we see that the anomaly score spreads out from the data clusters radially and there are no artifacts/ghost clusters visible.

An extra feature captured by the EIF is the higher anomaly score region directly in-between the two clusters (where they kind of link).

This region can be considered as close to ‘normal’ given the proximity to both clusters, but with higher score as it is far from the concentrated groups.

## 4. Example in Python

### Data

For this short exercise I use the Forest Cover dataset, downloaded from here. The dataset contains 286,048 observations and 10 features. The observations are labeled, so we know up front which ones are anomalous (0.9% of observations are anomalous).

Another aspect of the comparison is speed, as the authors of [1] state that there is no significant decrease in speed with Extended Isolation Forest.

### Isolation Forest in sklearn

In all models I try to use the same settings:

- number of trees in the forest = 100
- maximum number of samples to draw for estimating each tree = 256
- I know up front that there are 0.9% anomalies in the dataset, and I use this percentage to select the highest anomaly scores

The forest is able to correctly identify 8.7% of the anomalies in the dataset.
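With those settings, the sklearn model looks roughly like this. Since the Forest Cover data is not bundled here, a synthetic stand-in with ~0.9% injected anomalies is used; everything about the data below is an assumption, only the model settings come from the text:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Synthetic stand-in for the Forest Cover data: 'normal' observations
# plus roughly 0.9% clearly separated anomalies.
normal = rng.normal(0, 1, size=(2000, 10))
anomalies = rng.normal(6, 1, size=(18, 10))
X = np.vstack([normal, anomalies])
y = np.r_[np.zeros(2000), np.ones(18)]   # 1 marks an anomaly

clf = IsolationForest(
    n_estimators=100,     # number of trees in the forest
    max_samples=256,      # sub-sample size per tree
    contamination=0.009,  # known fraction of anomalies
    random_state=42,
)
pred = clf.fit_predict(X)   # -1 for outliers, 1 for inliers
flagged = pred == -1

# Fraction of true anomalies that were flagged:
recall = (flagged & (y == 1)).sum() / (y == 1).sum()
print(round(recall, 2))
```

On well-separated synthetic data the recall is high; on the real dataset the anomalies overlap the normal class far more, which is why the reported figure is only 8.7%.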

Performance of sklearn’s IF

### Isolation Forest in eif

By setting ExtensionLevel to 0, I am estimating a regular Isolation Forest. First of all, as of now there is no way of setting the random state for the model, so running it multiple times might yield different results.

Also, the eif implementation does not offer many parameters to configure.

Another thing is that the model predicts anomaly scores, but does not automatically identify which observations are considered outliers.

To identify the anomalies, I sort the anomaly scores and retrieve the indices of the 0.9% of observations with the highest scores.
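The selection step looks roughly like this. Synthetic scores stand in for the model’s output here, since eif only returns the scores (higher means more anomalous) and leaves the thresholding to you:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical anomaly scores, as a stand-in for the model's output:
scores = rng.uniform(0.3, 0.5, size=1000)
scores[:9] = rng.uniform(0.7, 0.9, size=9)  # plant 0.9% obvious anomalies

# Flag the top 0.9% of scores as anomalies:
n_outliers = int(np.ceil(0.009 * len(scores)))
outlier_idx = np.argsort(scores)[-n_outliers:]  # indices of highest scores
print(sorted(outlier_idx))  # → [0, 1, 2, 3, 4, 5, 6, 7, 8]
```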

Performance of eif’s IF

This implementation of Isolation Forest turns out to perform worse than sklearn’s: it only captures 6.3% of the anomalies in the dataset. What is more, due to the lack of a random state, the performance changes each time we run the model.

### Extended Isolation Forest in eif

This part is very similar to the vanilla Isolation Forest case (eif implementation), with the difference being the ExtensionLevel.

To work at the fully extended level, I set the level to 9 (number of dimensions – 1).
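Loosely speaking, the extension level controls how many coordinates of a branch cut’s normal vector are allowed to be non-zero. A small sketch of that idea (my own illustration of the concept, not the eif source code):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_cut_normal(dims, extension_level):
    """Normal vector of a branch cut. At extension_level = 0 only one
    component is non-zero (axis-aligned cuts, i.e. vanilla IF); at
    extension_level = dims - 1 all components are free (fully extended)."""
    n = rng.normal(size=dims)
    # Keep (extension_level + 1) randomly chosen coordinates, zero the rest:
    keep = rng.choice(dims, size=extension_level + 1, replace=False)
    mask = np.zeros(dims, dtype=bool)
    mask[keep] = True
    n[~mask] = 0.0
    return n

# Extension level 0 in 10 dimensions: exactly one active coordinate.
v0 = random_cut_normal(10, 0)
print(np.count_nonzero(v0))  # → 1

# Fully extended (level 9): all ten coordinates can be non-zero.
v9 = random_cut_normal(10, 9)
print(np.count_nonzero(v9))  # → 10
```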

As can be seen in the results, the model fails to identify a single anomaly in the dataset.

Unfortunately, I do not have an explanation for this; if anyone knows what the issue might be, please let me know in the comments and I will update this post.

Performance of EIF

## Conclusions

The Extended Isolation Forest algorithm certainly is interesting and worth exploring further.

It easily overcomes the limitations of the original model on a set of artificial examples; however, there seems to be some kind of problem when applying it to a real-life dataset.

What is more, the current eif implementation is nowhere near as fast as the sklearn one. On my MacBook Pro, the sklearn IF took 14 seconds to train, while the eif implementation took roughly 10 minutes.

I really hope the algorithm will be further improved and will serve as a good tool for identifying anomalies.

As always, you can find the code used for this article on my GitHub.

In case you have any questions or feedback, please let me know in the comments.

## References

[1] S. Hariri, M. Carrasco Kind, R. J. Brunner, Extended Isolation Forest (2018), arXiv:1811.02141