How to use it?”Now we have the clusters…How can we detect anomalies in the test data?The approach I’ve followed to classify the point as anomalous or not is the following:1 — Calculate the distance from the new points to all the Core points (only the Core points, since they are the ones actually defining the clusters) and look for the minimum (distance to the closest neighbor inside a cluster).
2 — Compare the distance to the closest neighbor inside a cluster with eps, since this is the limit between two points to be consider neighbors, this way, we find if any of the Core points are actually neighbors with our test data.
3 — If the distance is larger than eps the point is labeled as anomalous, since it has no neighbors in the clusters.
Gaussian mixture modelsProbabilistic model that assumes all the data points are generated from a mixture of a finite number of gaussian distributions.
The algorithms try to recover the original gaussian that generated this distribution.
To do so it uses the expectation-maximization (EM) algorithm, which initialize a random of n initial gaussian distribution and then tweaks the parameters looking for a combination that maximizes the likelihood of the points being generated by that distribution.
Figure obtained from Angus Turner’s blog: “Gaussian Mixture Models in PyTorch”One of the problems of Gaussian Mixture Models is that the number of clusters needs to be specified, another possibility is to use Variational Bayesian Gaussian Mixture, to avoid this problem.
Of course, just as K-Means, since the initialization of the clusters is random we can end up with a local minimum that is not optimal for our problem.
That is something that can be solved with multiple executions and then creating an average of the probabilities.
However, this solution is not optimal if the model needs to be put into production, I’m still looking for the best approach to solve this problem when a model needs to be deployed in a streaming environment.
Variational Bayesian Gaussian MixtureI don’t want to get into much detail here, there’s the scikit-learn page with the full explanation for that.
But this variation is worth mentioning.
The idea behind this model is similar to Gaussian Mixture, however, the implementation is different, here, instead of EM, variational inference algorithm is used.
Here, only a maximum number of clusters needs to be specified, the algorithm then can find the actual number of clusters and set the weight of the non-relevant ones very close to zero.
Of course this alternative is not perfect either, there are many hyperparameters to chose, more than in Gaussian mixture actually.
One of the most important is the weight_concentration_prior, which will largely affect the number of effective clusters you end up with.
How can we detect anomalies in the test data?Once the algorithm it’s trained and we get new data we can just pass it to the model and it would give us the probability for that point to belong to the different clusters.
Here, a threshold can be set, to say that if the probability is below that value the point should be consider an anomaly.
In the case of Bayesian Gaussian Mixture there is an important thing to keep in mind: Not all clusters should be considered, remember that the algorithm disregards non important clusters giving them a weight close to zero (they are not removed, but you can know which ones should be removed), what I’ve done in the past is check the probability of the point belonging only to the important clusters, to do that I’m setting a threshold for the cluster weights, to remove the non-important ones.
Why not K-Means?While K-Means is maybe the best-known and most commonly used clustering algorithm for other applications it’s not well suited for this one.
The main reason for this is that it’s only well suited when clusters are expected to have quite regular shapes, as soon as this is not fulfilled the model is not able to successfully separate the clusters.
Another reason is that all points are fitted into the clusters, so if you have anomalies in the training data these point will belong to the clusters and probably affect their centroids and, specially, the radius of the clusters.
This can cause you to not detect anomalies in the test set due to the increase in the threshold distance.
Another posibility is that you even form a cluster of anomalies, since there is no lower limit for the number of points in a cluster.
If you have no labels (and you probably don’t, otherwise there are better methods than clustering), when new data comes in you could think it belongs to a normal-behavior cluster, when it’s actually a perfectly defined anomaly.
Another disadvantage in this case is the need to specify the number of clusters a priori, we’ve already discussed that there are some parameters that are not easy to tune in the other algorithms, but I find this one to be specially tricky.
Since the data we can change with time, the number of clusters can also vary, and once we have deploy our model into production there is no easy way to decide that other without human exploration.
How can we detect anomalies in the test data?Not everything is bad for K-Means, it actually is the simplest case for the testing phase, since we have the centroids of the clusters and the shape is expected to be quite regular we just need to compute the boundary distance for each cluster (usually it’s better not to choose the maximum distance to the centroid, in case we have outliers, something like the 95th or 99th percentile should work, depending on your data).
Then, for the test data the distance to the centroids is computed.
This distance is then compared with the boundary of each cluster, if the point doesn’t belong to any cluster (distance > boundary) it gets classified as an anomaly.
To sum upI’ve presented here some clustering algorithms and explain how to use them for anomaly detection (some of them being more successful than others), obviously these are not the only methods, and I might be bias towards some of them based on the data I’ve dealt with.
I really think DBSCAN and (Bayesian) Gaussian Mixture models are the most useful clustering algorithms for this application.
If you’re getting started with anomaly detection it’s worth it to find out more about them, if they aren’t useful to you now at least is something new you’ve learnt.
Some interesing links:DBSCAN: What is it?.When to Use it?.How to use it.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular unsupervised learning method utilized…medium.
comHow DBSCAN works and why should we use it?First of all, this is my first story on medium, then sorry if I’m doing something wrong.
Secondly, I’m not fluent in…towardsdatascience.
Gaussian mixture models – scikit-learn 0.
2 documentationA Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a…scikit-learn.
orgIf there are more clustering algorithm that you’ve found useful for anomaly detection and I haven’t mentioned them please let me know, I would love to expand this list!Feel free to contact me through LinkedIn if you want to chat some more!.. More details