Don’t make this mistake when clustering time series data!María García GumbaoBlockedUnblockFollowFollowingMar 9Disclaimer: This is not original work, the aim of this post is to spread and make more accesible the content of the paper “ Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Research” by Eamonn Keogh & Jessica Lin.

Recently, I’ve started doing some research in how to apply clustering techniques to time series and I was surprise to find very little posts about the topic (specially comparing it to the number of posts I found when I wanted to learn about LSTMs).

Luckily, I found this paper from EAmonn Keogh & Jessica Lin, which is really interesting (and I encorage you to read), but for those of you who don’t have time (or don’t feel like it) I decided to sum it up in a post.

There are two types of clustering when dealing with time series data:Whole clustering: When you have several time series which can come from different machines and want to compare them.

Subsequence clustering: When the original data is one long time series that needs to be broken into parts to do clustering on those parts.

This papers focuses on the second type of time series clustering, and makes the disruptive claim that clustering of time series subsequences is meaningless!Specifically they talk about subsequences obtain by a sliding windows, they make a note that other methos for obtaining this subsequences can produce successful results and even propose a new method (which I won’t get into here).

What’s their proof?To proof this claim they perform a series of experiments using different clustering algorithms and different datasets to make sure the difference is only made by the subsequencing algorithm.

What they are doing is generate k clusters from the subsequences of a given dataset and k clusters from the subsequences of a random dataset.

Once this is done they compare the clusters obtain in the two cases and observe that this clsuters are similar, even though the data is not.

This doesn’t happen using a whole clustering approach, as it can be seen in this figure from the paper:Image obtained from the original paperThe way to measure the “cluster meaningfulness” (z axis) is obtaining a value for the stability of the clusters (similar clusters for different runs with different initial centroids) and dividing it by the difference of the clusters obtain for the different datasets (one of them being random).

If this value is close to zero it means that the clusters are stable and far from the clusters generated for the random dataset.

This clusters can be said to be meaningful.

However, if this is close to one it means that the differences between different runs over the same data are similar to the differences with the clusters from the random dataset, this doesn’t make sense, since clusters generated from different datasets shouldn’t be similar; a value of 1 means that the clusters are not meaningful at all.

As we can see in the picture this is exactly what happens when doing subsequence clustering, the clusters obtained are not meaningful.

Since this happens for many different algorithms and datasets they conclude this is a consequence of the way the subsequence extraction is done.

But why?When studying the results obtain in this clusters they found that when the length of the subsequences are much smaller than the original length all cluster centers are found to have a sinusoidal shape.

The cluster center is the average value of all the subsequences in the cluster, but whatever the shape of these subsequences is the result is a perfect sine wave.

In the paper they show this with an example, I won’t explain it in depth here, but even though the shapes of the subsequences looks nothing like a sine wave, the results after clustering are the following:Image obtain from the original paperConclusionWe need to be careful when doing clustering over subsequences of time series data.

This has proven that a sliding window technique for obtaining the subsequences yields meaningless clusters, even though this technique was supposed to be usefull and definitely well known (it had been used in many published papers).

There are other ways of obtaining this subsequences and some of them can give optimal results, however, we always have to keep an eye on the clusters generated, to make sure the centers make sense with our data and they are representated of behaviours really present on the dataset.

If you want to understand this in more detail I recommend you read the original paper, is all very clear and well explained, the aim of this post was to sum it up so I didn’t want to get into much detail.

Feel free to ask me any doubts you may have, of course, but remember I haven’t done the original work, just read it and made most I could out of it.

Feel free to contact me through LinkedIn if you want to chat some more!.