Profiling my Favorite Songs on Spotify through clustering

We could group songs with similar characteristics together, and profile each cluster.

One type of clustering method is K-means clustering, which is what I will be using to analyse my songs.

For clustering, we want the points in the same cluster to be as close as possible.

We also want the clusters to be as far from each other as possible.

This makes each cluster look compact while being spaced out from each other.

Here is a visualization of what clustering looks like for 4 clusters.

The green dots represent the cluster centroids (the centres of the clusters).

K-means clustering taken from http://shabal.…html

Because clustering relies on distance, the scale of our data will affect the results.

For example, suppose we want to cluster by height, from 1.5m to 1.9m, and weight, from 60kg to 80kg.

The points then spread across the height axis by 0.4 and across the weight axis by 20.

This means that weight will be dominant in determining the clusters.

We can standardize the range of the data so that all the features have a comparable influence on the result.
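To make the height and weight example concrete, here is a minimal sketch using made-up values (the numbers and the names `people`, `raw_range` and `scaled` are illustrative assumptions, not data from this post). It shows the raw spread of each feature and checks that, after scikit-learn's `StandardScaler`, both columns end up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (height in m, weight in kg) values, for illustration only
people = np.array([
    [1.5, 60.0],
    [1.9, 62.0],
    [1.6, 80.0],
    [1.8, 70.0],
])

# Raw spread (max - min) of each feature: about 0.4 for height, 20 for weight
raw_range = people.max(axis=0) - people.min(axis=0)

# Rescale each column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(people)

print("raw range:  ", raw_range)
print("scaled mean:", scaled.mean(axis=0).round(6))
print("scaled std: ", scaled.std(axis=0).round(6))
```

After scaling, a difference of one standard deviation in height counts as much as a difference of one standard deviation in weight when distances are computed.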

import numpy as np
from sklearn.preprocessing import StandardScaler
cluster_features = ['acousticness', 'danceability', 'instrumentalness', 'energy', 'speechiness']
df_cluster = df_recent[cluster_features]
X = np.array(df_cluster)
# standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

After getting an idea of what clustering does, how many types/groups of songs do we listen to?

One way is to make an educated guess based on your own knowledge.

If you listen to all types of music from EDM to hip hop to jazz and many more, you can give a higher number like… 7 maybe?

Because K-means clustering requires us to specify the number of clusters we want, we can set k=7, where k is the number of clusters.

Another way is to use the help of the elbow method to determine the number of clusters.

In the elbow method, we can perform clustering for a range of set cluster numbers, e.g. k=1, k=2, …, k=9, k=10.

For each k, we take each point, measure its squared distance to its cluster centroid, and sum them up.

This is called the sum of squared distances (SSD).

SSD measures how close the points are to their cluster centroids.

Therefore, the smaller the SSD, the closer the points in the same cluster are.
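As a small sanity check of this definition, the sketch below computes SSD by hand and compares it with the `inertia_` attribute that scikit-learn's `KMeans` exposes. The data is random and only stands in for the scaled song features; `points`, `assigned_centroids` and `ssd` are names made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points standing in for the scaled song features
rng = np.random.default_rng(123)
points = rng.normal(size=(60, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=123).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
ssd = ((points - assigned_centroids) ** 2).sum()

print(ssd, km.inertia_)  # the two values should agree up to rounding
```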

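from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
ss_dist = []
K = range(1, 11)
for k in K:
    km = KMeans(n_clusters=k, init='k-means++', random_state=123)
    km = km.fit(X)
    # inertia_ is the sum of squared distances of the points to their closest centroid
    ss_dist.append(km.inertia_)
plt.plot(K, ss_dist, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()

If we plot the SSD for each k, we get a curved line as shown below:

From the plot above, as k increases, SSD decreases.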

This makes sense because, with more clusters, each point is likely to have a closer centroid to be assigned to, resulting in a lower SSD.

Earlier, I mentioned that we want the points in each cluster to be as close as possible.

However, we cannot simply choose k=10 just because it gives the lowest SSD.

Imagine this.

If we choose k=N, where N is the number of songs, each song becomes its own cluster and thus SSD will be 0.

This is because the cluster centroid of each point is the point itself.
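The k=N case can be checked on a toy example. The sketch below is an illustration on eight random points (not the song data), with `songs` and `ssd_by_k` as made-up names; it fits K-means for every k from 1 to N and records the SSD.

```python
import numpy as np
from sklearn.cluster import KMeans

# Eight distinct made-up points, each standing in for one song
rng = np.random.default_rng(0)
songs = rng.normal(size=(8, 2))

ssd_by_k = {}
for k in range(1, len(songs) + 1):
    km = KMeans(n_clusters=k, n_init=10, random_state=123).fit(songs)
    ssd_by_k[k] = km.inertia_

for k, ssd in ssd_by_k.items():
    print(k, round(ssd, 4))
```

The last line printed, for k=8, should show an SSD of 0 (up to floating point rounding), which is why the lowest SSD alone cannot be the selection criterion.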

Instead, we are going to choose k such that adding another cluster decreases SSD only slightly.

This is known as the elbow method.

If we think of the curve as our arm, we get a steep slope at the beginning which suddenly becomes gentle midway.

This gives it its “elbow” shape.

Based on the elbow method, the number of clusters recommended is 4 because the line became gentle from k=4 to k=5.

However, I’ve also played around with k=5 and found that I like the clusters given.

Therefore, in this post I will be sharing the results I got for k=5 instead.
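For readers who want to reproduce this step, here is a rough sketch of fitting the final model with k=5 and attaching the cluster label to each song. It assumes a DataFrame with the five audio features; since the real `df_recent` is not reproduced here, a random stand-in called `df_demo` (a hypothetical name) is used instead, so the cluster sizes it prints are not the ones from this post.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cluster_features = ['acousticness', 'danceability', 'instrumentalness',
                    'energy', 'speechiness']

# Random stand-in for df_recent (the real data is not reproduced here)
rng = np.random.default_rng(123)
df_demo = pd.DataFrame(rng.random((200, 5)), columns=cluster_features)

# Standardize the features, as done earlier in the post
X_demo = StandardScaler().fit_transform(df_demo[cluster_features])

num_clusters = 5
km = KMeans(n_clusters=num_clusters, init='k-means++', n_init=10, random_state=123)

# fit_predict returns one cluster label (0 to 4) per row
df_demo['cluster'] = km.fit_predict(X_demo)

# Number of songs in each cluster
print(df_demo['cluster'].value_counts().sort_index())
```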

Cluster Visualization

Great, we finally have our clusters!

So what do our clusters look like?

Unfortunately, at this point we are unable to view them yet.

This is because our clusters are formed using 5 features.

If you think of each feature as a dimension, you get 5-D.

As we can only view images in up to 3-D, we will need to perform a technique called dimension reduction.

This allows us to reduce from 5-D to any lower number of dimensions.

To try and explain it as intuitively as possible, dimension reduction aims to make a low dimensional set of features from a higher dimension while preserving as much information as possible.

If you wish to get a better understanding of what it does, you may watch this video about Principal Component Analysis (PCA), which is one of the methods of dimension reduction.

Let’s see how much information is preserved if we use PCA to reduce the dimension.

The blue bar shows how much information each principal component (PC) contributes to the data.

The first PC contributes 40% of information about the data.

The second and third contribute 20% each.

The red line shows the cumulative information of the data by the PCs.

By reducing from 5 dimensions to 2, 60% of the information in the data is preserved.

Likewise, if we were to reduce to 3 dimensions, 80% of the information in the data is preserved.
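The per-component and cumulative figures can be read off a fitted PCA. The sketch below assumes that "information" is measured as the share of variance explained, which is what scikit-learn's `explained_variance_ratio_` reports. It runs on random stand-in data (`X_demo` is a made-up name), so the percentages it prints will differ from the 40%, 20% and 20% above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for the 5 scaled song features
rng = np.random.default_rng(123)
X_demo = StandardScaler().fit_transform(rng.random((200, 5)))

pca = PCA(n_components=5).fit(X_demo)
explained = pca.explained_variance_ratio_   # share of variance per PC (blue bars)
cumulative = np.cumsum(explained)           # running total across PCs (red line)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} (cumulative {c:.1%})")
```

Reading `cumulative[1]` and `cumulative[2]` gives the share preserved when keeping 2 or 3 dimensions respectively.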

Now let’s see what our clusters look like on a 2-D and a 3-D scatter plot.

The points in the 2-D scatter plot overlap with each other, so it may not look like the clustering was done well.

However, if we view them from a 3-D perspective, we can see the clusters better.

Let’s try another method called t-Distributed Stochastic Neighbor Embedding (t-SNE).

t-SNE performs well for visualizing high-dimensional data.

For more details, you may read this tutorial.

In this case, a 2-D t-SNE scatter plot is able to visualize the 5 clusters nicely.

We can also roughly tell that cluster 3 is the biggest cluster and cluster 0 or 1 is the smallest.
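For reference, a 2-D t-SNE embedding like the one plotted above could be computed along these lines. This is only a sketch on random stand-in data; the `perplexity` and `random_state` values are assumptions rather than the settings used for the plot in this post.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Random stand-in for the 5 scaled song features
rng = np.random.default_rng(123)
X_demo = StandardScaler().fit_transform(rng.random((200, 5)))

# Reduce from 5 dimensions to 2; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=123)
embedding = tsne.fit_transform(X_demo)

# One (x, y) pair per song, ready for a scatter plot coloured by cluster
print(embedding.shape)
```

The two columns of `embedding` are the x and y coordinates for the scatter plot, and the cluster labels can be passed as the colour.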

Let’s see how the clusters are distributed using a bar chart.

Cluster Profiling

Now, we can make sense of the characteristics of the different clusters.

Let’s compare the distribution of features across clusters.

import pandas as pd
import seaborn as sns
# 10 evenly spaced bin edges from 0 to 1, giving 9 bins (each about 0.11 wide)
bins = np.linspace(0, 1, 10)
# create subplots for number of clusters (rows) and features (cols)
num_features = len(cluster_features)
f, axes = plt.subplots(num_clusters, num_features, figsize=(20, 10), sharex='col')
row = 0
for cluster in np.sort(df_recent['cluster'].unique()):
    df_cluster = df_recent[df_recent['cluster'] == cluster]
    col = 0
    for feature in cluster_features:
        # count songs per bin, for all songs and for this cluster only
        rec_grp = df_recent.groupby(pd.cut(df_recent[feature], bins)).size().reset_index(name='count')
        cluster_grp = df_cluster.groupby(pd.cut(df_cluster[feature], bins)).size().reset_index(name='count')
        sns.barplot(data=rec_grp, x=feature, y='count', color='grey', ax=axes[row, col])
        sns.barplot(data=cluster_grp, x=feature, y='count', color='red', ax=axes[row, col])
        axes[row, col].set_xlabel('')
        axes[row, col].set_xticklabels(range(1, 10))
        if col > 0:
            axes[row, col].set_ylabel('')
        if row == 0:
            axes[row, col].set_title(feature)
        col += 1
    row += 1
f.suptitle('Profile for each cluster')
plt.show()

Each row represents a cluster, 0 to 4, and each column represents a feature.

The grey bars represent the distribution of the feature across all songs.

This gives us a rough idea of the overall distribution of the feature.

The red bars represent the distribution of the feature within that cluster, which we can compare against the other clusters.

When we look at the distribution for each cluster, we can see that each cluster is high or low in certain features.

This is identified by whether the red bars sit on the right (high) or left (low) with respect to the grey bars.

From these characteristics we can profile them and even come up with a cluster identity.

Cluster 0 (Instrumental): High instrumentalness. Low speechiness.

Cluster 1 (Lyrical): High danceability, energy, speechiness. Low acousticness, instrumentalness.

Cluster 2 (Chill vibes): High danceability. Low energy, instrumentalness, speechiness.

Cluster 3 (Dance): High danceability, energy. Low acousticness, instrumentalness, speechiness.

Cluster 4 (Wind down): High acousticness. Low danceability, instrumentalness, energy, speechiness.

We can also profile by taking the average of the cluster feature and plotting them onto a radar chart.

This might make it easier to view the differences between all the clusters at a glance.

The readings of the radar chart are similar to the profile given above.

We can also see that clusters 2 and 4 have similar stats.

The difference is that cluster 2 is more focused on danceability and cluster 4 is more focused on acousticness.
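The values behind such a radar chart are just the per-cluster feature averages. Here is a minimal sketch of that computation, using a random stand-in DataFrame with random cluster labels (`df_demo`, `cluster_means` and `angles` are names made up for this example), so the averages it prints are not the ones from this post.

```python
import numpy as np
import pandas as pd

cluster_features = ['acousticness', 'danceability', 'instrumentalness',
                    'energy', 'speechiness']

# Random stand-in data with random cluster labels 0 to 4
rng = np.random.default_rng(123)
df_demo = pd.DataFrame(rng.random((200, 5)), columns=cluster_features)
df_demo['cluster'] = rng.integers(0, 5, size=200)

# One row per cluster, one column per feature: the values a radar chart plots
cluster_means = df_demo.groupby('cluster')[cluster_features].mean()

# Angle (in radians) of each feature axis, evenly spaced around the circle
angles = np.linspace(0, 2 * np.pi, len(cluster_features), endpoint=False)

print(cluster_means.round(2))
```

Each row of `cluster_means` can then be drawn as one polygon on a polar plot, with one vertex per entry in `angles`.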

Cluster sample

Let’s see if the songs in each cluster fit the cluster profile.

Here are 3 songs from each cluster; give them a listen and see if it makes sense:

Cluster 0 (Instrumental):
Go Back Home by FKJ
Hypnotised by Coldplay
Libertango by Astor Piazzolla, Bond

Cluster 1 (Lyrical):
September Rose by Cailin Russo
Candlelight by Zhavia Ward
BBIBBI by IU

Cluster 2 (Chill vibes):
Drop the Game by Flume, Chet Faker
Livid by ELIZA
Find a Way by Matt Quentin, Rinca Yang

Cluster 3 (Dance):
Ultralife by Oh Wonder
Little Man by The Pb Underground
Finesse (Remix) [feat. Cardi B] by Bruno Mars, Cardi B

Cluster 4 (Wind down):
Frozen by Sabrina Claudio
Break the Distance 2.0 by Ashton Edminster
Someone To Stay by Vancouver Sleep Clinic

Conclusion

We first looked at the different features over time and tried to figure out if there was a shift in music taste.

From the filtered data set, we performed our cluster analysis.

We then visualized the clusters to get a rough idea of what they look like and to check that the clustering is reasonable.

Finally, we plotted the distribution of each feature and profiled the clusters.

At the end of the day, we are able to get a better understanding of the type of songs that we like.

The collection of data can be found here and the analysis can be found here on my Github.

Thank you for reading and hope you found this interesting.

Please feel free to provide your feedback in the comments section below or reach out to me on my LinkedIn.

Hope you have a great week ahead!

References

Plotting Facet charts: https://seaborn.…html

Step by step guide to Principal Component Analysis (PCA): https://www.…com/watch?v=FgakZw6K1QQ

Tutorial on t-SNE: https://www.…com/community/tutorials/introduction-t-sne

Plotting Radar/Spider Charts: https://python-graph-gallery.…