Let’s find out!First of all, the input for this algorithm needs to be a set of vectors.
That is, all your features should be numerical, and in the same order.
If you had any categorical features, my advice would be to use one-hot encode: convert each categorical variable into a vector of n-elements: one for each possible category, all set to 0 except the one for the given category.
What the algorithm will do is initiate k random ‘centroids’ -points in the space defined by the dimensions of the dataset’s elements-, and then it will:Assign each element to the centroid closest to it.
Remap the centroid to the point lying on the average of all the elements assigned to it.
Repeat steps 1 and 2, until convergence (or a stopping condition, such as doing N iterations for a given N) is met.
In the end, each element will have been assigned to one of k clusters, such that the elements in the same cluster all lie closest to it.
Applications for K-means clusteringLike many other unsupervised learning algorithms, K-means clustering can work wonders if used as a way to generate inputs for a supervised Machine Learning algorithm (for instance, a classifier).
The inputs could be a one-hot encode of which cluster a given instance falls into, or the k distances to each cluster’s centroid.
For this project however, what we’ll be developing will be a (somewhat rudimentary) recommender system which will, given an instance, return elements appearing on the same cluster.
Using Dask’s K-means Clustering in PythonHaving defined the concepts for this project, let’s now begin the practical part.
The code is available in the Jupyter Notebook on this repository.
Processing the DataI stored the decks following this format:N card nameM another card namein 777 different .
txt files, where each line refers to a card, and the digits before the first space are the number of apparitions for that card in the deck.
In order to transform them into a more manageable format -I’ll be using a list of tuples (Int, String) for each deck, each tuple a card-, this is what we’ll do:This is what a deck looks like now.
[(4, 'Ancient Ziggurat'), (4, 'Cavern of Souls'), (4, 'Horizon Canopy'), (1, 'Plains'), (2, 'Seachrome Coast'), (4, 'Unclaimed Territory'), (4, 'Champion of the Parish'), (1, 'Dark Confidant') .
]Where each tuple represents a card (yes, those are real card names), and how many times it appears.
Since we want to map each deck to a vector, it seems intuitive to justTurn them into a list, with one element for each different card in the whole dataset.
Set each component to the number of apparitions for the corresponding card (with all the components corresponding to cards that do not appear in the deck set to 0).
To do that, let’s get all the different cards that appear in all the decks.
And now let’s use our newfound knowledge about card names to turn all decks into beautiful, consumable vectors.
Now all our decks can be easily fed into Dask’s K-Means Clustering algorithm, and we can play with the output.
We could have just used ‘binary’ vectors, and set the components to 1 if the card appears in the deck, and 0 if it doesn’t.
We could also try that later and see if we get good results too.
Applying K-means ClusteringNow that our data is all neatly mapped to the vector space, actually using Dask’s K-means Clustering is pretty simple.
Where the most important part is the n_clusters argument, which I kind of arbitrarily set to 8.
In real life, you may want to experiment with different values.
For this particular case, I know MtG has 5 different ‘colors’ for cards.
To prevent the algorithm from just clustering the cards for their colors (which it didn’t do at all anyway), I chose a number bigger than 5.
The algorithm returns the labels as a Dask Array.
I may do an article on how to use those later, but right now I didn’t want to deal with all that.
Also, the dataset is small enough for that to not matter that much, so I decided to convert it back to a list of integers.
Exploratory Analysis: Let’s see what we gotAt first I wanted to check if the results made any sense.
This was my first time using K-means Clustering on this dataset, and I wanted to make sure it had learned something valuable.
So I just checked which cards were most frequent in the decks from each cluster.
The results were, at least to me, astounding.
Here’s what I did to check.
If you’re interested in the results, I strongly encourage you to download the notebook from the GitHub project and play with it, it’s honest fun!.I just didn’t want to mix my M:tG findings with this tutorial so that readers who are into Data Science but not into the game won’t be bored.
Card Recommendations using K-Means ClusteringNow we made that sanity check, we can proceed with the actual application for all the labels we generated.
There are many ways we could have approached the recommendation problem: given a card, suggest other cards that go well with it, without using any data about the cards except which decks they appear in (that is, no cheating and asking for more data about the cards like color, price, or an expert’s opinion).
Think for a moment, how would you use the clusters data to generate the recommendations?.I’m sure you could come up with a few ideas.
If what you came up with is not what I’m about to do, please tell me in the comments!.Creativity is more fun if it’s a team effort, and I really want to see what my dear readers come up with.
Finally, here’s what I did:As you can see, I omit how many times a card appears on a given deck for this part, and just look at the relative number of apparitions for a card on a given cluster.
I then return the cards with the most similar relative apparitions (defined by Euclidian distance).
If you’re a Magic: The Gathering player, try out this code and see the results, it makes pretty good (though kinda conservative) recommendations!ConclusionK-Means clustering allowed us to approach a domain without really knowing a whole lot about it, and draw conclusions and even design a useful application around it.
It let us do that by learning the underlying patterns in the data for us, only asking that we gave it the data in the correct format.
I encourage you to play with the code here, and try making your own recommendation’s system with a different Dataset, or solving some other problem.
If you do, please show me your results!.I wanna see what you come up with.
In the future, I’d like to do this same analysis using non-professional decks.
That way, I could make a recommendations engine for casual players (like me).
I think it would be cool if it worked with almost any card, and not just 642.
If you liked this article, please follow me on Medium, and consider helping me maintain my website’s hosting.
Let me know if you found the article helpful, or if any of it sounds wrong or doesn’t work!Happy coding.