And these are the nine steps from vaping to kids!And now is time to give some more insights on the methodology followed.
The source on which I based myself was providing the SQL syntax used to extract the subreddit representation vector input data.
I will not go to that extent as anyway this is non-public data, but will comment on what we need to achieve here.
Representation VectorThe shop representation vector is built around the concept of latent semantic analysis, where instead of words and paragraphs we have shops and shared buyers.
In order to build the representation vector we need first a table of the number of shared buyers for each shops against each other shops.
As I mentioned, Shopify hosts more than 600k shops, you can figure that this could become quite big to produce, so some rules and limits will need to be enforced.
In the same way that the reddit analysis representation vector was not representing all subreddit versus all other subreddit co-occurrences (the vector only represents co-occurrence of comments with 2,000 of the top subreddits), we choose our vector to represent shared buyers for only the first thousand most “shared” shops.
We also excluded a number of extremely popular shops which are just “too” present everywhere and which are basically collinear with all other shops in term of sharing buyers.
Once we have that table, the next step is to extract a positive pointwise mutual information matrix.
The following piece of code takes as input a pandas dataframe for which the index is a shop_id and for which each column is the number of shared buyers for this list of a thousand or so most “shared” shops.
For more details on the pointwise mutual information you can also refer to Document Summarization Using Positive Pointwise Mutual Information.
Also many kudos to Max Shenfield and his Python implementation of the subreddit distance calculation.
And pointing out that the Euclidean distance of normalized vectors is a metric equivalent to the cosine similarity between those vectors.
Distance CalculationDistance metric and distance calculation selection is bounded for the major part by the choice of clustering algorithm and what it support.
We chose to go with the Minkowski distance with p=2 (which is basically the Euclidean distance).
We made also some experiments with different values of p.
Notably, p=3 seems to provide slightly better results (personal taste) although this is not necessarily supported for all the different clustering we may want to try.
Now that we have the positive pointwise mutual information matrix, the process of getting distances, and more importantly close neighbors is quite straightforward.
For example below, you can get the ten closest neighbors from a specific example point.
ClusteringHaving selected this distance metric now allows to do clustering.
Here we had to go through a lot of back and forth.
We currently have the intuition that our clusters have certain properties:There are a number of outliers, shops with no or little close enough neighbors,Not all clusters present the same “density”.
How do we know about that?.The first one is easy, by looking at the distance to the closest neighbor for all shops we can see that the closest one is quite far a a good proportion of the shops.
Those shops should not cluster but should be “removed” as outliers either as an initial step, or by the clustering algorithm if if supports outliers detection.
The second one is more of a hunch at this point resulting from a number of observations.
For example, we observed a number of clusters which were country or language specific.
Although those cluster present a certain interest, it would be better if they could further separate by center of interest.
Cluster based on specific center of interest is what we observe when we look only at US based merchants for example.
This hunch would lead us toward using a variable density based clustering algorithm and we are looking forward trying HDBSCAN in the future for that purpose.
We haven’t got time to explore it yet.
Still we tried DBSCAN which leads to some interesting results, however the implementation we use is quite slow… so for experimentation purpose we came to rely on k-means, after removing “outliers” i.
shops where the closest neighbor is further than a cut-off distance.
After clustering, we also re-classify certain clusters as “outliers” when they contain less than a certain number of shops.
We are in a way “faking” some of the DBSCAN properties using k-means.
The basic required for k-means clustering can be summarized with below code:The fact that we used 80 cluster as a “magic” number may raise some questions.
That “magic” number comes in fact from the analysis of the Calinski Harabaz Score versus the number of clusters for a range of possible number of clusters values.
Running below code we obtained the following graph and estimated the elbow at around 80.
PresentationFinally there is a need for presenting those clusters visually as in the pictures you saw in the first part of this post.
For that purpose we used a Linear Discriminant Analysis method and were interested in the first two dimensions separating the most the clusters we obtained from k-means clustering.
Obviously I’m skipping on a lot of implementation details, for example removing outliers clusters from the 80 k-means detected clusters.
But the gist of the method lies within what I exposed above.
Hope this can help you apply a similar methodology to find similarity between different concepts.
As you see this method not only yields interesting results for words in a document and subreddits classification but can also apply to similarity between shops on a SasS platform.
Let me know of other interesting applications you put the method to use!Cover picture by Ake on Rawpixel.
Originally published at thelonenutblog.
com on February 6, 2019.