Scalable Jaccard similarity using MinHash and Spark

Scalable Jaccard similarity using MinHash and SparkA simple algorithm makes it much easier to calculate similarity matrices at scale.

Schaun WheelerBlockedUnblockFollowFollowingApr 17It occurred to me a little while ago that the Jaccard similarity coefficient has probably cropped up in my work more than any other statistic except for the arithmetic mean.

If you have two sets of things (words, parts of words, attributes, categories, or whatever), you can take the number of things in the intersection of the sets and divide by the number of things in the union of the sets.

The resulting metric is a meaningful measure of similarity that has the added virtue of being pretty easily explainable to non-technical folks.

Jaccard similarity gets a little difficult to calculate directly at scale.

If you have a really large list of entity-attribute pairs, and you want an entity-by-entity similarity matrix, you basically have to do an inner join, group by entity and count, then do an outer join, group by entity and count, and then join the results of the two joins together.

If your workflow uses Spark, as mine does, that’s a whole lot of shuffling.

It’s expensive.

A while ago, a colleague pointed me to something that I feel like I should have known but didn’t: MinHash.

For each entity, randomly permute the attributes, then hash them (convert them to integers), then take the minimum.

Do that a bunch of times, then calculate the percentage of times the MinHashes from identical draws for two entities match.

We can interpret that metric the the same way we would interpret the Jaccard similarity between those two entities’ attribute sets.

So it brings the problem down from huge numbers of attributes to small numbers of hashes; but even better, it brings the problem from variable numbers of attributes — with all of the pains of key skew — to the same number of MinHashes across all entities.

Most everything from lines 36 through 52 in the following code snippet comes from Patrick Nicholson, the colleague who told me about MinHash, who adapted the hashing algorithm from Spark’s spark.



MinHashLSH implementation.

I built the join logic to turn the MinHash results into actual Jaccard similarities, and wrapped the whole thing in a function to make it more portable.

The function requires a Spark DataFrame, a string indicating the column of the DataFrame that contains the node labels (the entities between which we want to find similarities), and the column that contains the edges (the attributes we will hash).

The function outputs a data frame with the two columns of node labels — each with a suffix as stipulated by the suffixes keyword argument — and the Jaccard similarity.

The Jaccard similarity becomes more precise with more draws.

One hundred draws (the default in the code below) gives precision up to 0.


Five hundred draws gives precision up to 0.


One thousand draws gives precision up to 0.


You get the idea.

This little thing has saved me a lot of time and headaches over the last several months.

I thought it was worth sharing.

.. More details

Leave a Reply