To find out, I clustered my data (using K-Means) repeatedly, increasing the number of clusters each time and plotting the silhouette score for each cluster count.
At a high level, a silhouette score is a measurement that takes into account cohesion and separation of clusters.
The denser individual clusters are and the further apart they are from one another, the better.
A high silhouette score (closer to 1) reflects strong clustering.
The silhouette score dropped as I increased the number of clusters.
But the curve starts to level off around 6 clusters, so I chose 6 as my number of groups.
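To make cohesion and separation concrete, here is a minimal from-scratch sketch of the silhouette calculation on made-up 2D points. The helper name `mean_silhouette` and the toy data are my own assumptions for illustration; in practice a library routine such as scikit-learn's `silhouette_score` does the same job. For each point, `a` is the mean distance to the rest of its own cluster (cohesion), `b` is the mean distance to the nearest other cluster (separation), and the point's score is `(b - a) / max(a, b)`, which ranges from -1 to 1.

```python
from math import dist
from statistics import mean


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (hypothetical helper)."""
    if len(set(labels)) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(p, q))
        own = by_cluster.pop(label, None)
        if not own:
            scores.append(0.0)  # a one-point cluster scores 0 by convention
            continue
        a = mean(own)                                   # cohesion
        b = min(mean(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Two tight blobs that sit far apart (toy data, not the real titles).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the blobs
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the blobs together

good = mean_silhouette(points, good_labels)
bad = mean_silhouette(points, bad_labels)
print(f"good labeling: {good:.3f}, bad labeling: {bad:.3f}")
```

The labeling that matches the dense, well-separated blobs scores much closer to 1 than the labeling that mixes them.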
Finally, I used a dimensionality reduction technique known as NMF (Non-Negative Matrix Factorization) to partition my 42,234 titles into 6 groups (groups are formally referred to as “topics” in the context of dimensionality reduction with NLP projects).
NMF’s inner workings are out of the scope of this blog post, but you can read more about NMF here.
NMF allows us to reduce the dimensions of our matrix by going from a word-space to a topic-space.
In this situation, we compress a 20,857-dimensional word-space into a 6-dimensional topic-space.
We lose some information during this process, but gain increased interpretability.
By modeling 6 topics, I expected each topic to correspond to a cuisine.
Ok enough with all this crazy data science voodoo.
Let’s see some results!

Results

Here are the top 10 most important words for each of the 6 topics:

After I modeled my topics, I inspected them to see if they made sense in the context of the problem I am trying to solve.
It looks like topic 1 is a series of classic American lunch ingredients, with an emphasis on fat and cheese.
I named this topic Burger.
If you inspect topic 2, it is obviously related to pizza/Italian food.
I named this one Pizza.
It’s nice how specific this topic is!

Topic 3 is a little all over the place.
It’s tough to tell what the essence of this topic is from its top 10 words alone, so I looked at the top 30 and saw words such as masala, turmeric, wonton, and noodles.
I named it Asian Fusion.
Topic 4 is distinctly a Dessert topic.
Like the Pizza topic, I love how obvious this one is.
I named Topic 5 Breakfast, though an argument can be made that its essence is more about healthy-sandwich-type ingredients.
Finally, I named Topic 6 American Entree.
This topic is rather broad compared to the other ones.
Let’s use t-SNE to visualize how the titles fall into different topics:

Interesting.
There is strong separation for the Pizza, Burger, and Asian Fusion topics (pink, red, and blue).
The American Entree, Dessert, and Breakfast topics (yellow, green, and purple) all have some areas where they are distinct, but also intermingle significantly in the largest “blob”.
This means that these topics aren’t well separated from one another.
I observed this to be anecdotally true after creating a classifier on top of these topics.
My classifier takes a string as input and converts it to a vector of 6 probabilities, one for each of the 6 topics here.
Whichever topic has the highest probability is the output of the classifier.
You can try it out yourself here!

You’ll notice that the classifier makes some good predictions and some bad ones.
For example, it’ll classify “Rhubarb Pie”, “Chicken chow mein”, and “Flatbread slice with anchovies and mozzarella” correctly, but it thinks that “lobster mac and cheese” is a Burger and that “penne” is Asian Fusion.
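The last step of the classifier described above (turn 6 topic weights into probabilities, then pick the largest) can be sketched roughly like this. The function names, the normalize-by-sum rule, and the example weights are assumptions for illustration; the step that turns a title string into topic weights is left out.

```python
# Topic names in the order they were introduced in the Results section.
TOPICS = ["Burger", "Pizza", "Asian Fusion", "Dessert", "Breakfast", "American Entree"]


def to_probabilities(topic_weights):
    """Scale non-negative topic weights so they sum to 1 (assumed scheme)."""
    total = sum(topic_weights)
    if total == 0:
        # No signal at all: fall back to a uniform distribution.
        return [1 / len(topic_weights)] * len(topic_weights)
    return [w / total for w in topic_weights]


def classify(topic_weights):
    """Return the most probable topic name and the full probability vector."""
    probs = to_probabilities(topic_weights)
    best = max(range(len(probs)), key=probs.__getitem__)
    return TOPICS[best], probs


# Made-up weights for a single title, for illustration only.
label, probs = classify([0.02, 0.01, 0.00, 0.45, 0.05, 0.03])
print(label, [round(p, 2) for p in probs])
```

Because only the top probability is reported, a title whose weight is split between two similar topics can easily land in the wrong one, which fits the misclassifications above.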
Conclusions

Contrary to my expectations, r/foodporn title topics didn’t fall neatly into cuisines, but they do reflect an underlying structure to the posts.
Some groups are easily distinguishable, whereas others intermingle a lot.
The intermingling between some topics reflects that people are posting titles that span multiple topics, corresponding to foods that aren’t easy to categorize.
Given the amount of cuisine fusion and experimentation going on in the food-world today, this makes a lot of sense.
Dishes today are harder to categorize than ever before because they are increasingly made up of ingredients from around the world.
The topics that don’t intermingle also make sense.
Dessert is so distinct because people aren’t mixing “chocolate” with “ribs”, “pie” with “barbecue”, and “cookies” with “onions”.
Similarly, there are things that make Pizza and Burger distinctly themselves.
Thanks for reading my post. I hope you found it interesting! Feel free to reach out with any questions or comments.
Project Code: GitHub
Project App: Cuisine Classifier
More about me: LinkedIn