Starting from the same initial die, the new one will be weighted on the need, copy and bill faces.
Here’s what our two dice would look like:

Figure 2: Weighted dice. Left: die for the “boiler repair” topic. Right: die for the “bill copy” topic.
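To make the analogy concrete, here is a minimal sketch of a weighted die in Python. The six-word vocabulary and all the weights are made up for illustration (they are not the values shown in the figures), and `roll` is just a helper name chosen for this example.

```python
import random

# Hypothetical six-word vocabulary and weights, chosen only for illustration.
FACES = ["boiler", "repair", "cover", "need", "copy", "bill"]

# An unbiased die: every face is equally likely.
fair_die = {face: 1 / 6 for face in FACES}

# Weighted dice: most of the probability sits on each topic's key words.
boiler_repair_die = {"boiler": 0.35, "repair": 0.35, "need": 0.15,
                     "cover": 0.05, "copy": 0.05, "bill": 0.05}
bill_copy_die = {"need": 0.30, "copy": 0.30, "bill": 0.30,
                 "boiler": 0.04, "repair": 0.03, "cover": 0.03}


def roll(die, n_words, rng):
    """Roll a weighted die n_words times and return the words drawn."""
    faces = list(die)
    weights = [die[face] for face in faces]
    return rng.choices(faces, weights=weights, k=n_words)


rng = random.Random(0)  # fixed seed so the rolls are repeatable
print(" ".join(roll(boiler_repair_die, 8, rng)))
print(" ".join(roll(bill_copy_die, 8, rng)))
```

Rolling the first die mostly produces “boiler” and “repair”, while the second mostly produces “need”, “copy” and “bill”: the word order is random, but the topic is easy to spot.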

Learning the topics

Of course, we knew which faces to weight because we know that, in our world, “repair boiler” makes more sense than “copy boiler”.
However, our computer doesn’t know what makes sense and what doesn’t, yet it needs to find a way to figure it out.
The idea is that it can learn it by scanning all the messages we have in our data set.
The two most popular algorithms for LDA are variational inference and Gibbs sampling, and both are too complex to be described here.
Let me roughly sketch how we can use them.
Suppose we want to find 3 topics in our set of messages: this means we start with 3 identical unbiased dice (like the one in Figure 1).
The computer loops over our messages again and again, adjusting the weights of each die at every step based on what it sees.
To put it simply, words that appear together in a certain group of messages will be assigned more weight on the same die.
At the end of the process, the algorithm will have tweaked the dice so that they are capable of generating the messages it saw in our data set (as closely as possible).
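For the curious, the loop described above can be sketched as a heavily simplified collapsed Gibbs sampler. This is a toy under stated assumptions, not a production implementation: the function name `fit_dice`, the smoothing values `alpha` and `beta`, the number of sweeps and the six example messages are all invented for this illustration.

```python
import random


def fit_dice(messages, n_topics, n_sweeps=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler: returns one weighted die per topic."""
    rng = random.Random(seed)
    vocab = sorted({word for msg in messages for word in msg})
    # Start by assigning every word occurrence to a random die.
    assignment = [[rng.randrange(n_topics) for _ in msg] for msg in messages]
    msg_topic = [[0] * n_topics for _ in messages]  # die usage per message
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, msg in enumerate(messages):
        for i, word in enumerate(msg):
            k = assignment[d][i]
            msg_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(n_sweeps):
        for d, msg in enumerate(messages):
            for i, word in enumerate(msg):
                # Take this word occurrence out of the counts...
                k = assignment[d][i]
                msg_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # ...then score each die: how much does this message use it,
                # and how much does it already favour this word?
                scores = [
                    (msg_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * len(vocab))
                    for t in range(n_topics)
                ]
                # Reassign the word to a die drawn in proportion to the scores.
                k = rng.choices(range(n_topics), weights=scores)[0]
                assignment[d][i] = k
                msg_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1

    # Turn the final counts into face weights that sum to 1 for each die.
    return [
        {w: (topic_word[t][w] + beta) / (topic_total[t] + beta * len(vocab))
         for w in vocab}
        for t in range(n_topics)
    ]


# Hypothetical toy messages, already split into words.
messages = [
    "need boiler repair".split(),
    "boiler repair need repair".split(),
    "need copy bill".split(),
    "copy bill need bill".split(),
    "need boiler cover".split(),
    "boiler cover cover need".split(),
]
dice = fit_dice(messages, n_topics=3)
for die in dice:
    # Show the three heaviest faces of each learned die.
    print(sorted(die, key=die.get, reverse=True)[:3])
```

The sampler is random, so the exact weights (and the order of the dice) depend on the seed, and on such a tiny data set the topics will not always separate cleanly.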
This means we may end up with two dice like those in Figure 2, plus a third one looking like this:

Figure 3: This die expresses the “boiler cover” topic, as learned from the data.

By inspecting the weights we can conclude that the 3 topics in our set of messages are “boiler repair”, “bill copy” and “boiler cover”.
And we didn’t have to read anything!

Multi-topic messages

So in order to find the topics, our computer must learn how to generate the messages in our data set.
Most of the time, though, email messages (and documents in general) contain more than one topic: if we want to generate realistic messages, we must take this into account.
Using one of the dice we saw previously would only help us write monothematic messages.
We could use more than one die, but how do we choose which one to roll, and when?

Photo by Ryan Thomas Ang on Unsplash

We could flip a coin! For example, let’s say we want to write a message about two topics, so we use two dice.
We could flip a coin to decide which one to choose, then roll the chosen die and write the word we get.
The same for the next word, and so on until we decide to stop.
The resulting message will be a bit chaotic and the order of the words not coherent, but the presence of two topics should be easily detectable by looking at the words.
In this configuration, every word will have the same chance to be drawn from Topic 1 or Topic 2.
However, we could also add a weight to the coin so that it picks one topic die more often than the other: for example, we could generate a message that is 80% about “boiler repair” and 20% about “boiler cover”.
This weight can be computed by the same algorithms we mentioned in the dice-building phase: in fact, all the weights are computed at the same time.
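The coin-then-die procedure can be sketched in a few lines of Python. The dice weights below are invented for the example (faces that are not listed simply have weight zero), and `draw` and `write_message` are hypothetical helper names.

```python
import random

# Hypothetical dice with illustrative weights; unlisted faces have weight zero.
dice = {
    "boiler repair": {"boiler": 0.4, "repair": 0.4, "need": 0.2},
    "boiler cover": {"boiler": 0.4, "cover": 0.4, "need": 0.2},
}
# The biased coin: 80% "boiler repair", 20% "boiler cover".
coin = {"boiler repair": 0.8, "boiler cover": 0.2}


def draw(distribution, rng):
    """Draw one outcome from a dict mapping outcomes to probabilities."""
    outcomes = list(distribution)
    weights = list(distribution.values())
    return rng.choices(outcomes, weights=weights)[0]


def write_message(coin, dice, n_words, rng):
    """For each word: flip the coin to pick a die, then roll that die."""
    words, topics = [], []
    for _ in range(n_words):
        topic = draw(coin, rng)
        topics.append(topic)
        words.append(draw(dice[topic], rng))
    return words, topics


rng = random.Random(42)
words, topics = write_message(coin, dice, n_words=1000, rng=rng)
share = topics.count("boiler repair") / len(topics)
print(" ".join(words[:10]))
print(f"share of words drawn from 'boiler repair': {share:.2f}")
```

Over a long message, the share of words drawn from the “boiler repair” die should settle close to the coin’s weight of 0.8.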

Scaling up

The die on the left is a 20-face die, also known as an icosahedron.
In real life we won’t get very far by using only six words.
To write a message in English, we would need a 180,000-face die.
Not only that: we could have dozens of topics across all our documents, so we would need another, say, 100-face weighted die instead of a simple coin.
Luckily, this is something we can do by running LDA on a computer: building special dice, that is, estimating probability distributions.
After all, a standard die is nothing more than a uniform probability distribution over 6 different values.
A topic in English is a non-uniform probability distribution over 180,000 different values (words).
A document can be a non-uniform probability distribution over 50 different topics.
Or at least, this is what Latent Dirichlet Allocation assumes.
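In symbols, the assumption above can be written compactly. The notation is my own choice for this sketch and does not appear elsewhere in the post: $\theta_{d,k}$ is the weight of topic $k$ on the “coin” of document $d$, $\beta_{k,w}$ is the weight of word $w$ on the die of topic $k$, $K$ is the number of topics and $V$ is the size of the vocabulary.

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k} \, \beta_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w=1}^{V} \beta_{k,w} = 1
```

In words: the chance of seeing word $w$ in document $d$ is the chance of picking each die, times the chance of that die landing on $w$, summed over all the dice.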

Conclusions: what’s next?

I hope this post helped to shed more light on the assumptions behind Latent Dirichlet Allocation, which is still one of the most popular approaches to Topic Modelling.
In the next post I would like to go through a step-by-step implementation of LDA in Python using Scikit-Learn and pyLDAvis, with a section about how to create a report where every document is tagged with the assigned topics.