For instance, if we had a document with three words in it, “dog eats food”, each word would be converted into a vector (e.g., think of it as a string of numbers).
So the word “dog” may be represented by (1,0,1,1,1,0) and “eats” by (1,1,0,0,0,0).
Now you can imagine that once all words are vectorized, a computer can recognize that the vector (1,0,1,1,1,0) represents the word “dog”.
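A minimal sketch of this idea using a one-hot encoding (the vocabulary and vector values here are illustrative, not the article’s actual encoding):

```python
# Sketch: count-free one-hot vectorization of a tiny "document".
# The vocabulary and vectors are invented for illustration.
vocab = sorted(set("dog eats food".split()))  # ['dog', 'eats', 'food']

def vectorize(word, vocab):
    """One-hot vector: 1 at the word's vocabulary index, 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]

print(vectorize("dog", vocab))   # [1, 0, 0]
print(vectorize("eats", vocab))  # [0, 1, 0]
```

Once every word maps to a unique vector like this, the computer can work with numbers instead of raw text.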
The tricky thing now is getting the computer to understand meaning.
This topic is out of the scope of the current article, but I may follow up with another article on this topic.
Although NLP can successfully vectorize words so that computers can recognize them, it is extremely difficult to get computers to understand the meanings of words.
If you’re interested in this topic, read more into Deep Learning and NLP.
Topic Modelling & Latent Dirichlet Allocation

Now that all our text data has been vectorized, we begin looking for patterns in the data.
Topic modelling is perfect for this type of task: it’s a statistical modelling technique used to discover the abstract “topics” that occur in a collection of documents.
A very simple explanation of this is that it combs through a document and recognizes: 1) the most frequently appearing words and 2) words that appear next to those frequently appearing words.
The logic here is that if these words always appear together, they must form some sort of topic.
Now you’re probably wondering: how many topics does the algorithm create? Well, that’s up to you.
The art of topic modelling comes into play when you choose how many topics you want to keep in your model.
I generally look at two things: 1) the Coherence Value and 2) The Intruder test.
Let’s elaborate a bit more on the two.
The Coherence Value can be thought of as the probability of each topic being a “good” topic.
To read more about it, check out this great article on Coherence Values.
In order to choose the best fitting model, you need to qualitatively evaluate each topic using the Intruder Method.
Above, I’ve plotted the number of topics and their corresponding coherence values.
Notice a large drop after 14 topics.
The optimal model here would be 14 topics according to coherence values; however, an 8-topic model only reduces the coherence value by 1 point.
For sake of parsimony and explanatory power, I always choose to stick with the simpler model.
So what’s the Intruder Method? The Intruder test is a great follow-up to using coherence values.
Once you determine how many topics you want, you then look at the topics individually and assess them qualitatively.
In other words, you want to be asking: “What words don’t belong in these topics?” Let’s look at Topic 1 (or 0 in this case).
It’s a bit confusing because we see words like “great” and “favorite”, but also words that appear opposite in meaning like “stress” and “depression”.
This is an example of a topic that isn’t very interpretable by humans but still scores well under Latent Dirichlet Allocation (LDA).
This means I’ll have to fine-tune the model’s hyperparameters to get a better output.
Let’s look at Topic 2 (model 1); this topic is a bit clearer about what it’s getting at.
It looks like it could refer to a topic like “likeability”.
You continuously do this for the number of topics in your model, evaluating them individually, looking for words that may or may not fit in a topic.
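A rough sketch of how one might organize this check; the topics and flagged words below are invented, loosely echoing the example above:

```python
# Hedged sketch of the intruder check: a human reviews each topic's top
# words and flags the ones that don't belong; we then collect the topics
# that contain any flagged word. All topics and words here are invented.
topics = {
    0: ["great", "favorite", "stress", "depression", "relaxed"],
    1: ["friendly", "social", "happy", "talkative", "giggly"],
}

def topics_with_intruders(topics, flagged):
    """Return the ids of topics whose word lists contain a flagged word."""
    return [t for t, words in topics.items()
            if any(w in flagged for w in words)]

# Suppose a reviewer flags "stress" and "depression" as out of place:
print(topics_with_intruders(topics, {"stress", "depression"}))  # [0]
```

Topics that keep attracting intruder words are candidates for re-tuning the model.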
Creating the Recommender System

Although it may not be clear from the above, my final model produced 8 topics with some very interesting insights.
For instance, I found that cannabis consumers enjoy smoking for a few reasons: 1) some smoke because they enjoy the flavors and aromas of cannabis, 2) others because it makes them feel creative, 3) another segment because it makes them feel energized, and finally 4) the majority because it helps with pain relief.
That’s so cool! Essentially, what the topic models did was separate my data into customer segments.
If I was in the business of marketing and/or writing copy, I’d be better able to target customer segments with this information.
Anyways, back to the data science.
Using these 8 topics, I predicted how much of each strain’s review contained those topics.
Doing so gave me 8 features to separate my strains on.
In other words, I created a dataset based on how each strain (e.g., I was working with over 200 strains) differed on the topics I found (e.g., some strains reportedly provided more creative thinking abilities, while others increased energy).
Once this was complete, I was ready to create the recommender system based on similarities.
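As an illustrative sketch of that dataset (the topic names, strains, and proportions are all invented; the real model had 8 topics and 200+ strains):

```python
# Sketch: each strain becomes a row of topic proportions predicted by the
# topic model (8 features in the article; 3 shown here for brevity).
topic_names = ["flavor", "creativity", "pain_relief"]
strain_features = {
    "strain_a": [0.6, 0.3, 0.1],
    "strain_b": [0.1, 0.2, 0.7],
}

# Each strain's proportions sum to ~1, since the model allocates the whole
# review across the topics.
for vec in strain_features.values():
    assert abs(sum(vec) - 1.0) < 1e-9

# The dominant topic gives a quick read on a strain's likely segment:
dominant = {s: topic_names[vec.index(max(vec))]
            for s, vec in strain_features.items()}
print(dominant)  # {'strain_a': 'flavor', 'strain_b': 'pain_relief'}
```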
Choosing a similarity metric

Now that we have our strains described by 8 different features (i.e., topics), it’s time to choose how we recommend them.
There are a number of similarity metrics that can be used (e.g., Cosine Similarity, Euclidean Distance, Manhattan Distance).
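For example, a cosine-similarity recommender over the topic-proportion vectors can be sketched in a few lines (the strain names and values are invented):

```python
import math

# Hedged sketch: recommend the strain most similar to a given one by
# cosine similarity over topic-proportion vectors. Data is illustrative.
strains = {
    "strain_a": [0.7, 0.2, 0.1],
    "strain_b": [0.6, 0.3, 0.1],
    "strain_c": [0.1, 0.1, 0.8],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(name, strains):
    """Return the strain most similar to `name`, excluding itself."""
    target = strains[name]
    scores = {k: cosine(target, v) for k, v in strains.items() if k != name}
    return max(scores, key=scores.get)

print(recommend("strain_a", strains))  # strain_b
```

Cosine similarity compares direction rather than magnitude, which suits proportion vectors that all sum to roughly one.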
The important considerations when choosing a distance metric are: 1) how does it behave in high-dimensional space? (e.g., Euclidean distances begin to fail in high dimensions) and 2) how accurate are the resulting recommendations? This brings us to the question: how do we validate a recommender system? There are a number of ways to do so.
I think the best bang for your buck is to back-test your data.
For example, say our dataset contains data for Customer A, who purchased products A, B, C, and D.
One method of validation would be to use data from Customer A and predict what they would’ve purchased next after product A.
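That back-test can be sketched as follows; the purchase history and the naive “most co-purchased” prediction rule are invented for illustration:

```python
# Hedged sketch of the back-test: hold out a customer's actual next
# purchase, predict it from the first item, and compare.
history = {"customer_a": ["A", "B", "C", "D"]}

# Co-purchase counts learned from *other* customers (illustrative numbers):
co_purchase = {"A": {"B": 5, "C": 2, "D": 1}}

def predict_next(item, co_purchase):
    """Predict the item most often bought after `item`."""
    return max(co_purchase[item], key=co_purchase[item].get)

held_out = history["customer_a"][1]          # the actual next purchase: "B"
prediction = predict_next("A", co_purchase)  # "B"
print(prediction == held_out)  # True -> the recommender passes this case
```

Averaging this hit/miss check over many customers gives an accuracy estimate for the recommender.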
Then validate it against what they actually purchased!

Conclusion

I used topic modelling and an LDA approach to find customer segments in the emerging cannabis market.
From this, I created a recommendation system.
The most difficult part of this project was procuring and cleaning the data — something that is common in all data science projects.
If I get enough interest in the article, I’ll write a technical post where I can share the code for my MVP.
The product has since evolved and is currently being used by a number of dispensaries! Check it out here: www.co

If you would like to see a technical post, let me know in the comments!