Basic NLP on the Texts of Harry Potter: Topic Modeling with Latent Dirichlet Allocation

Greg Rafferty
Dec 6

I’m Greg Rafferty, a data scientist in the Bay Area. Feel free to contact me with any questions!

In this post, I’ll describe topic modeling with Latent Dirichlet Allocation and compare different algorithms for it, through the lens of Harry Potter.

The model consists of two tables. The first table gives the probability of selecting a particular word in the corpus when sampling from a particular topic, and the second gives the probability of selecting a particular topic when sampling from a particular document.

Here’s an example. Let’s say I’ve got these four (rather nonsensical) documents:

document_0 = "Harry Harry wand Harry magic wand"
document_1 = "Hermione robe Hermione robe magic Hermione"
document_2 = "Malfoy spell Malfoy magic spell Malfoy"
document_3 = "Harry Harry Hermione Hermione Malfoy Malfoy"

Here’s the term-frequency matrix for these documents:

[Figure: term-frequency matrix for the four documents]

Just from glancing at this, it seems pretty obvious that document 0 is mostly about Harry, a little bit about magic, and partly about wand.

In this case, we’ll plot the coherence score against the number of topics:

[Figure: coherence score versus number of topics]

You’ll generally want to pick the lowest number of topics where the coherence score begins to level off.

The difference between Mallet and Gensim’s standard LDA is that Gensim uses Variational Bayes inference, which is faster but less precise than Mallet’s Gibbs sampling. Using Mallet, the coherence score for the 20-topic model increased to 0.375 (remember, Gensim’s standard model output 0.319). It’s a modest increase, but it usually persists across a variety of data sources, so although Mallet is slightly slower, I prefer it for the improvement in coherence.

Finally, I built a Mallet model on the 192 chapters of all 7 books in the Harry Potter series. Here are the top 10 keywords the model output for each latent topic.
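
To make the toy example from earlier concrete, here is a minimal sketch of how the term-frequency matrix for the four documents could be built using only Python’s standard library. The function name `term_frequency_matrix`, the whitespace tokenization, and the sorted column order are my own choices for illustration; they are assumptions, not necessarily how the matrix in the post was produced.

```python
from collections import Counter

# The four toy documents from the example.
documents = [
    "Harry Harry wand Harry magic wand",
    "Hermione robe Hermione robe magic Hermione",
    "Malfoy spell Malfoy magic spell Malfoy",
    "Harry Harry Hermione Hermione Malfoy Malfoy",
]


def term_frequency_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i].

    Hypothetical helper: splits on whitespace and sorts the vocabulary,
    which is enough for this toy corpus but not for real text.
    """
    counts = [Counter(doc.split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, rows


vocabulary, matrix = term_frequency_matrix(documents)

# Print one row per document, one column per term.
print("document    " + " ".join(f"{term:>8}" for term in vocabulary))
for index, row in enumerate(matrix):
    print(f"document_{index}  " + " ".join(f"{count:>8}" for count in row))
```

Each row is one document and each column is one word, so the first row shows three occurrences of Harry, two of wand, and one of magic. A real corpus would also need lowercasing, punctuation handling, and stop-word removal before counting.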