First order of business: we’ll need to create a stemmer. You can create an Italian stemmer like so:

    stemmer = SnowballStemmer("italian")

If you are working in a different language, note that you can print all the languages that the Snowball Stemmer handles with this line of code:

    print(" ".join(SnowballStemmer.languages))

Now we can convert all the words to their stems with more fancy programming footwork:

    stems = [stemmer.stem(word) for word in tokens]

The line of code above basically translates to, “for every word (or token) in the tokens list, transform it into its stem, and store this new list of stems in a variable aptly called stems.”

You can also print the stems with this line of code:

    print(stems)

Now that we’ve got our stems, we’re ready for some frequency analysis!

Calculating word frequencies

Now for the part where we can attempt to analyze Dante’s Inferno based on word frequencies! I titled this section “Calculating word frequencies,” but the computer will be the only one doing math; we’ll just be writing a few succinct lines of code.
That’s the beauty of NLTK! To store a frequency distribution in a variable, we simply say:

    fdist = FreqDist(stems)

At this point, some programmers might do data visualizations, but if I’m not mistaken, repl.it doesn’t have this capability; you’ll have to download Python and the appropriate packages to your computer if you want to explore data viz.
However, we can definitely print out values! With this line of code, I can print out the 100 most common words in Dante’s Inferno:

    print(fdist.most_common(100))

Cool, a list of the most common words! What conclusions can we draw by looking at word frequencies? Or, what questions are sparked by examining these frequencies?

We might notice that the word piu (more) is used 181 times, and the stem tant- (much) is used 80 times.
These words suggest that hell is a place of extremes.
We might also notice that the stem l’altr- (the other) is used 94 times, which could lead to an investigation of duality in Dante’s Inferno.
We could examine the most common verbs, which have to do with seeing and speaking, along with the word occhi (eyes), which appears 50 times.
These words indicate the passivity with which Dante takes in the underworld.
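If you want to check the count for one particular stem instead of scanning the whole top-100 list, you can look it up by name. Here is a minimal sketch of that idea. To keep it runnable anywhere, it assumes a tiny made-up list of stems (not the real ones from Inferno) and uses Python's built-in Counter, which is the class that NLTK's FreqDist is built on, so the same lookups should work on the fdist variable from earlier.

```python
from collections import Counter

# Toy stand-in for the `stems` list built earlier.
# These values are invented for illustration, NOT real counts from Inferno.
stems = ["occhi", "piu", "tant", "piu", "occhi", "piu", "vid"]

# nltk's FreqDist subclasses collections.Counter, so Counter
# behaves the same way for the lookups shown here.
fdist = Counter(stems)

# Look up individual stems by name.
# A stem that never appears counts as 0 instead of raising an error.
for stem in ["piu", "occhi", "duca"]:
    print(stem, fdist[stem])

# The two most frequent stems, as (stem, count) pairs.
top_two = fdist.most_common(2)
print(top_two)
```

Looking stems up by name like this is handy once a frequency list has sparked a question and you want to compare a handful of related words side by side.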
Pretend you’re a grad student in Italian literature.
What else can you think of?

Further reading

Programming is for everyone, including humanists! Check out the resources below to learn more about NLTK and data analysis with Python.

DataCamp, a great resource for learning about Natural Language Processing and other data analysis with Python
The Programming Historian, a great resource for digital humanities tutorials
Coding & English Lit: Natural Language Processing in Python, a tutorial about analyzing English text
How to Use NLTK Porter & Snowball Stemmers
Removing Stop Words with NLTK in Python