A non-technical intro to NLP

Analyzing inaugural speeches of presidents

Divyansh Rai · Jan 6

While neural networks and CNNs have made giant leaps in the field of computer vision, natural language processing goes underappreciated.

It is often overlooked because it has not yet surpassed human-level performance.

Yet, as we’re going to see through this series, we can build some pretty nifty tools that help us not only gain insights but also automate tasks.

All the code mentioned is available here.

You might have to copy some of the helper functions I’ve written from the GitHub link; including them all here would make everything too cluttered.

To start off, we’re going to do a basic analysis of American presidents’ inaugural speeches, right from the first president through Obama’s 2009 speech.

There are three libraries we’re going to use here:

1. nltk — for the corpora, tokenization, and frequency distributions
2. pyphen — to separate words into syllables
3. matplotlib — well, for plotting

All of these can be installed using pip install.

You’ll need to download the corpora. You can do that by executing the code given below:

```python
nltk.download('inaugural')
nltk.download('stopwords')
```

Or you can just execute nltk.download() and download “inaugural” and “stopwords” in the corpora section after the downloader pops up, as shown in the screen capture below.

You can explore other corpora this way too.

(Screen capture: how to download an nltk corpus)

Now we import the nltk package and the speeches with the following code (this might take a few seconds depending on your computer):

```python
import nltk
from nltk.corpus import stopwords
from nltk.corpus import inaugural
from nltk.tokenize import word_tokenize, sent_tokenize
import matplotlib.pyplot as plt
import pyphen
```

As we’ve imported the inaugural speeches now, we can take a look at the data.

We can see that we’ve got data for 56 inaugural speeches, from Washington to Obama in 2009.

Let’s take a look at the speeches. To get the raw format of the data, we can simply use inaugural.raw(). But as we can see, we can’t clearly divide it into words. Fortunately, we’ve got inaugural.words() to do the work for us.

Now that we’ve got our speech broken down into words, we can start doing some basic analysis on it.

We start by getting a frequency distribution.

This will tell us how many times a particular word comes up. Plus, it’s already arranged in decreasing order.

We’ve run into a problem: the distribution is overwhelmed by the presence of stopwords.

Often there are words and punctuation marks that are repeated far more often than others, yet they don’t usually give us much information about the data.

nltk already has a small list of words like this, and they are called stopwords.

They are available for multiple languages.

We add a few more symbols to the list of stop words imported.

So now that we have a list of stop words, we write a small piece of code to delete all the stop words from the speech and compute the frequency distribution again.

That’s better and gives us a few insights about the data too.

But this only gives us data about one speech; we need something that’ll allow us to compare multiple presidents’ speeches together.

So we start counting how many 2-, 3-, and 4-letter words president xyz used.

We then take the average letter count per word of each president and plot it.

“I feel happy” has an average letter count per word of 3.33 ((1+4+5)/3). “I exude euphoria” has an average letter count per word of 4.67 ((1+5+8)/3). A higher average letter count per word would mean the president mostly used “big” words.

We first count how many x letter words were used by each president.

While we were doing that, we also stored the average letter count per word in a variable called presidents_avg .

Using matplotlib to chart it, we can see that it has clearly decreased over time/presidents.

Going down a similar path, we start counting how many words were spoken by president xyz in one sentence.

We then take the average word count per sentence for each president and plot it.

A higher average word count per sentence would mean the president mostly used “big” sentences.

We also store the average word count per sentence in a variable called presidents_avg_words_per_sentence .

Using matplotlib to chart it, we can see that it has clearly decreased over time/presidents.

Now let’s see if analysis of hapaxes can get us anything.

In corpus linguistics, a hapax legomenon is a word that occurs only once within a context/speech.


hapaxes() gives us all the words that occur only once in the given corpus. But counting the number of unique words is not enough; we also need to divide it by the total length of the speech. Why? Because speech length varies a lot, so a longer speech may have more unique words, and we need to remove that bias.

So we find the unique words in each speech, count them up, divide by the speech length, and plot the result for each president.

It seems to be decreasing a little, though it’s not very apparent.

For the final analysis we calculate the syllables per word used by each president in his speech. We use the pyphen library for this, since proper nouns like “Afghanistan” have no predefined number of syllables.

We then take the average syllable per word for each president and plot it.

When we see the graph, we see that it has decreased over time.

So comparatively speaking, presidents nowadays use smaller words and shorter sentences as compared to the earlier presidents.

This can be due to many reasons: the English language itself evolves a lot over a period of 200 years, but it can also be due to advances in media.

As presidents’ speeches started to reach the common man, who’d naturally prefer shorter sentences and smaller words, the speeches started to change according to the new audience.

They were less about impressing few educated men in Washington, and more about getting votes from the common man.

