Testing the waters, with NLTK

Here comes the second 'creative' GOT analogy: libraries like NLTK are basically the Bran of modern-day computers (last one, promise). With built-in, pre-written code, these libraries perform sophisticated text processing in the background, while their users only have to worry about knowing which method accomplishes what. They have made text comprehension, interpretation and sentiment analysis possible with only a few lines of code. The following paragraph aptly summarizes the importance of NLP in today's context:

"Today's machines can analyze more language-based data than humans, without fatigue and in a consistent, unbiased way. Considering the staggering amount of unstructured data that's generated every day, from medical records to social media, automation will be critical to fully analyze text and speech data efficiently."

This next section is where the real work begins!

Preprocessing

So how does one even begin? If you think about it, it's kind of intuitive. Think about how babies learn to understand language: word by word. To be able to understand a whole, we first need to be able to decipher something at the elemental level. You cannot connect the dots without first knowing what each dot represents. The first steps in NLP involve breaking down language into words and sentences, a.k.a. tokens, and then trying to map how these tokens relate to each other and how they form meaning.

It is important to note that 99% of the work in NLP is preprocessing and organizing the data, which involves tokenization and parsing, lemmatization/stemming, part-of-speech tagging, language detection and identification of semantic relationships. Only after these tasks are done can you begin to analyze the data. NLTK makes this 99% a lot easier.

For my first attempt at NLP using NLTK, I chose to process speeches by the incumbent POTUS, Donald Trump. I found a GitHub repository of Trump speeches from 2016 and 2017, which contained a total of 73 speeches in the form of text files.
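As a rough sketch, getting one of those text files into Python is all the setup needed before NLTK takes over. The file name below is a hypothetical placeholder, not the actual name used in the repository:

# Read one downloaded speech into a single raw string.
# "speech_01.txt" is a placeholder for whichever speech you pick.
with open("speech_01.txt", encoding="utf-8") as f:
    raw_speech = f.read()

print(raw_speech[:200])  # peek at the first few hundred characters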
I randomly chose one speech to get started, and I used the following NLTK methods to explore it: a) Tokenizing, b) Removing Stop Words, c) Parts of Speech Tagging, d) Concordance.

TOKENIZING

The first task of any NLP project is tokenizing. It is the process of splitting up a giant string (my raw data as a .txt file) into a list of words or sentences. Commonly, the string is first tokenized into a list of sentences, and then into a list of lists of words for each of those sentences, which feels quite intuitive.

An example of tokenizing:

['Hello World! I am learning to use NLTK.']

Step 1: Sentence tokenizing:

['Hello World!', 'I am learning to use NLTK.']

Step 2: Word tokenizing:

[ [ 'Hello', 'World', '!' ], [ 'I', 'am', 'learning', 'to', 'use', 'NLTK', '.' ] ]

So to get this process started, I first imported the following libraries in a Jupyter notebook and read my text file:

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.text import Text
import string, re

I then tokenized the speech into a list of strings, one for each sentence, and wrote a function that removes any punctuation from a string using the re.sub() method from the regular expressions library. For ease of display here, I chose to work with only the first 15 sentences of the speech.

REMOVING STOP WORDS

Stop words are small words that can be ignored during language processing without changing the meaning of the sentence. Removing them improves efficiency (speed, memory usage) without affecting efficacy. NLTK has stop word lists for 16 different languages. I imported the one for English and wrote a remove_stopwords() function to find and remove stop words from the sentences. After removing stop words, the total number of words dropped from 99 to 63, roughly a 36% reduction. Consider that saving in the context of a corpus containing thousands of words.

PARTS OF SPEECH TAGGING

Ok, on to the more exciting work.
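Putting the pieces above together, here is a minimal sketch of the whole exploration, from tokenizing through to concordance. It follows the steps described in this post (re.sub() for punctuation, NLTK's English stop word list, pos_tag for tagging, Text.concordance for looking up a word in context), but the file name, helper function names and the word searched in the concordance are my own illustrative choices, not necessarily the original code:

import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.text import Text

# One-time downloads of the NLTK resources these methods rely on.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

# Hypothetical file name for the chosen speech (as in the loading sketch above).
with open("speech_01.txt", encoding="utf-8") as f:
    raw_speech = f.read()

# Step 1: split the raw string into a list of sentences (first 15 only).
sentences = sent_tokenize(raw_speech)[:15]

def remove_punctuation(sentence):
    # Drop anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]", "", sentence)

# Step 2: split each (punctuation-free) sentence into a list of words.
tokenized = [word_tokenize(remove_punctuation(s)) for s in sentences]

# Remove stop words using NLTK's English stop word list.
stop_words = set(stopwords.words("english"))

def remove_stopwords(words):
    return [w for w in words if w.lower() not in stop_words]

filtered = [remove_stopwords(words) for words in tokenized]

# Part-of-speech tagging: label each remaining word with its POS tag.
tagged = [nltk.pos_tag(words) for words in filtered]
print(tagged[0])

# Concordance: show every occurrence of a word together with its context.
speech_text = Text(word_tokenize(raw_speech))
speech_text.concordance("America")

One small design note: the stop word comparison lowercases each token before checking it against the list, since NLTK's stop word list is all lowercase and a sentence-initial "The" would otherwise slip through.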
