A Light Introduction to Text Analysis in R

Brian Ward, May 3

Working with Corpora, Document-Term Matrices, Sentiment Analysis, etc.

Introduction

This is a quick walk-through of my first project working with some of the text analysis tools in R.
The goal of this project was to explore the basics of text analysis, such as working with corpora, document-term matrices, and sentiment analysis.

Packages used

tm
SentimentAnalysis
syuzhet
Other: tidyverse, SnowballC, wordcloud, RColorBrewer, ggplot2, RCurl

Quick Look at the Data Source

I am using the job descriptions from my latest web-scraping project, which is about 5,300 job postings pulled from Indeed.
We are going to focus on the job descriptions here, as they contain the most text and information.
Let’s take a look at our first job description to see what we’re working with.
postings1$job_description[1]

As you can see, it is one large string containing all of the text from the job listing.

Creating a Corpus

A corpus (plural: corpora) is just a format for storing textual data that is used throughout linguistics and text analysis.
It usually contains each document or set of text, along with some meta attributes that help describe that document.
Let’s use the tm package to create a corpus from our job descriptions.

corpus <- SimpleCorpus(VectorSource(postings1$job_description))
# And let's see what we have
view(corpus)

You can see that our outermost list is of type = list, with a length = 5299, the total number of job descriptions (or documents) we have.
When we look at the first item in that list, corpus[[1]], we see that it is also of type = list, with a length = 2.
These two items are content and meta.
Content is of type = character and contains the job description text as a string.
Meta is of a type = list, with a length of 7.
These are the 7 meta attributes that are automatically added to the simple corpus even though I did not have any values for them.

author = empty
datetimestamp = another list… but empty for my data
description = empty
heading = empty
id = ‘1’ (automatically created by position)
language = ‘en’ (the default of the tm package, I’m assuming)
origin = empty

And there you have it.
That’s the general format of a simple corpus.
Keep in mind that you can edit the meta attributes to include whatever you want.
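
To make the corpus structure and the metadata editing concrete, here is a minimal sketch on two made-up job descriptions. The names `toy_descriptions` and `toy_corpus` and the author value are hypothetical. One assumption to flag: the tm documentation notes that a SimpleCorpus does not store metadata for individual documents, so this sketch swaps in a VCorpus, where per-document edits are kept.

```r
library(tm)

# Hypothetical stand-in for postings1$job_description
toy_descriptions <- c(
  "We are looking for a data analyst with strong SQL skills.",
  "Join our support team and help customers every day."
)

# VCorpus (not SimpleCorpus) so that per-document metadata edits are kept
toy_corpus <- VCorpus(VectorSource(toy_descriptions))

length(toy_corpus)          # number of documents
content(toy_corpus[[1]])    # the text of the first document
meta(toy_corpus[[1]])       # its metadata (author, datetimestamp, ...)

# Fill in an attribute that starts out empty (made-up value)
meta(toy_corpus[[1]], "author") <- "Example Company"
meta(toy_corpus[[1]], "author")
```

If you only need the text and a fast path to a document-term matrix, the SimpleCorpus used in this post is enough. The VCorpus is the better fit when the metadata itself matters.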

Transformations: Cleaning our Corpus

Transformations in the tm package refer to the pre-processing or formatting of the text that we might want to do before any analysis.
We are going to perform 5 quick transformations that will prepare our data for the analysis.

# 1. Stripping any extra white space
corpus <- tm_map(corpus, stripWhitespace)
# 2. Transforming everything to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# 3. Removing numbers
corpus <- tm_map(corpus, removeNumbers)
# 4. Removing punctuation
corpus <- tm_map(corpus, removePunctuation)
# 5. Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))

Most of these transformations are self-explanatory, except for the removal of stop words.
What exactly does that mean? Stop words are common words that were determined to be of little value for certain kinds of text analysis, such as sentiment analysis.
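
To see what these cleaning steps do to a single piece of text, here is a small sketch that applies the same tm functions directly to one made-up sentence instead of a whole corpus. The names `raw`, `clean`, and `tokens` are only illustrative. Note that the order is slightly different from the steps above: the whitespace is stripped last here, so the gaps left behind by removed words are collapsed too.

```r
library(tm)

# Made-up sentence standing in for a job description
raw <- "We are hiring 3 Data Analysts;   the role requires SQL and communication skills!"

clean <- tolower(raw)
clean <- removeNumbers(clean)
clean <- removePunctuation(clean)
clean <- removeWords(clean, stopwords("english"))
clean <- stripWhitespace(clean)   # collapse the runs of spaces left by removed words
clean <- trimws(clean)            # base R: drop leading and trailing spaces

tokens <- strsplit(clean, " ", fixed = TRUE)[[1]]
print(tokens)
```

Words such as "we", "are", "the", and "and" are gone from the result, while the terms that describe the job itself remain.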
Here is the list of stop words that the tm package will remove.

stopwords("english")

Now that we have transformed our job descriptions, let’s take a look at our first listing again to see what has changed.

corpus[[1]]$content

Stemming

Stemming is the process of collapsing words to a common root, which helps in the comparison and analysis of vocabulary.
The tm package uses the Porter stemming algorithm to complete this task.
Let’s go ahead and stem our data.

corpus <- tm_map(corpus, stemDocument)

And now let’s take a look at our job description one last time to see the differences.

corpus[[1]]$content

Great, now all of the job descriptions are cleaned up and simplified.
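
As a quick illustration of what stemming does, the sketch below calls the SnowballC stemmer directly on a handful of words. The word list is made up, and calling `wordStem()` by hand is only for demonstration; in the walk-through the stemming happens inside `stemDocument`.

```r
library(SnowballC)

words <- c("learning", "learned", "learns", "running", "runs")
stems <- wordStem(words, language = "english")

# Pair each word with its stem to see which forms collapse together
data.frame(word = words, stem = stems)
```

The different forms of "learn" and "run" end up as the same term, which is exactly what keeps them from being counted separately later on. Keep in mind that a stem is not always a dictionary word.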

Creating a Document-Term Matrix (DTM)

A document-term matrix is a simple way to compare all the terms or words across each document.
If you view the data simply as a matrix, each row represents a unique document and each column represents a unique term.
Each cell in that matrix will be an integer of the number of times that term was found in that document.

DTM <- DocumentTermMatrix(corpus)
view(DTM)

As you can see, the DTM is not actually stored as a matrix in R, but is of type = simple_triplet_matrix, which is just a more efficient way of storing the data.
You can get a better idea of how they are formatted here.
For our purposes, it’s better to think of it as a matrix which we can see with the inspect() function.

inspect(DTM)

So, we can see that we have 5296 documents (three NAs were removed) with 41735 terms.
We can also see an example matrix of what the DTM looks like.
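
Here is a small sketch of the same idea on two made-up, already-cleaned documents, showing both the sparse triplet storage and the dense matrix view. The names `docs`, `toy_dtm`, and `toy_matrix` are hypothetical.

```r
library(tm)

# Hypothetical, already-cleaned documents
docs <- c("data analyst sql data", "sql developer")
toy_dtm <- DocumentTermMatrix(SimpleCorpus(VectorSource(docs)))

# Sparse storage: only the non-zero cells are kept, as three parallel vectors
toy_dtm$i   # row (document) index of each non-zero cell
toy_dtm$j   # column (term) index of each non-zero cell
toy_dtm$v   # the count stored in that cell

# Dense view: one row per document, one column per term
toy_matrix <- as.matrix(toy_dtm)
toy_matrix
```

The first document contains "data" twice, so its cell in the "data" column is 2, while the second document gets a 0 there. In the sparse form that zero is simply not stored.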
Now let’s take a look at what the most frequent words are across all of the job postings.

Creating a Word Cloud of the Most Frequent Terms

To do this, we are first going to convert the DTM into a matrix so that we can sum the columns to get a total term count across all of the documents.
I can then pick out the top 75 most frequent words throughout the entire corpus.
Note: I chose to use a non-stemmed version of the corpus so that we would have the full words for the word cloud.

sums <- as.data.frame(colSums(as.matrix(DTM)))
sums <- rownames_to_column(sums)
colnames(sums) <- c("term", "count")
sums <- arrange(sums, desc(count))
head <- sums[1:75, ]
wordcloud(words = head$term, freq = head$count, min.freq = 1000, max.… brewer.pal(8, "Dark2"))

So, nothing too crazy here, but we can get a good sense of how powerful this tool can be.
Terms like support, learning, understanding, and communication can help paint a picture of what these companies are looking for in a candidate.
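
One practical note on the term counts: a dense 5296 x 41735 matrix has roughly 221 million cells, most of them zero. A possible alternative, sketched below on made-up documents, is to sum the columns directly on the sparse DTM with `col_sums()` from the slam package (a dependency of tm). The names `docs`, `toy_dtm`, `totals`, and `term_counts` are hypothetical, and this is a swapped-in approach rather than what the walk-through above does.

```r
library(tm)   # slam is installed alongside tm and provides col_sums()

# Made-up documents
docs <- c("support learning support", "learning communication", "support team")
toy_dtm <- DocumentTermMatrix(SimpleCorpus(VectorSource(docs)))

# Column sums computed on the sparse representation, no dense matrix needed
totals <- slam::col_sums(toy_dtm)

term_counts <- data.frame(term = names(totals), count = as.numeric(totals))
term_counts <- term_counts[order(-term_counts$count), ]
head(term_counts, 3)
```

The resulting `term` and `count` columns have the same shape as the data frame passed to `wordcloud()` above, so they could be used in its place.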

Sentiment Analysis

“Sentiment (noun): a general feeling, attitude, or opinion about something” (Cambridge English Dictionary)

Sentiment analysis is simple in its goal but complicated in the process of achieving that goal.
Sanjay Meena has a great introduction worth checking out: “Your Guide to Sentiment Analysis” (medium.com).

We are going to start out using the SentimentAnalysis package to do a simple polarity analysis using the Harvard-IV dictionary (General Inquirer), which is a dictionary of words associated with positive (1,915 words) or negative (2,291 words) sentiment.

sent <- analyzeSentiment(DTM, language = "english")
# We're going to select just the Harvard-IV dictionary results
sent <- sent[, 1:4]
# Organizing it as a data frame
sent <- as.data.frame(sent)
# Now let's take a look at what these sentiment values look like
head(sent)

As you can see, each document has a word count, a negativity score, a positivity score, and the overall sentiment score.
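
To show how these four columns relate to each other, here is a base R sketch of a dictionary-based polarity score. It rests on my understanding that the overall score is the positive word count minus the negative word count, divided by the total word count; treat that as an assumption about the package rather than its exact implementation. The tiny word lists and the `polarity()` helper are made up for illustration and are not the Harvard-IV dictionary.

```r
# Hypothetical mini-dictionary; the real Harvard-IV lists are far larger
positive_words <- c("great", "good", "strong")
negative_words <- c("poor", "difficult")

# Illustrative helper, not part of any package
polarity <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  n_pos <- sum(tokens %in% positive_words)
  n_neg <- sum(tokens %in% negative_words)
  c(WordCount  = length(tokens),
    Negativity = n_neg / length(tokens),
    Positivity = n_pos / length(tokens),
    Sentiment  = (n_pos - n_neg) / length(tokens))
}

polarity("Great team, good benefits, poor parking")
```

With two positive words and one negative word out of six, the overall score is 1/6. A score of 0 therefore means that positive and negative words balance out, or that none were found at all.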
Let’s take a look at the distribution of our overall sentiment.

summary(sent$SentimentGI)

Okay, so overall our job descriptions are positive.
The minimum score in all of the documents was 0.0, so it looks like the companies were doing a good job writing their job descriptions.
Now, for fun, let’s take a look at the top and bottom 5 companies based on their sentiment score.

# Start by attaching to the other data, which has the company names
final <- bind_cols(postings1, sent)
# Now let's get the top 5
final %>%
  group_by(company_name) %>%
  summarize(sent = mean(SentimentGI)) %>%
  arrange(desc(sent)) %>%
  head(n = 5)
# And now let's get the bottom 5
final %>%
  group_by(company_name) %>%
  summarize(sent = mean(SentimentGI)) %>%
  arrange(sent) %>%
  head(n = 5)

And there you go.
Now, this is obviously not a great use case for sentiment analysis, but it was a good introduction to understanding the process.
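
The group-and-rank step can be tried on a small made-up table first. In the sketch below, the data frame `toy` and its scores are invented, and the `n_postings` column is an addition of mine that is not in the original code.

```r
library(dplyr)

# Made-up scores standing in for the joined postings + sentiment table
toy <- data.frame(
  company_name = c("A", "A", "B", "B", "C"),
  SentimentGI  = c(0.10, 0.20, 0.30, 0.40, 0.05)
)

company_sent <- toy %>%
  group_by(company_name) %>%
  summarize(sent = mean(SentimentGI), n_postings = n())

top    <- company_sent %>% arrange(desc(sent)) %>% head(n = 2)
bottom <- company_sent %>% arrange(sent) %>% head(n = 2)
top
bottom
```

Keeping the number of postings next to the mean is a small design choice: a company with a single posting can easily land at either extreme, so the count helps when reading the ranking.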

Emotions

One more fun thing we can do is pull out emotions from the job descriptions.
We will do this with the syuzhet package using the NRC emotion lexicon, which relates words with associated emotions as well as a positive or negative sentiment.

sent2 <- get_nrc_sentiment(postings1$job_description)
# Let's look at the corpus as a whole again:
sent3 <- as.data.frame(colSums(sent2))
sent3 <- rownames_to_column(sent3)
colnames(sent3) <- c("emotion", "count")
ggplot(sent3, aes(x = emotion, y = count, fill = emotion)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(legend.… major = element_blank()) +
  labs(x = "Emotion", y = "Total Count") +
  ggtitle("Sentiment of Job Descriptions") +
  theme(plot.title = element_text(hjust = 0.…))

We already know that the job descriptions are mostly positive, but it is interesting to see trust and anticipation with higher values as well.
It is easy to see how this could be applied to other types of data such as reviews or comments to simplify large sets of textual data into quick insights.
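
For anyone who wants to try the emotion step without the scraped data, here is a self-contained sketch on two made-up sentences. The names `toy_text`, `emotions`, `totals`, and `emotion_counts` are hypothetical, and the theme settings (`legend.position = "none"`, `hjust = 0.5`) are my own assumptions for a clean, centered chart rather than the exact values used for the chart in this post.

```r
library(syuzhet)
library(ggplot2)

# Hypothetical stand-in for postings1$job_description
toy_text <- c(
  "We trust our team and are excited to welcome a new analyst.",
  "This role involves difficult deadlines and stressful problems."
)

# One row per text, one column per NRC category
emotions <- get_nrc_sentiment(toy_text)
totals   <- colSums(emotions)
emotion_counts <- data.frame(emotion = names(totals), count = as.numeric(totals))

p <- ggplot(emotion_counts, aes(x = emotion, y = count, fill = emotion)) +
  geom_col() +                                     # same as geom_bar(stat = "identity")
  theme_minimal() +
  theme(legend.position = "none",                  # assumed value
        plot.title = element_text(hjust = 0.5)) +  # assumed value: centered title
  labs(x = "Emotion", y = "Total Count") +
  ggtitle("Sentiment of Job Descriptions")
p
```

The result has ten columns: eight emotions plus the positive and negative totals. That is why positive and negative show up as bars next to the emotions in the chart.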
Thanks for reading, I hope this walk-through might have helped some other beginners trying to get started with some of R’s text analysis packages.
I would love to hear any questions or feedback, as I am just getting started myself.