Presidential Debate Sentiment Analysis with LSTM, OneVsRest, LinearSVC: NLP Step-By-Step Guide

What are the most frequently used words in positive and negative tweets? You will learn fundamental Natural Language Processing skills, including:

- Text pre-processing
- Tokenization
- Word embedding with TF-IDF
- Modelling with LSTM, logistic regression, OneVsRest, LinearSVC, etc.
- Evaluation with F1 score, precision, recall, and accuracy

An end-to-end roadmap for NLP projects is provided at the end of this article.

Load data and take a quick look into the data

import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical
import re
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import RidgeClassifier

There are 21 columns in the dataset.

We only keep the columns “text” and “sentiment” here.
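A minimal sketch of this step (the file name “Sentiment.csv” is an assumption; adjust it to wherever your copy of the dataset lives):

df = pd.read_csv('Sentiment.csv')  # assumed file name for the debate tweets
df = df[['text', 'sentiment']]     # keep only the two columns we need
df.head()                          # quick look at the data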

Check the shape of the dataset: we have 13,871 records.

There are 3 unique values in the “sentiment” column.

Note that the dataset is imbalanced, which means the number of records in each category is not equal.

Randomly check a tweet from the dataset.
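Both checks are one-liners, for example:

df['sentiment'].value_counts()   # shows the imbalanced class distribution
df['text'].sample(1).values      # inspect one random tweet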

Split the dataset into random train, validation and test subsets

“train_test_split” is a method in scikit-learn that splits arrays or matrices into random train and test subsets.

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.33, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

Now we have 3 subsets: train, validation, and test.

X_train includes the tweets, and y_train includes the corresponding sentiments.

Text Pre-processing

Define a function text_prepare for text pre-processing that handles the following tasks:

- replace symbols in “REPLACE_BY_SPACE_RE” with white space in the input text
- delete symbols in “BAD_SYMBOLS_RE” from the input text
- extend the stop words list with ‘rt’ and ‘http’
- remove stop words from the text

Process the text in the training dataset as in the sketch below.
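A minimal sketch of such a function; the two regular expressions are common choices and an assumption here, not taken from the original code:

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')   # symbols replaced by a space
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')           # symbols deleted outright
STOPWORDS = set(stopwords.words('english')) | {'rt', 'http'}  # extended stop words list

def text_prepare(text):
    text = text.lower()                        # lowercase the tweet
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace listed symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)        # delete bad symbols
    return ' '.join(w for w in text.split() if w not in STOPWORDS)  # drop stop words

X_train = X_train.apply(text_prepare)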

Then apply the same function to the validation and test datasets.

What are the most common words?

For each word, count how many times it occurs in the train dataset, then sort the dictionary to fetch the top 10 most common words.
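A minimal sketch using collections.Counter:

from collections import Counter

words_counts = Counter(word for text in X_train for word in text.split())
print(words_counts.most_common(10))   # top 10 most common words in the train set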

Word Embedding with TF-IDF

Machine Learning algorithms work with numeric data, and we cannot use the provided text data like “@JebBush said he cut FL taxes by $19B” directly.

We need to transform the text data into numeric vectors before feeding it to the models; this transformation is called “word embedding” or vectorization.

TF-IDF

The TF-IDF approach (Term Frequency - Inverse Document Frequency) extends the bag-of-words framework by taking into account the total frequencies of words in the entire dataset of collected tweets. Compared with bag-of-words, TF-IDF penalizes overly frequent words and provides a better feature space.

Use the class TfidfVectorizer from scikit-learn:

- Filter out too-rare words (occurring in fewer than 5 tweets)
- Filter out too-frequent words (occurring in more than 90% of the tweets)
- Use 2-grams along with 1-grams

This maps text -> vectors (see the sketch below). Finally, we are ready to try out different models.
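A minimal sketch, assuming the thresholds above map to min_df, max_df, and ngram_range:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=5, max_df=0.9, ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)  # learn vocabulary on train only
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(X_test)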

1st Model: Logistic Regression

Use LogisticRegression from sklearn.linear_model:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='warn', tol=0.0001, verbose=0, warm_start=False)

Cross-validation mean accuracy 67.86%, std 0.38.
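A minimal sketch of the cross-validation step (cv=5 is an assumption):

log_reg = LogisticRegression()
scores = cross_val_score(log_reg, X_train_tfidf, y_train, cv=5, scoring='accuracy')
print('mean accuracy %.2f%%, std %.2f' % (scores.mean() * 100, scores.std() * 100))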

2nd Model: LinearSVC

Call LinearSVC from sklearn.svm.

Cross-validation mean accuracy 63.76%, std 0.45.
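The same cross-validation sketch works here:

svc = LinearSVC()
svc_scores = cross_val_score(svc, X_train_tfidf, y_train, cv=5, scoring='accuracy')
print('mean accuracy %.2f%%, std %.2f' % (svc_scores.mean() * 100, svc_scores.std() * 100))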

3rd Model: OneVsRest

Call OneVsRestClassifier from sklearn.multiclass.

Evaluation of the OneVsRestClassifier

An interpretation of the evaluation criteria can be found in this document. F1-micro is preferred because our classes are imbalanced. The difference between micro- and macro-averages can be found at this link.
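A minimal sketch; using LinearSVC as the base estimator is an assumption:

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

ovr = OneVsRestClassifier(LinearSVC())
ovr.fit(X_train_tfidf, y_train)
y_val_pred = ovr.predict(X_val_tfidf)
print('accuracy:', accuracy_score(y_val, y_val_pred))
print('F1-micro:', f1_score(y_val, y_val_pred, average='micro'))
print('precision-micro:', precision_score(y_val, y_val_pred, average='micro'))
print('recall-micro:', recall_score(y_val, y_val_pred, average='micro'))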

4th Model: LSTM with Keras

Recall that we imported the Keras libraries before:

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from keras.utils.np_utils import to_categorical

Save the output of text pre-processing in another pandas data frame “X”.

Use Tokenizer from Keras, then create the LSTM model. Pay attention to the activation function and the adam optimizer. A sketch covering these steps, together with label encoding and training, is given below.
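A minimal sketch, assuming “X” has “text” and “sentiment” columns; the vocabulary size, sequence length, embedding dimension, and batch size are assumptions, not values from the original code:

MAX_NB_WORDS = 5000          # assumed vocabulary size
MAX_SEQUENCE_LENGTH = 50     # assumed max tweet length in tokens
EMBEDDING_DIM = 100          # assumed embedding dimension

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(X['text'].values)
sequences = tokenizer.texts_to_sequences(X['text'].values)
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = pd.get_dummies(X['sentiment']).values   # one-hot encode Positive/Neutral/Negative

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))        # 3 sentiment classes
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(data, labels, epochs=20, batch_size=64, validation_split=0.2)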

Encode the prediction column “sentiment” (Positive, Neutral, Negative) and train the LSTM model for 20 epochs, as in the sketch above.

Congrats! You just went through the fundamental techniques in Natural Language Processing, including:

- Text pre-processing
- Tokenization
- Word embedding with TF-IDF
- Modelling with LSTM, logistic regression, OneVsRest, LinearSVC, etc.
- Evaluation with F1 score, precision, recall, and accuracy

The data source used in this project can be found at this link.

The Roadmap of NLP projects can be downloaded at this link.

The next step would be getting your hands dirty by playing with a cat.

Nope, I mean by coding it up.

Good luck.

For more details, see “Sentiment Analysis on Lourdes (my cat)”.
