Machine Learning with Python: NLP and text recognition
Roberto Sannazzaro
Feb 3
In this article I would like to apply a series of NLP techniques to a dataset containing reviews of businesses. After that, I will train a Logistic Regression model to predict whether a review is “positive” or “negative”.
No, I will not talk about drones in this article.
The natural language processing field contains a series of tools that are very useful for extracting, labelling and predicting information starting from raw text data.
This collection of techniques is mainly used in emotion recognition, text tagging (for example, to automate the sorting of complaints from a client), chatbots and voice assistants.
The dataset:
This article uses a reduced version of the Yelp dataset. This version contains a collection of 1000 observations, originally in JSON format, then converted into .
For this article the review dataset will be used:
A glimpse of the dataset.
The dataset is made up of 9 features (‘business_id’, ‘cool’, ‘date’, ‘funny’, ‘review_id’, ‘stars’, ‘text’, ‘useful’, ‘user_id’) and contains a collection of reviews written by Yelp users. For each review the user gave a score from 1 to 5 stars.
To build a model able to predict whether a review is “positive” or “negative”, it is possible to start from a model that takes the text variable as the predictor and the stars variable as the target.
An observation from the ‘text’ variable.
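As a minimal sketch of keeping only the predictor and the target, assuming the reviews are available as one JSON object per line (the two sample records below are invented for illustration and are not taken from the real dataset):

```python
import json

# Hypothetical sample in JSON-lines form; the real file has 1000 reviews
# and all 9 features listed above.
raw = """
{"review_id": "r1", "stars": 5, "text": "Great burger!", "useful": 0}
{"review_id": "r2", "stars": 1, "text": "Cold fries.", "useful": 2}
"""

# Parse one record per line.
records = [json.loads(line) for line in raw.strip().splitlines()]

# Keep only the two columns used by the model: 'text' and 'stars'.
data = [{"text": r["text"], "stars": r["stars"]} for r in records]

print(data)
```

Everything else (ids, dates, votes) is dropped because the model described here only looks at the review text.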
Data preprocessing and exploratory analysis:
Once the dataset is reduced to 2 columns it is possible to conduct a small exploratory analysis.
It is important to know how the target variable (stars received) is distributed, because this reveals whether the dataset is biased, that is, whether there is an imbalance between positive and negative reviews. Such an imbalance influences the results of the model, making it more inclined to predict the outcomes that are most frequent in the training set.
As the plot shows, positive reviews (5 stars) make up the largest share, which creates an imbalance or, as said before, a bias.
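A quick way to check this distribution without a plot is to count how often each rating occurs. The sketch below uses a short, made-up list of ratings as a stand-in for the ‘stars’ column:

```python
from collections import Counter

# Hypothetical sample of star ratings; the real values come from the 'stars' column.
stars = [5, 4, 5, 1, 3, 5, 2, 4, 5, 5]

counts = Counter(stars)
total = len(stars)

# Print each rating with its share of the dataset, from 1 to 5 stars.
for rating in range(1, 6):
    share = counts[rating] / total
    print(f"{rating} stars: {counts[rating]:>3} reviews ({share:.0%})")
```

If one rating dominates the counts, the same imbalance will show up in the predictions of the model.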
To obtain useful results it is necessary to reduce the complexity of the problem. An efficient way to do so is to divide the reviews into positive and negative, and to use this division as the dependent variable.
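The division into positive and negative can be sketched as below. The cut-off is an assumption made for this example (4 stars or more counts as positive), since the article does not state which threshold was used, and `to_sentiment` is a hypothetical helper name:

```python
def to_sentiment(stars: int, positive_threshold: int = 4) -> int:
    """Return 1 (positive) if stars >= positive_threshold, else 0 (negative).

    The default threshold of 4 is an assumption, not a value from the article.
    """
    return 1 if stars >= positive_threshold else 0

# Hypothetical star ratings standing in for the 'stars' column.
stars = [5, 4, 5, 1, 3, 5, 2, 4, 5, 5]
labels = [to_sentiment(s) for s in stars]

print(labels)
print(f"positive: {sum(labels)}, negative: {len(labels) - sum(labels)}")
```

Moving the threshold changes how imbalanced the two classes are, so it is worth checking the class counts after this step.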
Before proceeding with any other visualization, it is necessary to apply to the text some preprocessing procedures that are very common in NLP:
Remove any non-useful characters (slashes, punctuation, HTML tags, question marks)
Convert the whole text to lowercase characters
Two small functions implementing these steps come in very useful for preprocessing the text as described above. From the cleaned text it is also possible to determine which single words and which combinations of words (bigrams and trigrams) are most common:
Combinations of words made out of 2 and 3 words.
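The cleaning steps and the n-gram counts can be sketched with the standard library only. The function names `clean_text` and `top_ngrams` and the sample reviews are hypothetical, and the regular expressions are one possible choice, not necessarily the ones used for the plots in this article:

```python
import re
from collections import Counter

def clean_text(text: str) -> str:
    """Lowercase the text and strip HTML tags, punctuation and extra spaces."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop anything that is not a-z, 0-9 or whitespace
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

def top_ngrams(texts, n=2, k=3):
    """Return the k most common n-grams (as word tuples) across the texts."""
    counts = Counter()
    for text in texts:
        words = clean_text(text).split()
        # zip over n shifted copies of the word list to build the n-grams
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts.most_common(k)

# Invented sample reviews, for illustration only.
reviews = [
    "The burger was <b>great</b>!",
    "Great burger, great fries.",
    "The burger was cold...",
]

print(clean_text(reviews[0]))
print(top_ngrams(reviews, n=2, k=2))
print(top_ngrams(reviews, n=3, k=2))
```

Note that this simple cleaning also splits contractions (for example "don't" becomes "don t"), which may or may not be acceptable depending on the analysis.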
After a small indexing adjustment we can create a bubble chart displaying the most common words in positive and negative reviews:
Never trust the ‘burgers’.
And for the positive reviews:
The indie atmosphere is always appreciated.
After these short but interesting insights, we can proceed to the next phase: model creation.
The model:
Logistic Regression is a very simple, fast-to-train and effective algorithm, and the scikit-learn library provides a tool that helps to build this model. Before doing this, and before the classical split between train and test set, it is necessary to perform a few steps: stemming, removal of stopwords and vectorization.
Stemming reduces every word to its root, which avoids ‘dispersion’ in the text: for example, ‘reviews’, ‘reviewed’ and ‘reviewing’ are all reduced to ‘review’. Mapping irregular forms such as ‘am’, ‘are’, ‘is’ to ‘be’ is instead the job of lemmatization; a rule-based stemmer such as Porter's does not produce ‘be’ from them.
The removal of stopwords consists of removing very common words like ‘the’, ‘that’, ‘of’, which carry little information and can decrease the model accuracy.
Vectorization consists of transforming every observation (review) in the dataset into a numerical representation. This phase is mandatory because every machine learning algorithm we would like to train needs numerical data as input, so the text has to be translated into numbers first.
Let’s take a look at a review before and after applying stemming and stopword removal:
Before stemming
After applying stemming the result looks much more ‘raw’, but it is still as understandable as the original version.
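A minimal sketch of stemming and stopword removal follows. It assumes NLTK's `PorterStemmer` as the stemmer (the article does not say which one was used) and a tiny hand-written stopword set that stands in for a full stopword collection; `preprocess` is a hypothetical helper name:

```python
from nltk.stem import PorterStemmer

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "that", "of", "a", "and", "was", "is", "it", "to"}

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    """Drop stopwords, then reduce each remaining word to its stem."""
    words = text.lower().split()
    kept = [w for w in words if w not in STOPWORDS]
    return " ".join(stemmer.stem(w) for w in kept)

# Different inflections of the same word collapse to one stem.
print([stemmer.stem(w) for w in ["reviews", "reviewed", "reviewing"]])

# An invented review, before and after preprocessing.
review = "the waiters were serving the burgers quickly"
print(review)
print(preprocess(review))
```

Stems are not always real words, which is why the processed text looks ‘raw’ while staying readable.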
Now it is possible to proceed with the text vectorization: the sklearn.feature_extraction.text.CountVectorizer class offers a tool that is very simple to use. It can be initialized with the max_features argument, which sets the maximum size of the dictionary (vocabulary) that will be created to represent the text. For example, after choosing 1500 as the number of features, the algorithm builds a dictionary from the 1500 words (features) with the highest frequency, so each review in the dataset is represented by a list of 1500 elements. Each element corresponds to one feature of the dictionary, and its value is the number of times that word occurs in the observation (review).
Let’s check this example:
For each observation (Doc 1, Doc 2, ... Doc n) a number represents the occurrences of this feature (word) in the observation (review).
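The idea can be tried on a toy corpus. The three documents below are invented, and `max_features=2` is used instead of 1500 only so that the whole matrix fits on screen:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented mini-corpus: one string per document (review).
docs = [
    "great burger great fries",
    "cold burger",
    "great place",
]

# max_features keeps only the most frequent words across the corpus.
vectorizer = CountVectorizer(max_features=2)
X = vectorizer.fit_transform(docs)   # sparse matrix: one row per document

# Columns follow the alphabetical order of the kept words.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Here ‘great’ (3 occurrences) and ‘burger’ (2 occurrences) are kept, and every document becomes a row of two counts; with the real dataset each row would have up to 1500 counts.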
Implementing this technique in Python takes only two lines of code. The dataset can then be split into training set and test set, and the Logistic Regression model trained with 10-fold cross-validation.
From the report it is possible to see that the accuracy is 88.5%, and the bias toward positive reviews is quite evident, as the model predicts positive reviews much more accurately than negative ones. This difference is even more evident in the confusion matrix.
Conclusions:
This model is not perfect, but it does its job.
As mentioned before, the bias toward positive reviews is quite big.
To improve this model there are some possible solutions, for example:
Increase the number of observations (golden rule)
Use a different algorithm, like Naïve Bayes, decision trees or a neural network such as an RNN, CNN or HAN
Use a different stemming technique
Use a different stopwords collection
After manually modifying some parameters, like class_weight, it is possible to slightly improve the score. This practice is certainly not the best one, but knowing that the model is biased toward positive reviews, a change in the class weights (decreasing the weight of positive reviews and increasing the weight of negative ones) can lead to a slightly higher accuracy.
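As a rough sketch of how such a comparison could be run, the snippet below vectorizes a toy corpus, splits it, and evaluates Logistic Regression with 10-fold cross-validation under three `class_weight` settings. The corpus is invented and deliberately imbalanced, and the weights `{0: 2.0, 1: 1.0}` are an arbitrary example, not the values used for the results reported above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Toy, deliberately imbalanced corpus (hypothetical data): 1 = positive, 0 = negative.
positive = ["great food", "lovely staff", "great place", "tasty burger"]
negative = ["cold food", "rude staff"]
texts = positive * 20 + negative * 10
labels = [1] * (len(positive) * 20) + [0] * (len(negative) * 10)

# Vectorize first, then split, following the order of the steps in the article.
X = CountVectorizer(max_features=1500).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

# None = no reweighting, "balanced" = weights inversely proportional to class
# frequency, dict = manual weights (here the negative class counts double).
for weights in (None, "balanced", {0: 2.0, 1: 1.0}):
    model = LogisticRegression(class_weight=weights, max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=10)  # 10-fold CV accuracy
    model.fit(X_train, y_train)
    cm = confusion_matrix(y_test, model.predict(X_test))      # rows: true, cols: predicted
    print(f"class_weight={weights}: mean CV accuracy {scores.mean():.3f}")
    print(cm)
```

On real reviews the interesting part is the first row of each confusion matrix, which shows how many negative reviews are still being predicted as positive.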
Would you like the notebook? Just tell me in the comments ⬇️⬇️.