Introducing TextBlob
A Python library for processing textual data, NLP framework, sentiment analysis
Susan Li
Jan 9

As an NLP library for Python, TextBlob has been around for a while. After hearing many good things about it, such as part-of-speech tagging and sentiment analysis, I decided to give it a try. This is the first time I am using TextBlob to perform natural language processing tasks.
The Yelp dataset is a subset of Yelp's businesses, reviews, and user data for use in personal, educational, and academic purposes.
It is available as JSON files and can be downloaded from here. We will only use yelp_academic_dataset_review.json and yelp_academic_dataset_user.json.

The Data

The data sets are in JSON format. To read them into a pandas DataFrame, we first load the JSON data, then normalize the semi-structured JSON into a flat table, then use to_parquet to write the table to the binary parquet format.
Later, when we need it, we load the parquet object from the file path, which returns a pandas DataFrame.
The following process gives us two data tables, user and review.
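
A minimal sketch of such a JSON-to-parquet step, assuming each file holds one JSON object per line and that a parquet engine such as pyarrow is installed. The helper names (json_lines_to_frame, json_to_parquet) and the demo records are made up for illustration and are not taken from the original notebook.

```python
import json
from pathlib import Path

import pandas as pd


def json_lines_to_frame(lines):
    """Parse an iterable of JSON strings and flatten nested fields into columns."""
    records = [json.loads(line) for line in lines if line.strip()]
    return pd.json_normalize(records)


def json_to_parquet(json_path, parquet_path):
    """Convert a JSON-lines file to parquet (needs pyarrow or fastparquet)."""
    with Path(json_path).open(encoding="utf-8") as f:
        df = json_lines_to_frame(f)
    df.to_parquet(parquet_path)
    return df


# Tiny in-memory stand-in for a few lines of a review file (made-up values).
demo_lines = [
    '{"review_id": "r1", "user_id": "u1", "stars": 5, "text": "Great food"}',
    '{"review_id": "r2", "user_id": "u2", "stars": 1, "text": "Terrible service"}',
]
demo = json_lines_to_frame(demo_lines)
print(demo.shape)
```

With the real files, a call such as json_to_parquet('yelp_academic_dataset_review.json', 'review.parquet') would write the table once, and pd.read_parquet('review.parquet') would load it back later; the parquet file name here is only a placeholder.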
user table
review table

We merge the user table and the review table, use a suffix to distinguish columns that share the same name, and remove reviews with zero stars.
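
To see what the suffix does, here is a small sketch on made-up rows. Only the idea (a left merge on user_id, an empty suffix for the review side and '_user' for the user side, then dropping zero-star rows) follows the text; the column values and the shared 'useful' column are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; both tables share a 'useful' column to trigger the suffix.
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u1"],
    "stars": [5, 0, 1],
    "useful": [2, 0, 1],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "useful": [10, 3],
})

# Review columns keep their names; clashing user columns get '_user'.
merged = (review
          .merge(user, on="user_id", how="left", suffixes=("", "_user"))
          .drop("user_id", axis=1))

# Drop zero-star rows.
merged = merged[merged.stars > 0]
print(merged)
```

The review-side 'useful' column keeps its name, while the user-side one becomes 'useful_user', so no information is silently overwritten.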
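
user_review = (review
               .merge(user, on='user_id', how='left', suffixes=['', '_user'])
               .drop('user_id', axis=1))
user_review = user_review[user_review.stars > 0]

Star Rating Distribution

import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the counts come from value_counts() and `ax` from a seaborn
# bar plot; those parts of the snippet are missing from the text.
x = user_review['stars'].value_counts().sort_index()
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=x.index, y=x.values, alpha=0.8)
plt.title("Star Rating Distribution")
plt.xlabel('Star Ratings')
rects = ax.patches
labels = x.values
# Write each bar's count just above the bar
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show();

Figure 1

It is good to know that most of the review star ratings are pretty high, and that there are not many terrible reviews.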
Obviously, there is an incentive for businesses to solicit as many good reviews as possible.
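
Reviews Per Year vs. Star Ratings Per Year

fig, axes = plt.subplots(ncols=2, figsize=(14, 4))
# Assumption: the left panel counts reviews per 'year'; the middle of this line is missing from the text.
user_review.year.value_counts().sort_index().plot.bar(title='Reviews per Year', ax=axes[0]);
sns.lineplot(x='year', y='stars', data=user_review, ax=axes[1])
axes[1].set_title('Stars per year');

Figure 2

user_review.[…].value_counts()

Figure 3

Yelp was founded in 2004, and more than 4,000 people in our data have been Yelp members since then.

Let's have a look at a sample review:

# Assumption: one review text is picked at random; the selector before .iloc is missing from the text.
review_sample = user_review.text.sample(1).iloc[0]
print(review_sample)

Let's check the polarity of this sample review.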
Polarity ranges from -1 (most negative) to 1 (most positive).

from textblob import TextBlob

TextBlob(review_sample).sentiment

The above review has a polarity of about -0.06, meaning it is slightly negative, and a subjectivity of about 0.56, meaning it is fairly subjective.
To proceed faster, we will sample 1 million reviews from our current data, and add a new column for polarity.

sample_reviews = user_review[['stars', 'text']].sample(1000000)

def detect_polarity(text):
    return TextBlob(text).polarity

sample_reviews['polarity'] = sample_reviews.text.apply(detect_polarity)
sample_reviews.head()

Figure 4

The first several rows look good: stars and polarity are in line with each other, meaning the higher the star rating, the higher the polarity, as it should be.

Distribution of Polarity

num_bins = 50
plt.figure(figsize=(10,6))
# The alpha value is truncated in the text; 0.5 is an assumed value
n, bins, patches = plt.hist(sample_reviews.polarity, num_bins, facecolor='blue', alpha=0.5)
plt.title('Histogram of polarity')
plt.show();

Figure 5

Most polarity scores are above zero, meaning most of the reviews in the data carry positive sentiment. This is in line with the star rating distribution we discovered earlier.

Polarity Grouped by Stars

plt.figure(figsize=(10,6))  # figure size assumed; this call is truncated in the text
sns.boxenplot(x='stars', y='polarity', data=sample_reviews)
plt.show();

Figure 6

In general, this is as good as we’d expect.
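
As a numeric companion to the plot, one could also summarize polarity per star rating. The sketch below uses a tiny made-up frame with the same two column names as sample_reviews; the values are invented and say nothing about the real Yelp data.

```python
import pandas as pd

# Made-up stand-in for sample_reviews with the same two columns.
toy = pd.DataFrame({
    "stars": [1, 1, 3, 3, 5, 5],
    "polarity": [-0.6, -0.2, 0.0, 0.2, 0.4, 0.8],
})

# Mean, median and number of reviews for each star rating.
summary = toy.groupby("stars")["polarity"].agg(["mean", "median", "count"])
print(summary)
```

If polarity tracks the ratings, the mean column should increase from 1 star to 5 stars, which is the same pattern the boxen plot shows visually.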
Let’s dig deeper and see whether we can find anything interesting or any outliers.

Reviews that have the lowest polarity:

sample_reviews[sample_reviews.polarity == -1].head()

Reviews that have the lowest star ratings:

sample_reviews[sample_reviews.stars == 1].head()

They all look like what we would expect negative reviews to be.

Reviews that have the lowest polarity (most negative sentiment) but a 5-star rating:

sample_reviews[(sample_reviews.stars == 5) & (sample_reviews.polarity == -1)].head(10)

Figure 7

Reviews that have the highest polarity (most positive sentiment) but a 1-star rating:

sample_reviews[(sample_reviews.stars == 1) & (sample_reviews.polarity == 1)].head(10)

Figure 8

Both tables look weird.
Apparently, some polarity scores do not agree with their associated ratings.
Why is that?

After digging a bit more, it turns out that TextBlob goes through the text finding words and phrases it can assign polarity and subjectivity to, and averages them all together for longer text, such as our Yelp reviews.
If you want to learn how TextBlob calculates polarity and subjectivity, this article from Aaron Schumacher gives a simple yet clear explanation.
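
The averaging idea can be illustrated with a deliberately simplified sketch. This is not TextBlob's implementation: the lexicon and its scores are made up, and the function only averages the scores of the words it recognizes.

```python
from statistics import mean

# Hypothetical word scores in [-1, 1]; NOT TextBlob's real lexicon values.
TOY_LEXICON = {
    "great": 0.8,
    "delicious": 1.0,
    "friendly": 0.5,
    "terrible": -1.0,
    "slow": -0.3,
}


def toy_polarity(text):
    """Average the scores of the words found in the toy lexicon.

    Words that are not in the lexicon are ignored; a text with no scored
    words gets a neutral 0.0.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(scores) if scores else 0.0


short_review = "Terrible!"
long_review = "Terrible parking, but great staff, delicious food and friendly owners."

print(toy_polarity(short_review))
print(toy_polarity(long_review))
```

In this toy setting, a text with a single scored word takes that word's score outright, while a longer text is pulled toward the middle by everything else it contains. A word-level average of this kind knows nothing about what the reviewer meant overall, which is one way a score can end up disagreeing with the star rating.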
I enjoyed learning and playing with TextBlob.
I realized that TextBlob can be used to accomplish many other NLP tasks, such as part-of-speech tagging, noun phrase extraction, classification, translation, and more, and we will get our hands dirty with them later.
The Jupyter notebook can be found on GitHub.
Enjoy the rest of the week!

References:
TextBlob documentation
Book: Hands-on Machine Learning for Algorithmic Trading