A Complete Exploratory Data Analysis and Visualization for Text Data

cl = df.loc[df.polarity == -0.97500000000000009, ['Review Text']].sample(2).values
for c in cl:
    print(c[0])

Figure 3

It worked!

Univariate visualization with Plotly

Single-variable or univariate visualization is the simplest type of visualization: it consists of observations on only a single characteristic or attribute.

Univariate visualizations include histograms, bar plots, and line charts.
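All the .iplot calls in this section come from cufflinks, the Plotly bindings for pandas. As a minimal setup sketch, assuming cufflinks is installed:

import cufflinks as cf
from plotly.offline import init_notebook_mode

# Enable offline mode so DataFrame.iplot renders charts inside the notebook.
init_notebook_mode(connected=True)
cf.go_offline()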

The distribution of review sentiment polarity score

df['polarity'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')

Figure 4

The vast majority of the sentiment polarity scores are greater than zero, meaning most of the reviews are fairly positive.

The distribution of review ratings

df['Rating'].iplot(
    kind='hist',
    xTitle='rating',
    linecolor='black',
    yTitle='count',
    title='Review Rating Distribution')

Figure 5

The ratings are in line with the polarity scores; that is, most of the ratings are fairly high, in the 4 or 5 range.

The distribution of reviewers' age

df['Age'].iplot(
    kind='hist',
    bins=50,
    xTitle='age',
    linecolor='black',
    yTitle='count',
    title='Reviewers Age Distribution')

Figure 6

Most reviewers are in their 30s to 40s.

The distribution of review text lengths

df['review_len'].iplot(
    kind='hist',
    bins=100,
    xTitle='review length',
    linecolor='black',
    yTitle='count',
    title='Review Text Length Distribution')

Figure 7

The distribution of review word counts

df['word_count'].iplot(
    kind='hist',
    bins=100,
    xTitle='word count',
    linecolor='black',
    yTitle='count',
    title='Review Text Word Count Distribution')

Figure 8

Quite a number of people like to leave long reviews.

For categorical features, we simply use bar charts to present the frequencies.

The distribution of division

df.groupby('Division Name').count()['Clothing ID'].iplot(
    kind='bar',
    yTitle='Count',
    linecolor='black',
    opacity=0.8,
    title='Bar chart of Division Name',
    xTitle='Division Name')

Figure 9

The General division has the most reviews, and the Initmates division has the fewest.

The distribution of department

df.groupby('Department Name').count()['Clothing ID'].sort_values(ascending=False).iplot(
    kind='bar',
    yTitle='Count',
    linecolor='black',
    opacity=0.8,
    title='Bar chart of Department Name',
    xTitle='Department Name')

Figure 10

When it comes to departments, Tops has the most reviews and Trend has the fewest.

The distribution of class

df.groupby('Class Name').count()['Clothing ID'].sort_values(ascending=False).iplot(
    kind='bar',
    yTitle='Count',
    linecolor='black',
    opacity=0.8,
    title='Bar chart of Class Name',
    xTitle='Class Name')

Figure 11

Now we come to the “Review Text” feature. Before exploring it, we need to extract N-gram features.

N-grams describe the number of words used as observation points, e.g., a unigram is a single word, a bigram is a two-word phrase, and a trigram is a three-word phrase. To extract them, we use scikit-learn’s CountVectorizer function.

First, it would be interesting to compare unigrams before and after removing stop words.
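The top_*.py gists referenced below are not reproduced here; the following is a minimal sketch of the kind of helper they rely on, built on CountVectorizer (the helper name and signature are illustrative, not the article's):

from sklearn.feature_extraction.text import CountVectorizer

def top_n_ngrams(corpus, ngram_range=(1, 1), stop_words=None, n=20):
    # Count every n-gram in the corpus, then return the n most frequent.
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words).fit(corpus)
    bag_of_words = vec.transform(corpus)
    total_counts = bag_of_words.sum(axis=0)
    freqs = [(word, total_counts[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:n]

# For example, the top 20 unigrams after removing stop words:
top_n_ngrams(df['Review Text'].dropna(), stop_words='english', n=20)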

The distribution of top unigrams before removing stop words

top_unigram.py

Figure 12

The distribution of top unigrams after removing stop words

top_unigram_no_stopwords.py

Figure 13

Second, we want to compare bigrams before and after removing stop words.

The distribution of top bigrams before removing stop words

top_bigram.py

Figure 14

The distribution of top bigrams after removing stop words

top_bigram_no_stopwords.py

Figure 15

Last, we compare trigrams before and after removing stop words.

The distribution of top trigrams before removing stop words

top_trigram.py

Figure 16

The distribution of top trigrams after removing stop words

top_trigram_no_stopwords.py

Figure 17

Part-of-speech (POS) tagging is the process of assigning a part of speech, such as noun, verb, or adjective, to each word. We use the simple TextBlob API to dive into the POS tags of our “Review Text” feature and visualize these tags.
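POS.py is not reproduced here; a rough sketch of how such a tally can be computed with TextBlob (variable names are illustrative):

from textblob import TextBlob
import pandas as pd

# Tag every word in the review corpus; tagging the full corpus can take a while.
blob = TextBlob(' '.join(df['Review Text'].dropna()))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])

# Frequency of each part-of-speech tag.
pos_df['pos'].value_counts().head(20)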

The distribution of top part-of-speech tags of the review corpus

POS.py

Figure 18

Box plots are used to compare the sentiment polarity score, rating, and review text length of each department or division of the e-commerce store.
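The box plot gists are not shown inline; one way to build such a grouped box plot with Plotly directly (a sketch, not necessarily what department_polarity.py does):

import plotly.graph_objs as go
from plotly.offline import iplot

# One box trace per department, each holding that department's polarity scores.
traces = [
    go.Box(y=df.loc[df['Department Name'] == name, 'polarity'], name=name)
    for name in df['Department Name'].dropna().unique()
]
iplot(go.Figure(data=traces, layout=go.Layout(title='Sentiment Polarity by Department')))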

What do the departments tell us about sentiment polarity?

department_polarity.py

Figure 19

All of the six departments except Trend reached the highest sentiment polarity score, while the lowest sentiment polarity score appears in the Tops department. The Trend department also has the lowest median polarity score. If you remember, the Trend department has the fewest reviews, which explains why it does not have as wide a spread of scores as the other departments.

What do the departments tell us about rating?

rating_division.py

Figure 20

Except for the Trend department, all the other departments' median rating is 5. Overall, the ratings are high and the sentiment is positive in this review data set.

Review length by department

length_department.py

Figure 21

The median review lengths of the Tops and Intimate departments are relatively lower than those of the other departments.

Bivariate visualization with Plotly

Bivariate visualization is a type of visualization that involves two features at a time. It describes the association or relationship between the two features.

Distribution of sentiment polarity score by recommendation

polarity_recommendation.py

Figure 22

It is obvious that reviews with higher polarity scores are more likely to be recommended.
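polarity_recommendation.py is not shown; a sketch of one way to overlay the two distributions with Plotly, assuming the recommendation flag lives in the dataset's 'Recommended IND' column:

import plotly.graph_objs as go
from plotly.offline import iplot

# Overlaid histograms of polarity for recommended vs. not recommended reviews.
trace0 = go.Histogram(x=df.loc[df['Recommended IND'] == 1, 'polarity'],
                      name='Recommended', opacity=0.75)
trace1 = go.Histogram(x=df.loc[df['Recommended IND'] == 0, 'polarity'],
                      name='Not recommended', opacity=0.75)
iplot(go.Figure(data=[trace0, trace1],
                layout=go.Layout(barmode='overlay',
                                 title='Sentiment Polarity by Recommendation')))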

Distribution of ratings by recommendation

rating_recommendation.py

Figure 23

Recommended reviews have higher ratings than non-recommended ones.

Distribution of review lengths by recommendation

review_length_recommend.py

Figure 24

Recommended reviews tend to be lengthier than non-recommended reviews.

2D density jointplot of sentiment polarity vs. rating

sentiment_polarity_rating.py

Figure 24

2D density jointplot of age and sentiment polarity

age_polarity.py
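A minimal seaborn sketch of such a KDE jointplot (an assumption; age_polarity.py may build it differently):

import seaborn as sns
import matplotlib.pyplot as plt

# Kernel-density jointplot of reviewer age against sentiment polarity.
sns.jointplot(x='Age', y='polarity', data=df, kind='kde')
plt.show()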

Figure 25

Few people are very positive or very negative. People who give neutral to positive reviews are more likely to be in their 30s. Perhaps people of this age are more active reviewers.

Finding characteristic terms and their associations

Sometimes we want to analyze the words used by different categories and output some notable term associations. We will use the scattertext and spaCy libraries to accomplish this.

First, we need to turn the data frame into a Scattertext Corpus.

To look for differences by department, set the category_col parameter to 'Department Name', and analyze the reviews in the Review Text column by setting the text_col parameter.

Finally, pass a spaCy model to the nlp argument and call build() to construct the corpus.
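The snippets below assume scattertext is imported as st and a spaCy model is bound to nlp, e.g. (the model name is an assumption):

import scattertext as st
import spacy
from pprint import pprint

nlp = spacy.load('en_core_web_sm')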

Following are the terms that differentiate the review text from a general English corpus.

corpus = st.CorpusFromPandas(df,
                             category_col='Department Name',
                             text_col='Review Text',
                             nlp=nlp).build()
print(list(corpus.get_scaled_f_scores_vs_background().index[:10]))

Figure 26

Following are the terms in the review text that are most associated with the Tops department:

term_freq_df = corpus.get_term_freq_df()
term_freq_df['Tops Score'] = corpus.get_scaled_f_scores('Tops')
pprint(list(term_freq_df.sort_values(by='Tops Score', ascending=False).index[:10]))

Figure 27

Following are the terms that are most associated with the Dresses department:

term_freq_df['Dresses Score'] = corpus.get_scaled_f_scores('Dresses')
pprint(list(term_freq_df.sort_values(by='Dresses Score', ascending=False).index[:10]))

Figure 28

Topic Modeling Review Text

Finally, we want to apply a topic modeling algorithm to this data set, to see whether it provides any benefit and fits with what we are doing for our review text feature.

We will experiment with the Latent Semantic Analysis (LSA) technique of topic modeling.

We generate our document-term matrix from the review text as a matrix of TF-IDF features; that is, the LSA model replaces the raw counts in the document-term matrix with TF-IDF scores. We then perform dimensionality reduction on the document-term matrix using truncated SVD. Because the number of departments is 6, we set n_topics=6.

Taking the argmax of each review text in this topic matrix gives the predicted topic of each review text in the data.

We can then sort these into counts of each topic.
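A minimal sketch of these steps, using variable names consistent with the plotting code below (topic_model_LSA.py presumably contains the full version):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Document-term matrix of TF-IDF features from the review text.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
document_term_matrix = tfidf_vectorizer.fit_transform(df['Review Text'].astype(str))

# Reduce to 6 latent topics, one per department, with truncated SVD.
n_topics = 6
lsa_model = TruncatedSVD(n_components=n_topics)
lsa_topic_matrix = lsa_model.fit_transform(document_term_matrix)

# The argmax over each row gives the predicted topic of each review.
lsa_keys = lsa_topic_matrix.argmax(axis=1)
lsa_categories, lsa_counts = np.unique(lsa_keys, return_counts=True)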

To better understand each topic, we will find the three most frequent words in each topic.

topic_model_LSA.py

Figure 29

import matplotlib.pyplot as plt

# get_top_n_words is the helper defined in topic_model_LSA.py.
top_3_words = get_top_n_words(3, lsa_keys, document_term_matrix, tfidf_vectorizer)
labels = ['Topic {}: '.format(i) + top_3_words[i] for i in lsa_categories]

fig, ax = plt.subplots(figsize=(16, 8))
ax.bar(lsa_categories, lsa_counts)
ax.set_xticks(lsa_categories)
ax.set_xticklabels(labels)
ax.set_ylabel('Number of review text')
ax.set_title('LSA topic counts')
plt.show()

Figure 30

By looking at the most frequent words in each topic, we get a sense that we may not reach any degree of separation across the topic categories.

In other words, we could not separate the review text by department using topic modeling techniques.

Topic modeling techniques have a number of important limitations.

To begin, the term “topic” is somewhat ambiguous, and by now it is perhaps clear that topic models will not produce a highly nuanced classification of texts for our data.

In addition, we can observe that the vast majority of the review texts are categorized into the first topic (Topic 0).

The t-SNE visualization of LSA topic modeling won’t be pretty.
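For reference, a typical way to compute such a 2-D t-SNE embedding of the LSA topic matrix (the hyperparameters are illustrative):

from sklearn.manifold import TSNE

# Project the 6-dimensional LSA topic vectors down to 2-D for plotting.
tsne_lsa_model = TSNE(n_components=2, perplexity=50, learning_rate=100)
tsne_lsa_vectors = tsne_lsa_model.fit_transform(lsa_topic_matrix)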

All the code can be found in the Jupyter notebook.

The code plus the interactive visualizations can also be viewed on nbviewer.

Happy Monday!
