Review Rating Prediction: A Combined Approach

I will try to prove that combining previously known data about each user’s similarity to other users with sentiment analysis of the review text itself will help us improve the model’s prediction of what rating the user’s review will get.

Source: pixabay

The workflow

As a first step, I will perform the review rating prediction (RRP) based on review text content (RTC) analysis. The next step will be to apply a neighbors analysis to perform RRP based on the similarity between users. The final step will be to compare the three methods (RRP based on RTC, RRP based on neighbors analysis, and the combination of the two) and to check the hypothesis.

Preprocessing

Preprocessing is a key step in any analysis, and this project is no exception. The head of the primary table is as follows:

[Figure: the head of the primary table]

First, I deleted rows with no review text, duplicate rows, and extra columns that will not be used. The second step was to create a column containing the ratio of the helpful numerator to the helpful denominator, and then to segment these values into bins. It looked like this:

    import ast
    import numpy as np
    import pandas as pd

    # Drop rows without review text, duplicate reviews, and the unused column
    reviews_df = reviews_df[~pd.isnull(reviews_df['reviewText'])]
    reviews_df.drop_duplicates(subset=['reviewerID', 'asin', 'unixReviewTime'], inplace=True)
    reviews_df.drop('Unnamed: 0', axis=1, inplace=True)
    reviews_df.reset_index(inplace=True)

    # 'helpful' holds strings such as "[2, 3]"; literal_eval parses them more safely than eval
    reviews_df['helpful_numerator'] = reviews_df['helpful'].apply(lambda x: ast.literal_eval(x)[0])
    reviews_df['helpful_denominator'] = reviews_df['helpful'].apply(lambda x: ast.literal_eval(x)[1])

    # Helpfulness ratio; -1 marks reviews that received no votes at all
    reviews_df['helpful%'] = np.where(reviews_df['helpful_denominator'] > 0, reviews_df['helpful_numerator'] / reviews_df['helpful_denominator'], -1)

    # Bin the ratio; note that a ratio of exactly 0 also falls into the first bin, 'empty'
    reviews_df['helpfulness_range'] = pd.cut(x=reviews_df['helpful%'], bins=[-1, 0, 0.2, 0.4, 0.6, 0.8, 1.0], labels=['empty', '1', '2', '3', '4', '5'], include_lowest=True)

The last step was to create a text processor that extracts the meaningful words from the messy review text.

    import string
    from nltk.corpus import stopwords

    def text_process(reviewText):
        # Drop punctuation characters, then rejoin the text and lowercase it
        nopunc = [i for i in reviewText if i not in string.punctuation]
        nopunc_text = ''.join(nopunc).lower()
        # Keep only the words that are not English stop words
        return [i for i in nopunc_text.split() if i not in stopwords.words('english')]

After being applied, this had:

1. Removed punctuation
2. Converted to lowercase
3. Removed stop words (non-relevant words in the context of training the model)

A look at the data

The head of the primary table, after all the preprocessing, looks like this:

The figures below show how the users’ helpfulness range is distributed over the product ratings:

[Figure: heatmap]

[Figure: bar plot]

One can easily see the bias towards the higher ratings. This phenomenon is well known, and it is also supported by the same survey from above. According to that survey:

“Reviews are increasingly shifting from being a place where consumers air their grievances to being a place to recommend items after a positive experience”.

Later on, I will explain how the problem of the skewed data was solved (resampling methods).

Step one: RRP based on Review Text Content

The Models

In order to check and choose the best model, I constructed a pipeline that performs the following steps. The pipeline first performs TF-IDF term weighting and vectorizing, and then runs the classification algorithm. In general, TF-IDF processes the text using my “text_process” function from above and converts the processed text to a count vector. Afterwards, it applies a calculation that assigns a higher weight to words of more importance.

    from sklearn import metrics
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Note: scikit-learn documents ngram_range as applying only when analyzer is not a callable
    pipeline = Pipeline([
        ('Tf-Idf', TfidfVectorizer(ngram_range=(1,2), analyzer=text_process)),
        ('classifier', MultinomialNB())
    ])

    X = reviews_df['reviewText']
    y = reviews_df['helpfulness_range']

    # Hold out half of the reviews for testing, fit on the rest, then evaluate
    review_train, review_test, label_train, label_test = train_test_split(X, y, test_size=0.5)
    pipeline.fit(review_train, label_train)
    pip_pred = pipeline.predict(review_test)
    print(metrics.classification_report(label_test, pip_pred))

Note that I chose ngram_range = (1, 2) and that the algorithm was Multinomial Naïve Bayes.
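
To make the TF-IDF weighting step more concrete, here is a minimal hand-rolled sketch in plain Python. It is not the code used in this project: the `tfidf` helper and the three toy reviews are hypothetical, and the formula (smoothed idf followed by L2 normalization) is my assumption of what scikit-learn's default settings compute. The point is only to show why a word that appears in every review ends up with a lower weight than a word that is specific to one review.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by
    L2 normalization of each document vector.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        # Raw weight = term count in this document * smoothed idf
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the vector to unit length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


# Hypothetical reviews, already tokenized the way text_process would return them
toy_reviews = [
    ["great", "phone", "great", "battery"],
    ["phone", "stopped", "working"],
    ["battery", "lasts", "long", "phone"],
]

weights = tfidf(toy_reviews)
for tokens, doc_weights in zip(toy_reviews, weights):
    # Print terms from the highest weight to the lowest
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```

In this toy corpus, "phone" occurs in all three reviews, so within the second review it is weighted lower than "stopped" or "working", which occur only there. This is the sense in which TF-IDF assigns a higher weight to words of more importance.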
