Comparing Classification Algorithms: Multinomial Naive Bayes vs. Logistic Regression
Christopher Bratkovics
Jun 13

Problem Statement
This article describes the process of gathering text data from reddit.com and evaluating the prediction accuracy of the Multinomial Naive Bayes and Logistic Regression classification algorithms.
The text data comes from posts in the PS4 and Xbox One subreddits, gathered with the Python Reddit API Wrapper (PRAW).
The main goal of this analysis is to compare the results of the Multinomial Naive Bayes model with those of the Logistic Regression model and to identify the pros and cons of each.
Success will be measured by the prediction accuracy of each binary classification model.
For more information on using the PRAW API, see https://praw.io/en/latest/

Phase 1: Data Gathering
Data for this analysis was gathered using PRAW, which controls access privileges to the Reddit API.
In order to use PRAW, all I needed to do was create a web app using my Reddit account and specify my client ID, client secret, and user agent within a Jupyter Notebook.
As the data was gathered, I appended each post's values to lists, collected those lists in a dictionary keyed by variable name, and converted each dictionary into a Pandas data frame: one for the PS4 subreddit and one for the Xbox One subreddit.
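
The gathering step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook used for the analysis: the credential placeholders, the use of the `new` listing, the `limit` value, and the exact submission fields collected are all assumptions. The stand-in posts at the bottom exist only so the helper can be run without Reddit credentials.

```python
from types import SimpleNamespace

# Submission attributes to collect. These names follow the fields PRAW exposes
# on a submission; which ones the original analysis pulled is an assumption.
FIELDS = ["selftext", "selftext_html", "score", "ups", "domain",
          "subreddit_subscribers"]


def fetch_posts(subreddit_name, limit=1000):
    """Fetch recent submissions from one subreddit (needs real credentials)."""
    import praw  # imported lazily so the rest of the sketch runs without PRAW

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder
        client_secret="YOUR_CLIENT_SECRET",  # placeholder
        user_agent="YOUR_USER_AGENT",        # placeholder
    )
    # Using the "new" listing is an assumption; hot() or top() work similarly.
    return list(reddit.subreddit(subreddit_name).new(limit=limit))


def posts_to_columns(posts):
    """Turn a list of posts into a dictionary of equal-length lists."""
    columns = {field: [] for field in FIELDS}
    columns["subreddit"] = []
    for post in posts:
        for field in FIELDS:
            columns[field].append(getattr(post, field, None))
        columns["subreddit"].append(str(post.subreddit))
    return columns


# Offline stand-ins for PRAW submissions, used only to exercise the helper.
fake_posts = [
    SimpleNamespace(selftext="Which controller is best?", selftext_html=None,
                    score=12, ups=12, domain="self.PS4",
                    subreddit_subscribers=100, subreddit="PS4"),
    SimpleNamespace(selftext="", selftext_html=None, score=3, ups=3,
                    domain="i.redd.it", subreddit_subscribers=100,
                    subreddit="PS4"),
]
ps4_columns = posts_to_columns(fake_posts)
# pandas.DataFrame(ps4_columns) would then give one row per post.
```

With real credentials, `posts_to_columns(fetch_posts("PS4"))` would produce the dictionary for one subreddit, and the same call with the Xbox One subreddit name would produce the other.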

Phase 2: Data Cleaning & EDA
Verification of the gathered data began with a simple check for non-existent, or null, values within the newly created data frames.
None of the data showed null values except for each post’s text and html text data.
This was likely due to text data not existing in many posts that could have contained images, GIFs, etc.
Since fewer than 20% of the rows in each data frame lacked text, I determined that dropping the rows with missing text values was appropriate.
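
A minimal sketch of the null check and row drop is shown here. The toy data frame and its column names are made up for illustration; only the pattern (measure the share of missing values, then drop rows with missing text) reflects the step described above.

```python
import pandas as pd

# Toy frame standing in for one subreddit's posts (values are made up).
posts = pd.DataFrame({
    "text": ["Best PS4 exclusives?", None, "Controller drift fix",
             "Sale today", "Trophy hunting tips", "Headset advice"],
    "score": [10, 4, 7, 2, 1, 5],
})

# Share of missing values per column, to judge whether dropping rows is safe.
null_share = posts.isnull().mean()
print(null_share)

# Drop only the rows whose text is missing, then renumber the index.
cleaned = posts.dropna(subset=["text"]).reset_index(drop=True)
```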
The next step was to remove redundant features and features that didn’t provide meaningful context with respect to my analysis.
The only feature that didn’t provide meaningful context was the number of subscribers to each subreddit, which I took note of for further potential analyses, then removed from each data frame.
The redundant features removed were html text (redundant with text), ups (redundant with score), and domain name (redundant with subreddit name).
The text, score, and subreddit name variables were kept because of their easier interpretability and usability.
I then concatenated the PS4 data frame with the Xbox One data frame, shuffled the rows, and reset the new data frame’s index so that the original ordering of the posts would not bias later steps.
The last step was to engineer a new binary feature, ‘is_ps4’, from the subreddit name associated with each post: posts from the Xbox One subreddit are represented as zero and posts from the PS4 subreddit as one.
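
The combine, shuffle, and feature engineering steps might look like the sketch below. The toy rows, the subreddit label strings `"PS4"` and `"xboxone"`, and the `random_state` value are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-ins for the two cleaned data frames (values are made up).
ps4 = pd.DataFrame({"text": ["ps4 post one", "ps4 post two"],
                    "subreddit": ["PS4", "PS4"]})
xbox = pd.DataFrame({"text": ["xbox post one", "xbox post two"],
                     "subreddit": ["xboxone", "xboxone"]})

combined = (
    pd.concat([ps4, xbox])              # stack the two frames
    .sample(frac=1, random_state=42)    # shuffle every row
    .reset_index(drop=True)             # renumber rows 0..n-1
)

# Binary target: 1 for PS4 posts, 0 for Xbox One posts.
combined["is_ps4"] = (combined["subreddit"] == "PS4").astype(int)
```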

Phase 3: Natural Language Processing
I used CountVectorizer as the primary natural language processing tool to analyze the frequency of significant terms within each subreddit post.
Before applying CountVectorizer to my text data, I generated a baseline: the percentage of rows gathered from the PS4 subreddit (46.4%) versus the Xbox One subreddit (the remainder, roughly 53.6%).
I also applied a train-test split to divide the text data and the target variable (the binary representation of the subreddit) into a set of training data and a set of holdout data.
Holdout data is a portion of the original data set that is withheld from training; the model’s predictions on it are compared with the true labels in order to determine model accuracy.
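
As a rough sketch of the baseline and the split, assuming a data frame with `text` and `is_ps4` columns: the toy rows, the 25% holdout size, the `random_state`, and the use of `stratify` are all assumptions, since the article does not state the settings used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the combined, shuffled posts (values are made up).
posts = pd.DataFrame({
    "text": ["ps4 exclusive", "xbox game pass", "ps4 controller",
             "xbox controller", "ps4 pro", "xbox live", "ps4 update",
             "xbox update"],
    "is_ps4": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Baseline: the share of each class. Always guessing the larger class
# would score the larger of these two shares.
baseline = posts["is_ps4"].value_counts(normalize=True)
print(baseline)

# Hold out a quarter of the rows, keeping the class balance in both parts.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    posts["text"], posts["is_ps4"],
    test_size=0.25, random_state=42, stratify=posts["is_ps4"],
)
```

Any model worth keeping should beat the baseline share of the majority class on the holdout rows.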
CountVectorizer was then used to transform the training and holdout text data into two document-term matrices, with one column per term in the text data and one row per post, where each cell holds the number of times that term appears in that post.
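
The vectorizing step can be illustrated with a small sketch. The example sentences are made up, and the default CountVectorizer settings are an assumption. The key point is that the vocabulary is learned from the training text only, and the holdout text is mapped onto that same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up posts standing in for the training and holdout text.
train_text = ["ps4 exclusive game", "xbox game pass", "new ps4 controller"]
holdout_text = ["xbox controller", "ps4 game sale"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training text and count term occurrences.
X_train_counts = vectorizer.fit_transform(train_text)

# Reuse the training vocabulary; unseen terms such as "sale" are ignored.
X_holdout_counts = vectorizer.transform(holdout_text)

terms = sorted(vectorizer.vocabulary_)
print(terms)
print(X_train_counts.toarray())
```

Both matrices share the same columns, which is what allows a model fit on the training counts to score the holdout counts.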

Phase 4: Data Modeling & Model Evaluation
Both Logistic Regression and Multinomial Naive Bayes models were instantiated, fit to the training data, and optimized for hyperparameters using a pipeline for each model.
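
One common way to tune a pipeline's hyperparameters in scikit-learn is a cross-validated grid search, sketched below. The article does not say which search method or parameter grids were used, so `GridSearchCV`, the grids, the fold count, and the toy posts here are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training posts and labels (1 = PS4, 0 = Xbox One).
texts = [
    "ps4 exclusive game announced", "xbox game pass adds titles",
    "ps4 controller battery", "xbox controller drift",
    "ps4 pro worth it", "xbox series upgrade",
    "playstation plus free games", "xbox live gold sale",
    "ps4 update released", "xbox update released",
    "playstation store sale", "xbox store sale",
]
labels = [1, 0] * 6


def tune(model, param_grid):
    """Wrap the vectorizer and a model in a pipeline, then grid search it."""
    pipe = Pipeline([("cvec", CountVectorizer()), ("model", model)])
    search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
    search.fit(texts, labels)
    return search


# Hypothetical grids: step-name prefixes route each value to its step.
nb_search = tune(
    MultinomialNB(),
    {"cvec__ngram_range": [(1, 1), (1, 2)], "model__alpha": [0.1, 1.0]},
)
lr_search = tune(
    LogisticRegression(max_iter=1000),
    {"cvec__ngram_range": [(1, 1), (1, 2)], "model__C": [0.1, 1.0]},
)

print("Naive Bayes:", nb_search.best_params_, nb_search.best_score_)
print("Logistic Regression:", lr_search.best_params_, lr_search.best_score_)
```

Putting the vectorizer inside the pipeline means each cross-validation fold learns its vocabulary from its own training portion only, which avoids leaking holdout terms into the fit.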
The two models, Multinomial Naive Bayes and Logistic Regression, were evaluated and compared using confusion matrices computed before and after optimization, shown below.
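
For readers unfamiliar with confusion matrices, this short sketch shows how the four counts and the classification accuracy relate. The labels and predictions are made up and are unrelated to the results reported in this article.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions: 1 = PS4 post, 0 = Xbox One post.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy:.2%}")
```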

Logistic Regression Results:
[Figure: Original Logistic Regression Confusion Matrix]
[Figure: Optimized Logistic Regression Confusion Matrix]

Logistic Regression Findings:
True negative and false negative predictions increased post-optimization.
True positive and false positive predictions decreased post-optimization.
Original Classification Accuracy = 78.59%
Optimized Classification Accuracy = 77.78%

Multinomial Naive Bayes Results:
[Figure: Original Multinomial Naive Bayes Confusion Matrix]
[Figure: Optimized Multinomial Naive Bayes Confusion Matrix]

Multinomial Naive Bayes Findings:
True positive and true negative predictions increased post-optimization.
False positive and false negative predictions decreased post-optimization.
Original Classification Accuracy = 78.32%
Optimized Classification Accuracy = 80.49%

Phase 5: Conclusions
One evident conclusion is that the Multinomial Naive Bayes model showed consistently better performance relative to the Logistic Regression model on unseen data.
Although the Logistic Regression model performed better on seen data, the more important factor is a model’s performance on data it is unfamiliar with.
This suggests that the Logistic Regression model was more sensitive to idiosyncrasies of the training data than the Multinomial Naive Bayes model.
However, both models should be improved, as they both show evidence of being overfit to the data.
A model shows evidence of overfitting when it performs significantly better on the data used to train it than on the data used to validate its prediction accuracy.
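
A simple way to look for this kind of overfitting is to compare accuracy on the training data with accuracy on the holdout data, as in the sketch below. The toy posts, the choice of model, and the helper name `accuracy_gap` are hypothetical; how large a gap counts as significant is a judgment call that the article does not quantify.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def accuracy_gap(model, X_train, y_train, X_holdout, y_holdout):
    """Return training accuracy, holdout accuracy, and their difference."""
    train_acc = model.score(X_train, y_train)
    holdout_acc = model.score(X_holdout, y_holdout)
    return train_acc, holdout_acc, train_acc - holdout_acc


# Made-up posts and labels (1 = PS4, 0 = Xbox One).
train_text = ["ps4 exclusive game", "xbox game pass", "ps4 controller",
              "xbox controller drift", "playstation plus", "xbox live gold"]
train_labels = [1, 0, 1, 0, 1, 0]
holdout_text = ["ps4 sale", "xbox sale", "new game announced"]
holdout_labels = [1, 0, 1]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_text, train_labels)

train_acc, holdout_acc, gap = accuracy_gap(
    model, train_text, train_labels, holdout_text, holdout_labels
)
print(f"train={train_acc:.2f} holdout={holdout_acc:.2f} gap={gap:.2f}")
```

A large positive gap points toward overfitting, while similar scores on both sets suggest the model generalizes about as well as it fits.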