NLP Kaggle Competition

Introductory Notebook and Exploratory Data Analysis

Tara Boyle · Feb 4

The Quora Insincere Questions Classification competition is a natural language processing task where the goal is to predict if a question's intent is sincere.

Quora is a service that helps people learn from each other.

On Quora, people ask and answer questions — and a key challenge in providing this type of service is filtering out insincere questions.

Quora is attempting to filter out toxic and divisive content to uphold their policy of "Be Nice, Be Respectful."

What is an insincere question?

An insincere question is defined as a question intended to make a statement rather than look for helpful answers.

According to the Kaggle competition description, characteristics of an insincere question include:

Has a non-neutral tone: Has an exaggerated tone to underscore a point about a group of people.

Is rhetorical and meant to imply a statement about a group of people.

Is disparaging or inflammatory: Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype.

Makes disparaging attacks/insults against a specific person or group of people.

Is based on an outlandish premise about a group of people.

Disparages a characteristic that is not fixable and not measurable.

Isn't grounded in reality: Is based on false information, or contains absurd assumptions.

Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers.

Basically, an insincere question is any question whose intent is to anger or offend, and that is not asked for the purpose of gaining information.

Some examples of insincere questions include:

Why do Chinese hate Donald Trump?

Do Americans that travel to Iran have a mental illness?

We can clearly see these questions are intended to inflame and not to gain information, and need to be excluded from Quora's platform.

Expected Difficulties

Large Dataset

The training data has over one million rows.

I expect there will be challenges in dealing with the large dataset.

Challenges may include running into memory errors and excessive processing times.

To combat the large size of the dataset, there are several techniques to try, including using smaller samples of the data for training and dimensionality reduction.
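As a rough illustration, one way to shrink the problem is to train on a random sample of rows and to compress the sparse text features with truncated SVD (a minimal sketch; the file name, the question_text column, the 10% sample fraction, and the component count are assumptions, not the competition kernel's exact settings):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Work with a 10% random sample of the training data to keep memory in check.
train = pd.read_csv("train.csv").sample(frac=0.1, random_state=42)

# Reduce the dimensionality of the sparse TF-IDF features with truncated SVD.
tfidf = TfidfVectorizer(max_features=20000).fit_transform(train["question_text"])
reduced = TruncatedSVD(n_components=100, random_state=42).fit_transform(tfidf)
print(reduced.shape)  # (n_samples, 100)
```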

I anticipate feature selection and engineering as well as model optimization will be important.

Imbalanced Dataset

The dataset is highly imbalanced, with only 6% of samples belonging to the target (insincere) class.

I anticipate this will cause challenges with recall.

Maximizing recall, or true positive rate, could be a difficulty here due to the small number of insincere samples.

I anticipate resampling techniques and data augmentation could improve model performance.

The Importance of Learning from Others

As Will Koehrsen so beautifully put it in his Home Credit Default Risk article: "Data scientists stand not on the shoulders of giants, but on the backs of thousands of individuals who have made their work public for the benefit of all."

He calls Kaggle competitions “collaborative projects” which is so true.

The Kaggle community is incredibly supportive and is a great place to not only learn new techniques and skills, but also to challenge yourself to improve.

Exploratory Analysis

This first notebook is designed to get familiar with the problem at hand and devise a strategy for moving forward.

A great place to begin is to visualize the breakdown of our target.

Distribution of Questions

From the plot above we can see we have a class imbalance problem.
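A plot along these lines can be produced directly from the training data (a minimal sketch, assuming the Kaggle train.csv with a target column where 1 marks insincere questions):

```python
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Bar plot of the number of questions in each class (0 = sincere, 1 = insincere).
train["target"].value_counts().plot(kind="bar")
plt.xticks([0, 1], ["Sincere", "Insincere"], rotation=0)
plt.ylabel("Number of questions")
plt.title("Distribution of Questions")
plt.show()
```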

Comparison of Number of Tokens per Question

The plot above shows a significant difference in the number of tokens in each class.
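A minimal sketch of how the token counts and the t-test might be computed (the kernel's exact tokenization and test settings are not shown here, and the sign of the t-value depends on the order of the two groups):

```python
import pandas as pd
from scipy import stats

train = pd.read_csv("train.csv")

# Rough token count: split each question on whitespace.
train["n_tokens"] = train["question_text"].str.split().str.len()

sincere = train.loc[train["target"] == 0, "n_tokens"]
insincere = train.loc[train["target"] == 1, "n_tokens"]

# Two-sample t-test on the token counts of the two classes.
t_value, p_value = stats.ttest_ind(sincere, insincere, equal_var=False)
print(f"T-Value: {t_value:.2f}, P-Value: {p_value:.3g}")
```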

This is confirmed by completing a t-test:

T-Value: -106.72
P-Value: 0

Number of Sentences per Question (plotted on log scale)

The difference in number of sentences per question is also confirmed as significant through a t-test:

T-Value: -56.09
P-Value: 0

Another interesting thing to look at is the most common words appearing in sincere and insincere questions.
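One simple way to get these counts (a minimal sketch; the kernel likely also removes stopwords, which is omitted here):

```python
from collections import Counter
import pandas as pd

train = pd.read_csv("train.csv")

def top_words(texts, n=20):
    """Return the n most frequent whitespace-separated tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(n)

print(top_words(train.loc[train["target"] == 1, "question_text"]))  # insincere
print(top_words(train.loc[train["target"] == 0, "question_text"]))  # sincere
```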

Insincere Questions Most Common Words

Sincere Questions Most Common Words

Class Imbalance

As we saw above, we have a class imbalance problem.

Imbalanced classes are a common problem in machine learning classification where there is a disproportionate ratio of observations in each class.

(In this post I explore methods for dealing with class imbalance.)

With just 6.6% of our dataset belonging to the target class, we definitely have an imbalanced dataset! This is a problem because many machine learning models are designed to maximize overall accuracy, which may not be the best metric to use with imbalanced classes.

Classification accuracy is defined as the number of correct predictions divided by total predictions times 100.

For example, if we simply predicted that all questions are sincere, we would get a classification accuracy score of 93%!

Competition Metric

Before moving on to creating baseline models, it's important to understand our competition metric.

This competition uses the F1 score which balances precision and recall.

Precision is the number of true positives divided by all positive predictions.

Precision is also called Positive Predictive Value.

It is a measure of a classifier’s exactness.

Low precision indicates a high number of false positives.

Recall is the number of true positives divided by the number of positive values in the test data.

Recall is also called Sensitivity or the True Positive Rate.

It is a measure of a classifier’s completeness.

Low recall indicates a high number of false negatives.
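A toy example with scikit-learn makes these definitions concrete (illustrative values only, not competition results):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 3 insincere (1) and 5 sincere (0) questions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # one missed insincere, one false alarm

print(precision_score(y_true, y_pred))  # 2 TP / (2 TP + 1 FP) = 0.67
print(recall_score(y_true, y_pred))     # 2 TP / (2 TP + 1 FN) = 0.67
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.67
```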

Data Preparation

Because we have an imbalanced dataset, we will downsample the majority class to equal the size of the minority class.

This will not only balance our dataset, but will decrease processing time due to the decreased number of samples in the training data.
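A minimal sketch of this downsampling step, assuming the standard train.csv with a target column where 1 marks insincere questions:

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Keep every insincere question and an equally sized random sample of sincere ones.
insincere = train[train["target"] == 1]
sincere = train[train["target"] == 0].sample(n=len(insincere), random_state=42)

# Shuffle the combined, now balanced, training set.
balanced = pd.concat([insincere, sincere]).sample(frac=1, random_state=42)
print(balanced["target"].value_counts())
```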

Before modeling, we apply some basic text preprocessing using Gensim.

Gensim is a great library for NLP: it's super fast and provides tools for text cleaning and n-gram generation, both of which we use in this baseline modeling kernel.
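As a rough sketch of the kind of Gensim preprocessing meant here (the kernel's exact filters and phrase settings may differ):

```python
from gensim.parsing.preprocessing import preprocess_string
from gensim.models.phrases import Phrases, Phraser

questions = [
    "Why do people ask insincere questions on Quora?",
    "How do I start learning machine learning from scratch?",
]

# Cleaning: Gensim's default filters lowercase, strip punctuation and numbers,
# remove stopwords and short tokens, and stem each word.
tokens = [preprocess_string(q) for q in questions]

# N-gram generation: detect frequent bigrams and join them into single tokens.
bigrams = Phraser(Phrases(tokens, min_count=1, threshold=1))
tokens = [bigrams[doc] for doc in tokens]
print(tokens)
```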

Baseline Models

For our baseline models we will try the following (a minimal sketch of how these might be assembled appears after the list):

Logistic Regression: A linear model for classification.

They are fast to train and predict, scale well, and are easy to interpret, making them a good choice for a baseline model.

Naive Bayes: These classifiers are super fast to train and work well with high-dimensional sparse data, including text.

They are based on applying Bayes’ Theorem and are ‘naive’ in that they assume independence between features.

Scikit-Learn implements several types of Naive Bayes classifiers that are widely used for text data including Bernoulli (which we use here) and Multinomial.

XGBoost (Extreme Gradient Boosting): An implementation of gradient boosted decision trees designed for speed and performance.

As such, it often outperforms other algorithms — and seems to be a very popular choice in Kaggle competitions.

Ensemble Model: Scikit-learn states that "the goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator." We will combine our above three models using scikit-learn's voting classifier.
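A minimal sketch of how these four baselines might be wired together on TF-IDF features (the vectorizer settings and model parameters are placeholders, and in practice the downsampled, Gensim-preprocessed text from the Data Preparation step would be used instead of the raw file):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

# Stand-in for the balanced training frame built in the Data Preparation step.
data = pd.read_csv("train.csv")

X_train, X_val, y_train, y_val = train_test_split(
    data["question_text"], data["target"],
    test_size=0.2, random_state=42, stratify=data["target"])

# Fit TF-IDF once and reuse the sparse features for every model.
vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2))
X_tr = vec.fit_transform(X_train)
X_va = vec.transform(X_val)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": BernoulliNB(),
    "XGBoost": XGBClassifier(n_estimators=200),
}
# Hard-voting ensemble over the three base models.
models["Ensemble"] = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()], voting="hard")

for name, model in models.items():
    model.fit(X_tr, y_train)
    print(name, "F1:", round(f1_score(y_val, model.predict(X_va)), 3))
```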

Interpreting the Results — Classification Report

Precision and recall are as defined above: precision (Positive Predictive Value) measures a classifier's exactness, where low precision indicates a high number of false positives, and recall (Sensitivity, or True Positive Rate) measures a classifier's completeness, where low recall indicates a high number of false negatives.

F1-Score is the harmonic mean of precision and recall.

Support is the number of actual occurrences of each class in the test data.
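Scikit-learn's classification_report prints all four of these values per class; a toy example of the layout (illustrative values only):

```python
from sklearn.metrics import classification_report

# Toy predictions just to show the report format.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 1]

print(classification_report(y_true, y_pred, target_names=["sincere", "insincere"]))
```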

Baseline Test Set Results

Logistic Regression F1: 86.9
Naive Bayes F1: 86.5
XGBoost F1: 70.9
Ensemble F1: 86.6

These results look promising.

However, when submitted to the competition I got a public leaderboard score of 0.483.

From this decrease in F1 score we can assume our models are not generalizing well to the unseen validation data.

We definitely have a lot of room for improvement!

Conclusion

This article and its introductory kernels show my start to a Kaggle competition and provide a baseline for improvement.

In future kernels and articles we will explore other resampling techniques and deep learning in an attempt to improve our competition scores.

I welcome constructive criticism and discussion and can be reached on Twitter @terrah27.
