He also notes what evaluation metric Kaggle will use to score submissions.
For this competition, Kaggle used multiclass log loss to measure the performance of submitted models.
Ideally, our multiclass classification model would have a log loss of 0; the lower the log loss, the more closely the predicted probabilities match the true authors.
Here’s more on log loss, if you’re interested.
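To make the metric concrete, here is a minimal sketch of a multiclass log loss in plain Python. It is not Abhishek's implementation; the function name, the argument names, and the clipping constant `eps` are assumptions chosen for illustration.

```python
import math

def multiclass_log_loss(actual, predicted, eps=1e-15):
    """Average negative log of the probability given to the true class.

    actual: list of integer class labels, e.g. [0, 2, 1]
    predicted: list of per-class probability lists, one list per sample
    eps: assumed clipping constant that keeps log() away from 0
    """
    total = 0.0
    for label, probs in zip(actual, predicted):
        # Clip the true-class probability into [eps, 1 - eps]
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(actual)

# Confident, correct predictions give a loss very close to 0
perfect = multiclass_log_loss([0, 1, 2], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# Guessing uniformly over three classes gives log(3), roughly 1.0986
uniform = multiclass_log_loss([0, 1, 2], [[1 / 3, 1 / 3, 1 / 3]] * 3)
```

The two calls at the end bracket the scores mentioned in this article: a loss near 0 is the ideal, and anything well below log(3) means the model is doing better than a uniform guess over three authors.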
Preprocessing

Next, Abhishek uses the LabelEncoder class from scikit-learn to assign an integer value to each author.
By encoding the text labels in the author column as integers (0, 1, 2), Abhishek puts the target into a form his classification models can work with.
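As a rough illustration of this step, the sketch below encodes a handful of author labels. The short list of labels is made up for the example, and the three author codes are an assumption about how the dataset names its authors.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed example labels; the real data has one label per sentence
authors = ["EAP", "HPL", "MWS", "EAP", "MWS"]

encoder = LabelEncoder()
y = encoder.fit_transform(authors)

# classes_ holds the original labels in sorted order, so the position
# of a label in classes_ is the integer it was mapped to
label_order = list(encoder.classes_)

# inverse_transform maps integers back to the original text labels
decoded = list(encoder.inverse_transform(y))
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns predicted integers back into author names.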
After encoding the author labels, Abhishek splits the data into training and validation sets using train_test_split from scikit-learn.
He opts for a 90:10 train/validation split (splits between 70:30 and 80:20 are more common).
So he intends to train the models on 90% of the sentences in the dataset, and then he’ll evaluate his models on the remaining 10% of the data.
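A 90:10 split with train_test_split might look like the sketch below. The toy sentences and labels are placeholders, and the `stratify` and `random_state` arguments are assumptions added here so that the class balance and the result are reproducible.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 fake sentences spread over three classes
sentences = [f"sentence {i}" for i in range(100)]
labels = [i % 3 for i in range(100)]

x_train, x_valid, y_train, y_valid = train_test_split(
    sentences,
    labels,
    test_size=0.1,      # hold out 10% of the rows for validation
    stratify=labels,    # assumption: keep class proportions similar
    random_state=42,    # assumption: fixed seed for repeatability
    shuffle=True,
)
```

With 100 rows, this leaves 90 sentences for training and 10 for validation.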
Building a Model

Before creating his first model, Abhishek uses TF-IDF (Term Frequency-Inverse Document Frequency) on the data.
TF-IDF will give weights to the words that appear in the sentences in the text column.
So TF-IDF helps us understand which words are important when we are trying to determine which author wrote a particular sentence. A word such as “the” won’t be important for classifying any author because it appears very frequently and doesn’t reveal much information, but a word like “Cthulhu,” for example, would be very important when classifying sentences written by H. P. Lovecraft.
More about TF-IDF can be found here and here.
Running this TF-IDF on the data is a form of feature extraction.
Here, we needed to derive some sort of significant predictor or feature of the data that would help us figure out which author wrote a particular sentence.
With TF-IDF, we have a statistical measure of a word’s importance that can help us predict the author of the sentence.
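To see the idea in action, here is a small sketch using scikit-learn's TfidfVectorizer with its default settings. The three sentences are invented for the example, and the settings in the kernel may well differ from the defaults used here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences; "the" appears in all of them, "Cthulhu" in one
corpus = [
    "the old house stood by the sea",
    "the dreams of Cthulhu haunted the sailor",
    "the raven sat above the door",
]

vectorizer = TfidfVectorizer()  # lowercases the text by default
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix, one row per sentence

# vocabulary_ maps each word to its column; idf_ holds the inverse
# document frequency learned for each column
vocab = vectorizer.vocabulary_
idf_the = vectorizer.idf_[vocab["the"]]
idf_cthulhu = vectorizer.idf_[vocab["cthulhu"]]
```

The inverse document frequency for "cthulhu" comes out larger than the one for "the", which is the part of the weighting that rewards rare words. The final TF-IDF weight also depends on how often a word occurs within each sentence.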
After fitting the TF-IDF on both the training and validation sets, Abhishek prepares a logistic regression model.
If this type of classification model is new to you, read this before continuing.
After fitting the logistic regression model, Abhishek calculates the log loss of his logistic regression model (recall that he wrote the multiclass log loss function near the beginning of the kernel).
The multiclass log loss function returns a log loss value of 0.626 for the logistic regression model.
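The overall flow of this step can be sketched as follows. The tiny dataset is invented and `max_iter` is an assumed setting, so the score it produces has nothing to do with the 0.626 reported in the kernel. Note also that this sketch fits the vectorizer on the training texts only, which differs from the approach described above of fitting on both sets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Invented toy data with three classes (0, 1, 2)
train_texts = [
    "the raven tapped at the chamber door",
    "a tell tale heart beat under the floor",
    "cthulhu sleeps beneath the ancient sea",
    "strange cults whisper of elder gods",
    "the creature fled across the frozen ice",
    "my creator abandoned me to misery",
]
train_labels = [0, 0, 1, 1, 2, 2]
valid_texts = [
    "the heart beat at the door",
    "elder gods sleep in the sea",
    "the creature wept in misery",
]
valid_labels = [0, 1, 2]

# Turn the text into TF-IDF features
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_texts)
x_valid = vectorizer.transform(valid_texts)

# Fit the classifier and predict one probability per class
clf = LogisticRegression(max_iter=1000)  # assumed iteration cap
clf.fit(x_train, train_labels)
probs = clf.predict_proba(x_valid)

# Score the predicted probabilities with scikit-learn's log loss
score = log_loss(valid_labels, probs, labels=[0, 1, 2])
```

The key detail is that log loss is computed from `predict_proba`, not from hard class predictions, so the model is rewarded for being confident only when it is right.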
Although fitting TF-IDF and a logistic regression model gave us a good start, we can improve on this log loss score.
Model Tweaking

So we’re not satisfied with a log loss score of 0.626 and want to optimize this evaluation metric.
From here, we could take a number of routes, and that’s exactly what Abhishek does.
After we’ve explored and preprocessed our data, we’re left with many different combinations of feature extraction and model fitting.
For example, Abhishek uses word counts for feature extraction instead of TF-IDF.
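For comparison with TF-IDF, here is a minimal sketch of word-count features using scikit-learn's CountVectorizer with default settings. The two sentences are invented, and the kernel may configure the vectorizer differently.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented sentences for illustration
corpus = ["the cat saw the dog", "the dog ran"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse matrix of raw counts

# Recover the column order from the learned vocabulary
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense view: one row per sentence, one column per word
dense = counts.toarray()
```

Each cell is simply the number of times a word occurs in a sentence, with no down-weighting of common words. The resulting matrix can be passed to the same logistic regression model in place of the TF-IDF matrix.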
With this feature extraction technique, his logistic regression model’s log loss score improves from 0.626 to 0.528, a whopping 0.098 improvement!

Summary

Since Abhishek’s kernel grows increasingly detailed from this point, I’ll let him do the heavy lifting of explaining the other classification models.
Here’s what we discussed:

EDA: Exploratory data analysis is crucial if we want to understand the dataset, and EDA can save us time when we begin building models.

Multiclass classification problems: This type of problem requires us to predict which class each observation falls into, where each observation belongs to exactly one of three or more classes.

Preprocessing: We have to preprocess our data before we build any models. In this example, we needed to use LabelEncoder() to transform the text labels into integer values for the sake of our models.

Feature Extraction: Whenever we have a dataset of raw data (sentence excerpts in our example), we’ll need to derive some predictor that can help us determine how to classify our observations. Abhishek showed us how to use TF-IDF and word counts.

From here, it’s up to us to extract features with high predictive power and to pick models that match the problem and optimize the metric we’re concerned with.
Don’t be afraid to get your hands dirty and experiment with several models; the more you experiment, the more likely you are to find a model that optimizes your evaluation metric.
I hope after reading this that you better understand how to approach an NLP problem and that you, too, appreciate Abhishek’s work.
Appendix

Abhishek’s Kaggle profile
Abhishek’s NLP kernel
Spooky Authors dataset
What is log loss?
What is TF-IDF?
TF-IDF in layman’s terms
What is logistic regression?

Here is all of Abhishek’s code that is referred to in this article.
I want to reiterate that this is not my own work; this gist is intended to help beginners follow along with Abhishek’s NLP tutorial.
Credit to Abhishek Thakur for this NLP tutorial.