More advanced vectorization strategies such as Word2Vec? These are all questions that we’ll need to think about pretty much anytime we work with text data.
Exploring the Data:
Looking at our Pandas DataFrame, the first thing we did was check for null values and drop songs with NaN lyrics. After this cleaning, we still had 200,000 rows.
Then we looked at value counts for Genre and decided to drop Folk, Indie, and Other: the first two didn’t have enough data, and “Other” doesn’t provide any predictive value for our final classification task.
After all of this cleanup, we were left with eight basic genres: Rock, Pop, Hip Hop, Metal, Country, Jazz, Electronic, R&B.
These are the target classes that we will be trying to predict.
The distribution between genres was uneven, so we decided to randomly select 900 songs per genre, giving us a total of 900 songs × 8 genres = 7,200 songs.
Feature Engineering and Model Optimization:
We used a combination of NLTK, Pandas, and regex methods to:
- clean punctuation and odd characters from the text
- remove stopwords
- tokenize to only English words
- return a corpus of stemmed words
- return a corpus of lemmatized words
- append the final clean lyrics back to the Pandas DataFrame
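The cleaning steps above can be sketched roughly like this. This is a minimal stand-in: the stopword list here is a tiny illustrative set, and the actual pipeline used NLTK’s full English stopword corpus plus its stemmer and lemmatizer, which are omitted for brevity.

```python
import re

# Tiny illustrative stopword list; the actual pipeline used NLTK's full English
# list, plus an NLTK stemmer and lemmatizer (omitted here for brevity).
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "that"}

def clean_lyrics(text):
    """Lowercase, strip punctuation and odd characters, drop stopwords and stray letters."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep only letters and whitespace
    tokens = text.split()                          # naive whitespace tokenization
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

print(clean_lyrics("The night is young, and the MUSIC's loud!"))
# → ['night', 'young', 'music', 'loud']
```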
We used TF-IDF Vectorizer to turn words into a numerical representation of the importance of each word to a particular song lyric.
What is TF-IDF?
TF-IDF stands for term frequency–inverse document frequency, and the TF-IDF weight is often used in information retrieval and text mining.
This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
How is TF-IDF computed?
Typically, the TF-IDF weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
TF: Term Frequency measures how frequently a term occurs in a document.
Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones.
Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:

TF = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF: Inverse Document Frequency measures how important a term is to the meaning/content of a particular document, compared to all other documents in the corpus.
It is known that certain terms, such as “is”, “of”, and “that”, may appear very frequently in most documents, but that doesn’t give us any information on the importance of those commonly used words to a specific document’s meaning.
Thus we need to weigh down the excessively frequent terms while scaling up the rare ones which are specific to only a smaller number of documents, by computing the following:

IDF = log_e(Total number of documents / Number of documents with term t in it)
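As a sanity check, the two formulas can be computed directly on a toy corpus. The three “documents” below are hypothetical token lists, and this is pure Python implementing exactly the TF and IDF definitions above:

```python
import math

def tf(term, doc):
    # TF = (number of times term appears in the document) / (total terms in the document)
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # IDF = log_e(total number of documents / number of documents containing the term)
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    ["love", "night", "love"],
    ["night", "rain"],
    ["love", "rain", "road"],
]

# "love" is 2 of the 3 tokens in doc 0 and appears in 2 of the 3 documents:
# (2/3) * ln(3/2) ≈ 0.27
print(round(tf_idf("love", corpus[0], corpus), 2))  # → 0.27
```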
After stemming and lemmatizing all the song lyrics and creating a TF-IDF feature matrix, we found ourselves with a final Pandas DataFrame of 7,200 rows and 30,000 columns.
Each row represents a particular song lyric and each column is a unique word and its corresponding TF-IDF value.
Training and Optimizing Our Models
The first thing we wanted to do was test whether our basic ML models performed better with a corpus of stemmed or lemmatized text.
We trained and evaluated the performance of Multinomial Naive Bayes, Random Forest, AdaBoost, Gradient Boost, and K-Nearest Neighbor, using both stemmed and lemmatized words.
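The comparison loop might be sketched like this. The data here is synthetic stand-in data (the real inputs were the stemmed and lemmatized TF-IDF matrices), and the model settings are illustrative, not the ones actually used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.random((160, 20))          # stand-in for a nonnegative TF-IDF matrix
y = rng.integers(0, 8, size=160)   # stand-in for the 8 genre labels

models = {
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated accuracy per model; in the real experiment a loop like
# this was run twice, once on the stemmed corpus and once on the lemmatized one
scores = {name: cross_val_score(m, X, y, cv=3).mean() for name, m in models.items()}
print(scores)
```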
The chart below shows our results. We chose to go with lemmatized words over stemmed words because every model consistently performed at least 1% better when using lemmatized text.
From here we opted to focus on model optimization for our top three models: Multinomial Naive Bayes, Gradient Boost, and Random Forest.
Next, we turned to PCA: we ran a test on our data to see how many components would preserve 80% of the variance.
Then we ran PCA with n_components = 1800 on our top three models to see if that improved performance.
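The variance test can be done directly with scikit-learn’s PCA: passing a float instead of an integer asks for the smallest number of components that preserves that fraction of the variance. The matrix below is random stand-in data, so the resulting component count will differ from the 1,800 found on the real TF-IDF features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 50))  # stand-in for the dense TF-IDF feature matrix

# A float in (0, 1) tells scikit-learn to keep the smallest number of
# components whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], round(pca.explained_variance_ratio_.sum(), 3))
```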
The graph below shows the result. As you can see, PCA didn’t improve performance for any of the models, so we decided not to use PCA moving forward.
Next, we ran a GridSearch on the three top-performing models and picked the model and parameter combination that yielded the highest accuracy score.
Summary of results below:
- Grid Search on the Random Forest improved accuracy from 41% to 43%.
- Grid Search on the Gradient Boost improved accuracy from 45% to 50%.
- Grid Search on Naive Bayes did not improve performance, because the default parameters were already optimal.
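A GridSearch run of this kind looks roughly like the sketch below, using scikit-learn’s GridSearchCV. The parameter grid is hypothetical (the write-up doesn’t list the actual ranges searched), and the data is a synthetic stand-in:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((120, 10))          # stand-in features
y = rng.integers(0, 4, size=120)   # stand-in labels

# Hypothetical grid: the write-up does not list the actual parameter ranges
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # exhaustively cross-validates every parameter combination
print(search.best_params_, round(search.best_score_, 3))
```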
Interpreting and communicating the final results:
Below you can see the graph of our top three models’ final performance after optimization and hyperparameter tuning using GridSearch.
Our best model, Gradient Boost after Grid Search, yielded 50% accuracy, which is about four times better than random guessing (guessing a random class out of 8 possible classes = 1/8, or 12.5%).
Even though 50% is not a stellar number, we were still impressed that, given only 7,200 lyrics, we were able to train a model that correctly guesses a song’s genre 50% of the time from its lyrics alone.
From experimenting with Grid Search and PCA optimizations, we found that Multinomial Naive Bayes was the fastest and simplest model to use right out of the box.
Without any extra optimization techniques, it yielded only 5% less accuracy than the top model — GradientBoosted Classifier.
Conclusion:
Based on our fun experiment, it appears that each song genre has a characteristic vocabulary, which makes it possible to train an ML algorithm to guess a song’s genre only by analyzing its lyrics.
Another interesting finding was that the NaiveBayes Classifier seemed to generate a very strong performance right out of the box.
Thus, if you are working with a very large text dataset, where feature generation and model optimization prove to be computationally expensive and time-consuming, you might opt to use Naive Bayes for simplicity and efficiency, without sacrificing performance too much.
If you have sufficient time and computational power and you want to optimize performance as much as possible, then running a GridSearch on a bunch of ensemble models such as Random Forest or Gradient Boosted Classifiers would be the way to go.
Fun Add-On: Using an Unsupervised Learning model to identify distinctive topics and keywords for each genre
We used gensim’s Dictionary to create a frequency dictionary for the lemmatized, tokenized word set.
We grabbed keywords from each genre and generated a Topic Model score.
Using the gensim frequency dictionary, we ran an LDA topic modeling algorithm and printed the word clouds for the top keywords in each genre below.