Brazilian Heavy Metal: An Exploratory Data Analysis using NLP and LDA

An analysis of the lyrics of two of the greatest bands of Brazilian Heavy Metal: Angra and Sepultura

Flávio Clésio
Jul 3

Let’s see how to get there

Motivation

I spend a lot of my time experimenting with NLP techniques, and I noticed that even with the plethora of resources available, it’s very hard to find NLP material tied to Data Analysis, i.e., related to general knowledge over the data, like Text Mining.
It’s very cool to have a lot of scripts, applied blog posts, and GitHub repositories with code, but at least for me the analysis is where the technique shines most, because anyone is able to write a script but only a few can extract knowledge from the data.
The idea here is to get the lyrics of two bands that I like, check their literary characteristics, and try to find some relation or distinction between them.
For very deep and technical posts about NLP, LDA and so on, feel free to jump directly to the end of this post and pick from the many very nice references on these topics.
That is what this post is about, and it was deeply inspired by the great work of Machine Learning Plus.
Why Angra and Sepultura?

Heavy Metal is one of the most borderless music styles in the world, and I would like to show two of the most iconic bands of my country and their literary characteristics in a simple way using Python, LDA, NLP and some imagination (you will see it during the “interpretation” of the topics).

About the Bands

Sepultura

Sepultura is a Brazilian heavy metal band from Belo Horizonte.
Formed in 1984 by brothers Max and Igor Cavalera, the band was a major force in the groove metal, thrash metal, and death metal genres during the late 1980s and early 1990s.
Sepultura has also been credited as one of the second wave of thrash metal acts from the late 1980s and early-to-mid 1990s.
Source: Jamie Martinez
Sepultura Official Website, Sepultura on Spotify

Angra

Angra is a Brazilian heavy metal band formed in 1991 that has gone through some line-up changes since its foundation.
Led by Rafael Bittencourt, the band has gained a degree of popularity in Japan and Europe.
Source: sztachetki
Angra Official Website, Angra on Spotify

Questions

Some personal questions that I have always had about these bands, and that I’ll try to answer with this notebook, are:
1) What are the literary characteristics of Angra and Sepultura?
2) Which kinds of themes do they talk about?
3) Who has more diversity in their topics?

Some limitations

NLP is still an unsolved problem, even with all the overpromising about it. These two anthological pieces by Yoav Goldberg and The Gradient put that in perspective.
The creative process, even with some patterns, is a very complex one that can involve a lot of poetic license.
In this video, Rafael Bittencourt explains the whole process to compose a single lyric for the new album, and in this video, Max Cavalera speaks about the creative process behind the classic album “Roots” from 1996.
Applied techniques

Natural Language Processing

Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Topic Modeling

In machine learning and natural language processing, a Topic Model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.
Topic modeling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body.
Topic models are also referred to as probabilistic topic models, which refers to statistical algorithms for discovering the latent semantic structures of an extensive text body.

Latent Dirichlet Allocation

In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics.
LDA is an example of a topic model.
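In the conventional notation (the symbols below are the standard ones from the LDA literature, not taken from this post), with K topics, D documents, Dirichlet priors α and β, topic-word distributions φ and document-topic mixtures θ, the generative story that LDA assumes can be sketched as:

```latex
\begin{align*}
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d), & n &= 1, \dots, N_d \\
w_{d,n} &\sim \operatorname{Categorical}(\varphi_{z_{d,n}})
\end{align*}
```

Here each word w in document d is produced by first picking a topic z from the document's mixture and then picking a word from that topic's distribution; fitting LDA means inferring θ and φ from the observed words alone.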

Code & Data

All code, datasets, and images are stored in this GitHub repo.

Data extraction and load

To extract the lyrics I used the PyLyrics library, through this script.
Important: this library hasn’t had any update or bug fix since last year.
Below we import some libraries and start a small pre-processing of our data.
The wrapper fetched 325 songs, bringing the artist, lyric, and album for each.
One of the main challenges is that these bands usually write songs in more than one language (EN and PT-BR).
For the sake of simplicity, we’ll concentrate only on the EN lyrics.
To filter out the PT-BR songs I’ll use the textblob library, which uses the Google API to check the language.
The main caveat is that if you re-run it many times you may receive the error HTTP Error 429: Too Many Requests.
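Since the language check depends on a remote API, one way to soften the HTTP 429 problem is to cache the results and retry with exponential backoff. The sketch below is only an illustration under assumptions: `detect` stands for whatever function performs the language check (a hypothetical callable, not textblob's actual API), and the retry count and delays are arbitrary.

```python
import time
from urllib.error import HTTPError


def detect_with_retry(text, detect, cache, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call detect(text), retrying on HTTP 429 with exponential backoff.

    detect is a hypothetical callable returning a language code such as "en".
    cache is a dict, so re-running the notebook does not repeat requests.
    """
    if text in cache:
        return cache[text]
    for attempt in range(max_retries):
        try:
            cache[text] = detect(text)
            return cache[text]
        except HTTPError as err:
            # Re-raise anything that is not rate limiting, or the last failure.
            if err.code != 429 or attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Passing the same `cache` dictionary on every call (or persisting it to disk) means each lyric is sent to the API at most once, which is usually what avoids the rate limit in the first place.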
Here we can see that, out of 325 lyrics, 38% are from Angra and 62% are from Sepultura.
Angra has 96% (119) of its lyrics in EN and Sepultura has 96% (194) of its lyrics in EN.
The most remarkable Angra song in PT-BR (IMHO) is Caça e Caçador, from the album Hunters and Prey.
On the Temple of Shadows album, the song Late Redemption is a good piece in EN/PT-BR.
Sepultura has some songs in PT-BR, like Filhos do Mundo from Bestial Devastation, Prenuncio from Against, and A Hora E A Vez Do Cabelo Nascer from the Beneath the Remains album, which is a cover of a Mutantes song.
The most remarkable one in PT-BR is the song Policia.
With all PT-BR lyrics removed, let’s perform a quick check on all albums of these bands.
At first look, considering our dataframe, we can see the first difference between these two bands: Sepultura has a larger discography and more songs per album.
This can be explained by the fact that, even though both bands faced a hiatus when they changed their lead singers (Andre Matos and Edu Falaschi in Angra, and Max Cavalera in Sepultura), Sepultura released 8 albums after its break (all of them with Derrick Green) while Angra released 6; Sepultura is simply a more prolific band.
Let’s keep that information in mind, because it may be explained later in this analysis.
Let’s check the average number of songs per album.
As we saw in the visual inspection, Sepultura not only has more albums but also has more songs per album.
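The per-album average is a simple group-and-count. The snippet below uses a handful of made-up rows with placeholder names purely to show the mechanics; the (artist, album, song) layout is an assumption, and the real notebook works on the full lyrics dataframe.

```python
from collections import defaultdict

# Toy rows (artist, album, song); placeholders, not the real dataset.
rows = [
    ("Band A", "Album 1", "Song 1"),
    ("Band A", "Album 1", "Song 2"),
    ("Band A", "Album 2", "Song 3"),
    ("Band B", "Album 3", "Song 4"),
    ("Band B", "Album 3", "Song 5"),
    ("Band B", "Album 3", "Song 6"),
]


def songs_per_album(rows):
    """Return {artist: average number of songs per album}."""
    counts = defaultdict(lambda: defaultdict(int))
    for artist, album, _song in rows:
        counts[artist][album] += 1
    return {
        artist: sum(albums.values()) / len(albums)
        for artist, albums in counts.items()
    }


averages = songs_per_album(rows)
```

With these toy rows, "Band A" averages 1.5 songs over two albums and "Band B" averages 3.0 over one.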
To start our analysis, one important aspect of Text Analysis is the data pre-processing.
This is where we can literally ruin the whole analysis, because pre-processing is responsible for removing the noise from the data and normalizing it to get meaningful results.
Kavita Ganesan made a great analysis of this topic and I strongly recommend reading it.
The first step will be to remove all English stopwords from the lyrics.
PS: Personally, I don’t like to use off-the-shelf stopword lists, because every domain demands its own subset of words to decide whether a word is important or not.
But let’s keep it that way for the sake of simplicity.
This nice text by Martina Pugliese explains it in detail.
In terms of implementation, this article from ML Whiz is probably the best resource available on the internet.
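As a minimal sketch of the stopword step (not the notebook's actual code), the function below lowercases a lyric, strips apostrophes, tokenizes on letters and drops stopwords. The stopword set is a tiny illustrative one chosen for this example; an off-the-shelf list would be much longer.

```python
import re

# A tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "i", "you"}


def remove_stopwords(lyric, stopwords=STOPWORDS):
    """Lowercase, strip apostrophes, tokenize on letters, drop stopwords."""
    text = lyric.lower().replace("'", "").replace("’", "")
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in stopwords]


tokens = remove_stopwords("Carry on, time to forget the remains of the past")
```

Stripping apostrophes before tokenizing is a design choice that turns "don't" into "dont", which is the kind of token that shows up in the topics later in this post.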
After the stopwords removal, let’s perform a quick visual check on the most frequent words used by these two bands.
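Counting the most frequent words takes only a few lines once each song is a list of cleaned tokens. This is a sketch with toy input; the function name `top_words` and the token lists are made up for the example.

```python
from collections import Counter


def top_words(token_lists, n=3):
    """Count words over all songs and return the n most common."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(n)


# Toy input: one token list per song, already cleaned.
songs = [
    ["time", "life", "time"],
    ["life", "dream", "time"],
]
most_common = top_words(songs, n=2)
```

Running it once per band gives the two rankings that are compared next.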
In other words: what are the most used words in their compositions?

If I were to classify Angra’s lyrics based on their most common expressions, it would look like this:
Time relation: Time, Day, Wait, Night
Feelings: Like, Heart, Soul, Lie
Movement and distance: Come, Way, Away, Carry
Living and mind: Life, Let, Know, Dream, Mind, Live, Inside, Leave
The absolute state of the world: Die
Typical Heavy Metal cliché expression: Oh

Now, a quick check of Sepultura’s lyrics:
State of the modern world: Death, War, Hate, Die, Lie, Pain, Lose, Fear, Blood, World
Action, distance and time: Way, Come, Stop, Make, Rise, Time, Think, Hear, Know
Mind issues: Live, Mind, Feel, Want, Look, Inside

Latent Differences

Some latent differences between the themes of Sepultura and Angra emerge:
Sepultura’s lyrics converge on subjects linked to death, pain, war and hatred (in case you don’t know, Sepultura means “Grave” in PT-BR), which are considered the most aggressive/heavy themes;
Angra has a lighter theme, talking more about existential issues that involve the passage of time, as well as some songs with feelings linked to dreams and to internal conflicts.
Let’s see the word cloud relative to the most frequent words of the two bands, only for a small comparison according to all the vocabulary used by the bands.
Main words Angra: life, time, know, heart, day, away, dream
Main words Sepultura: way, life, death, world, mind

Lexical Diversity

According to Johansson (2009), lexical diversity is a measure of how many different words are used in a text.
The practical use of Lexical Diversity is given by McCarthy and Jarvis (2010): they say that LD is the range of different words used in a text, with a greater range indicating a higher diversity.
A special consideration here is that Heavy Metal songs are not supposed to contain a lot of different words, i.e., a great lexical richness.
This is because in most cases each band follows a single artistic concept and shapes its creative efforts around some themes and, of course, because this kind of song usually has many choruses.
For example (regarding band concept), Avantasia is a Heavy Metal supergroup that talks about fiction, fantasy and religion; on the other side, Dream Theater talks about almost everything, from religion to modern politics.
With this disclaimer, let’s check the Lexical Diversity of these bands.
There are almost no lexical diversity differences between these two bands, i.e., even though they use different words to shape their themes, there is no substantial difference between them in how varied their vocabulary is.
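One simple way to put a number on lexical diversity is the type-token ratio (unique words divided by total words). Whether the notebook uses exactly this formula is an assumption on my side; the sketch below only illustrates the idea, with toy token lists.

```python
def lexical_diversity(tokens):
    """Type-token ratio: unique words divided by total words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy examples: a chorus-heavy lyric versus a more varied one.
chorus_heavy = ["oh", "oh", "carry", "on", "carry", "on"]
varied = ["time", "heals", "every", "wound", "in", "life"]

ratio_chorus = lexical_diversity(chorus_heavy)
ratio_varied = lexical_diversity(varied)
```

The repeated chorus pulls the ratio down to 0.5 while the varied lyric scores 1.0. Keep in mind that this ratio is sensitive to text length, so comparing very long and very short corpora with it can be misleading.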

Word N-Grams

According to Wikipedia, an n-gram is a contiguous sequence of n items from a given sample of text or speech.
The items can be phonemes, syllables, letters, words or base pairs according to the application.
In other words, n-grams are sequences of n words that can be used to model the probability that some sequence appears in a corpus; in our case, with n-grams we can examine the most frequent combinations of n words in their literary dictionary.
For the sake of simplicity, we will focus on bigrams (n=2) and trigrams (n=3).
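Extracting word n-grams from a token list can be sketched with `zip` over shifted copies of the list. The tokens below are a toy example, not the real corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-word tuples of a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["carry", "on", "time", "forget", "carry", "on"]
bigram_counts = Counter(ngrams(tokens, 2))
trigrams = ngrams(tokens, 3)
```

Counting the tuples with `Counter` over every song of a band gives the rankings discussed below; note that a song present on two albums is counted twice unless duplicates are removed first.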

N-Grams Angra

Here we can see some things:
“You’re” is the top combination for n=2. This indicates that, across the whole corpus of Angra’s songs, there are lyrics that contain some kind of message addressed to another person.
One of the most frequent bigrams is carry on, but there is a reason for that in the data: this data set contains both the Angels Cry and Holy Live records, which include the song Carry On, and this causes double counting;
The bigrams me cathy and cathy come appear because of a cover of Kate Bush’s Wuthering Heights, which repeats this chorus a lot;
The traditional heavy metal chorus filler cliché oh oh also appears.

In the trigrams:
Carry On appears again in carry on time, on time forget and remains past carry;
The word Cathy appears in heathcliff me cathy, me cathy come and cathy come home;
There are some bizarre patterns like ha ha ha, probably because of the data cleansing.

N-Grams Sepultura

In these bigrams, we can already see a little more of Sepultura’s themes related to brutality, as I mentioned before.
Some mentions:
The song “Choke” has a very repetitive chorus, and that contributes to this composition of bigrams.
The same happens with the classic “Roots”, which has a very striking chorus.

Let’s go to the trigrams:
Here we see basically the same pattern, with part of the trigrams coming from some very striking choruses.
Now we know a little about the themes of the two bands. However, a question that follows is: within these themes, what are the latent topics behind each composition? I.e., is there diversity among the themes inside each band’s concept? What if we could group these songs according to their literary composition?

And here’s where we’re going to use LDA.

LDA

First, we will filter each of the artists into their respective dataframe.

To distinguish topics I’ll arbitrarily choose 7 topics for each artist (it could be more or fewer), only for didactic purposes and to keep things simple.
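To make the idea of fitting topics concrete, here is a toy collapsed Gibbs sampler for LDA in plain Python. This is not the code behind the results in this post (a library implementation is the practical choice); the function name `lda_gibbs`, the hyperparameters and the tiny corpus are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, top_n=3, seed=0):
    """Toy collapsed Gibbs sampler; returns (doc-topic mixtures, top words per topic)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total words per topic
    assignments = []
    for d, doc in enumerate(docs):  # random initial topic for every word
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[k][v])[:top_n]]
        for k in range(n_topics)
    ]
    return theta, top_words


# Toy corpus: one token list per song.
docs = [
    ["war", "blood", "death", "war"],
    ["death", "pain", "war", "blood"],
    ["time", "dream", "life", "time"],
    ["life", "dream", "heart", "time"],
]
theta, top_words = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a song's mixture over topics and each entry of `top_words` lists the highest-count words of a topic, which is the same kind of output that is interpreted in the next sections. The number of topics is an input, which is why choosing 7 is an arbitrary decision rather than something the model finds by itself.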
In other words: given all Angra and Sepultura lyrics, what are the top 7 topics that they write about most?

Topics Angra

Topics Sepultura

Topic Distribution

Topic Distribution Angra

We can see here that much of Angra’s thematics is focused on topics 4, 0 and 2, which I call Look and know about the world along the time (#4), Face the pain along the time (#0), and In life dreams come and go away (#2):
Topic #4: eyes time life ive world love say inside know got
Topic #0: time dont day away way youre face pain just cause
Topic #2: let come like away day life wont wonder cold dreams

Now let’s check Sepultura:

Topic Distribution Sepultura

Sepultura focuses on topics such as life and fear time away (#3), being alive in a world with pain and death (#5), and living in a world with war and blood spilling (#0):
Topic #3: dont just away time fear youre know life right look
Topic #5: end theres dead world death feel eyes pain left alive
Topic #0: war live world hear trust feel walk believe blood kill

Word per Topics

Here is just a table for us to check the order of the words in the topics that permeate the literary part of these two bands.
A special highlight here is that this dataframe also takes into account the frequency and the rank of each word within the topic.
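For the topic distribution charts above, a common approach is to assign each song to its highest-weight topic and then count the songs per topic. Whether the notebook does exactly this is an assumption; the document-topic weights below are hypothetical numbers used only to show the calculation.

```python
from collections import Counter

# Hypothetical document-topic weights (one row per song, one column per topic).
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]


def topic_distribution(doc_topic):
    """Share of documents whose highest-weight topic is each topic."""
    dominant = [row.index(max(row)) for row in doc_topic]
    counts = Counter(dominant)
    n_topics = len(doc_topic[0])
    return [counts[k] / len(doc_topic) for k in range(n_topics)]


shares = topic_distribution(doc_topic)
```

In this toy case half of the songs fall under topic 0 and a quarter under each of the others; a flatter list of shares means the songs are spread more evenly across topics.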

Word per Topics Angra

Word per Topics Sepultura

Topic Plotting with word distribution

Here, with the pyLDAvis library, we can take a look at how the topics are distributed via visual inspection.
The graph presented by pyLDAvis is called the Intertopic Distance Map. It consists of a two-dimensional plane where the topic centers are determined by computing the distance between topics and then using multidimensional scaling to project the intertopic distances onto two dimensions.
With that, we’re able to interpret the composition of each topic and which individual terms are most useful inside of some topic.
A more comprehensive introduction to LDAvis can be found in Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces.
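As far as I know, the default inter-topic distance in LDAvis is the Jensen-Shannon divergence between the topic-word distributions, which is then projected onto two dimensions. The sketch below only covers the distance part, in plain Python, with toy distributions; it is not how pyLDAvis is implemented internally.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; skips zero entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy topic-word distributions over a three-word vocabulary.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
distance_ab = jensen_shannon(topic_a, topic_b)
```

The divergence is symmetric, is zero for identical topics and reaches log(2) for topics that share no words, so topics drawn far apart on the map use clearly different vocabularies.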

LDA Plot for all Angra’s Topics

LDA Plot for all Sepultura’s Topics

Conclusion

Initially, I had 3 questions in mind, and after the whole trip using NLP and LDA I personally think that I have some answers for them.
1) What are the literary characteristics of Angra and Sepultura?
Answer: Angra’s main literary characteristics are themes related to time, soul and life, mind and fate, and waiting for something.
Sepultura has a more aggressive literary composition, where they speak about death, war and pain, and several of their lyrics face death directly.
They protest against a lost or sick world most of the time.
2) Which kinds of themes do they talk about?
Answer: Angra: Time, Soul and Fate.
Sepultura: Death, War, and World of Pain.
3) Who has more diversity in their topics?
Answer: Using an arbitrary number of 7 topics, we can see that Sepultura has more diversity in terms of the distribution of topics.

Further ideas and TODOs

Include track names
Compare Sepultura Eras (Max, Derrick)
Compare Angra Eras (Matos, Falaschi, Lione)
Similarity between tracks (Content-Based)
Sepultura/Angra LSTM music lyric generator
Topic evolution
Dominant topic per album
Lyric Generation using LSTM

References and useful links

Machine Learning Plus: LDA in Python, How to grid search best topic models?
Susan Li: Building a Content Based Recommender System for Hotels in Seattle
Susan Li: Automatically Generate Hotel Descriptions with LSTM
Shashank Kapadia: End-To-End Topic Modeling in Python: Latent Dirichlet Allocation (LDA)
Meghana Bhange: Arctic Monkeys Lyrics Generator with Data Augmentation
Greg Rafferty: LDA on the Texts of Harry Potter
Code Academy: Using Machine Learning to Analyze Taylor Swift’s Lyrics
Alexander Bell: Music Lyrics Analysis: Using Natural Language Processing to create a Lyrics-Based Music Recommender
Trucks and Beer: Amazing project with Lyrics scraper
Anders Olson-Swanson: Natural Language Processing and Rap Lyrics
Brandon Punturo: Drake, Using Natural Language Processing to understand his lyrics
Degenerate State: Heavy Metal and Natural Language Processing, Part 1
Degenerate State: Heavy Metal and Natural Language Processing, Part 2
Degenerate State: Heavy Metal and Natural Language Processing, Part 3
Packt_Pub: Generating Lyrics Using Deep (Multi-Layer) LSTM
Mohammed Ma’amari: AI Generates Taylor Swift’s Song Lyrics
Notebook Taylor Swift’s Song Lyrics: Link in Colab
enrique a.: Word-level LSTM text generator. Creating automatic song lyrics with Neural Networks
Ivan Liljeqvist: Using AI to generate lyrics
Sarthak Anand: Music Generator
Franklin Wang: song-lyrics-generation
Tony Beltramelli: Deep-Lyrics