Action Movies vs Dramas: How Do Their Scripts Differ?
An analysis of the differences between action movies and dramas using Python and the Natural Language Toolkit (NLTK)
Aditya Kharosekar, May 28
Photo by Jakob Owens on Unsplash

I like movies and I want to learn more about Natural Language Processing, so combining the two seemed like a perfect fit.
I decided to do some very basic analysis on the differences between two genres of movies—action movies and dramas—using Python and the Natural Language Toolkit (NLTK).
The goal was to gain more familiarity with NLTK, word tokenizing, frequency distributions, and other similar concepts.
I wanted to answer the following question: how do screenplays of action movies differ from screenplays of dramas?

Summary
I analyzed ten scripts of action movies and ten scripts of drama movies.
If we exclude stopwords, action scripts have 1.2% more words than drama scripts.
If we include stopwords, drama scripts have 1.1% more words.
Drama scripts contain about 40% more occurrences of the word ‘I’.
Words in action scripts are between 6% and 9% longer on average than words in drama scripts.
This difference is due to action scripts having more long words.
Lexical diversity is a measure of the number of unique words in a text.
After lemmatization, the two genres differ in lexical diversity by only about 2%.

The Data
The first thing I had to decide was how many movies in each category I would consider.
The more movies I included in my analysis, the better, but gathering and cleaning a large number of scripts would have turned this into more of a text-processing exercise.
I was more interested in playing around with NLTK, so I decided to only look at ten movies in each category, so ten dramas and ten action-adventure movies.
My process for deciding which ten movies to include wasn’t even remotely scientific.
I decided to take those ten movies that I am most familiar with and which I think constitute a representative sample of that genre from the last four to five decades.
The movies I included in my analysis:

Action-Adventure
- The Dark Knight
- The Matrix
- The Avengers
- Star Wars Episode 5
- Avatar
- Pirates of the Caribbean 1
- Terminator 2
- Indiana Jones and the Raiders of the Lost Ark
- Star Trek
- Die Hard

Drama
- The Shawshank Redemption
- Forrest Gump
- The Godfather
- The Godfather Part 2
- Schindler’s List
- The Green Mile
- Pulp Fiction
- Good Will Hunting
- Goodfellas
- 12 Angry Men

Now, one obvious issue with my data collection procedure is the minuscule number of movies I looked at.
The small sample size means that any findings may not be representative of the movie genres as a whole.
Another issue is that the movie selection process is completely subjective and open to personal bias.
I admit that the small sample size is an issue.
I decided to go ahead anyway because I was more interested in getting familiar with NLTK and NLP than I was in generating a statistically significant result.
And with regard to the second issue, placing movies in buckets is an inherently subjective process.
It was more interesting to see the differences between the genres as I perceive them.
In the future, I will think about creating a dataset of a large number of movie screenplays which I could use for more robust analysis.
For the movies I selected, I manually searched for their screenplays on Google, and saved them each into a text file.
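Loading the saved text files can be sketched with the standard library; the directory layout and file names below are my own assumption, not the author's actual setup, and the demo writes two tiny fake "screenplays" so the snippet runs on its own:

```python
from pathlib import Path

def load_scripts(genre_dir):
    """Return a dict mapping movie name -> raw screenplay text."""
    return {p.stem: p.read_text(encoding="utf-8", errors="ignore")
            for p in Path(genre_dir).glob("*.txt")}

# Self-contained demo: create two fake screenplay files, then load them back.
demo = Path("scripts_demo/action")
demo.mkdir(parents=True, exist_ok=True)
(demo / "movie_a.txt").write_text("INT. WAREHOUSE - NIGHT\nHe runs.")
(demo / "movie_b.txt").write_text("EXT. ROOFTOP - DAY\nShe jumps.")

scripts = load_scripts(demo)
print(sorted(scripts))  # ['movie_a', 'movie_b']
```

Keeping each screenplay as one big string per movie makes the later tokenizing and counting steps straightforward.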

Average number of words
The first thing I wanted to look at was the length of screenplays.
My hypothesis was that action screenplays would have fewer words because they would tend to have less dialogue.
I used nltk.tokenize to break the screenplays into words.
I then calculated the average length, in words, of an action movie screenplay and of a drama movie screenplay.
I performed this calculation both including and excluding stopwords.
Stopwords are words which are primarily used as glue within sentences.
They’re usually the most common words in any sentence or larger work but they hold no special meaning in themselves.
Examples are ‘of’, ‘the’, ‘in’, ‘a’.
In NLTK, stopwords can be accessed through nltk.corpus.stopwords; calling stopwords.words('english') returns the full list of English stopwords. As they’re all in lowercase, I applied the .lower() method to each word when iterating through the screenplays.
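The counting step can be sketched without NLTK's data files; here a simple regex stands in for nltk.tokenize.word_tokenize, and a tiny illustrative set stands in for the full stopwords.words('english') list:

```python
import re

# Stand-in tokenizer; the article uses nltk.tokenize.word_tokenize instead.
def tokenize(text):
    return re.findall(r"[A-Za-z']+", text)

# A tiny illustrative subset of NLTK's English stopword list.
STOPWORDS = {"of", "the", "in", "a", "and", "to", "i", "he", "she", "it"}

script = "He runs to the edge of the roof and he jumps."
tokens = tokenize(script)

total = len(tokens)                                               # all words
content = len([w for w in tokens if w.lower() not in STOPWORDS])  # stopwords excluded

print(total, content)  # 11 4
```

Running the same two counts over every script in a genre and averaging them gives the figures compared below.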
The results:

Stopwords make a huge difference in the average length of screenplays

If we exclude stopwords from our analysis, action movie screenplays have on average 270 more words than screenplays of dramas. That is approximately a 1.2% difference: not too much, but not zero either. If we include stopwords, there’s a role reversal. Drama movies now have about 380 (1.1%) more words than action movies.
So my hypothesis was correct, but only if we include stopwords, and I think it is reasonable in this case to include them, as I was looking at the length of the screenplay as a whole.
As the presence of stopwords is enough to make one genre of screenplays longer than the other, I was interested in seeing which particular stopwords play a part in making this happen.
Let’s see which stopwords are most common in each genre.

The word ‘I’ is more common in dramas
I used FreqDist to create a count of the number of occurrences of each stopword.
We can see that both genres have very similar distributions — ‘the’, ‘and’, ‘to’, ‘a’ are the most common words in both.
But interestingly, dramas have more occurrences of ‘I’.
In fact, dramas have about 40% more instances of ‘I’ than do action movies.
One reason for this could be that dramas may have more dialogue in them, and so characters are more likely to reference themselves with ‘I’.
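The counting itself can be sketched with collections.Counter, the standard-library analogue of NLTK's FreqDist; the two lines of "dialogue" below are made up purely for illustration:

```python
from collections import Counter

# Made-up snippets standing in for real screenplay text.
drama_line = "I think I know what I want, and I am sure of it."
action_line = "Get to the chopper and cover the exit now."

def stopword_counts(text, stopwords):
    """Count occurrences of each stopword in a text, case-insensitively."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w in stopwords)

stops = {"i", "the", "and", "to", "a", "of", "it", "am", "what"}
print(stopword_counts(drama_line, stops)["i"])   # 4
print(stopword_counts(action_line, stops)["i"])  # 0
```

FreqDist offers the same mapping interface plus extras like .most_common() and .plot(), which is what produced the distribution chart above.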

Average length of words
Next, I wanted to see if there is a difference in word length between the two genres.
I used the in-built word tokenizer as in the previous section and calculated the average word length for each genre.
Action movies tend to have slightly longer words

Words in action movies are about 6% longer than words in dramas if you include stopwords, and 9% longer if you don’t.
There’s no obvious reason why this would be the case, and in fact if you find the median word length in both cases, this difference goes away.
The median word length for both genres is three characters if you count stopwords, and four characters if you exclude them.
So it’s likely that there are some outliers which are leading to this difference.
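The mean-versus-median point can be illustrated with the statistics module; the word lists here are made up, but they show how a few long words pull the mean up while leaving the median untouched:

```python
from statistics import mean, median

# Hypothetical word lists: the same everyday words, but the "action" list
# carries one long outlier ("detonator").
drama_words  = ["go", "run", "stop", "the", "door", "now"]
action_words = ["go", "run", "stop", "the", "door", "now", "detonator"]

drama_lengths  = [len(w) for w in drama_words]
action_lengths = [len(w) for w in action_words]

print(mean(drama_lengths), mean(action_lengths))      # means differ: outlier pulls one up
print(median(drama_lengths), median(action_lengths))  # medians agree
```

This is exactly the pattern in the screenplay data: the means diverge by several percent while the medians match.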
Let’s see the distribution of word lengths for both genres:

Longer words are more common in action movies than in dramas.
This is interesting because I would have thought that dramas, which could be seen as more dialogue-driven, would be more likely to have longer words.

Lexical diversity
One final thing I wanted to look at was lexical diversity, which is a measure of how many different words are used in a particular text.
Important to the concept of lexical diversity is the idea of lemmatization, which essentially is the process of converting various forms of a word to the root form.
For example: the words run, running, and ran are three different words but we can see that they are variations on the word run.
Lemmatization will convert each of these words to run.
Calculating lexical diversity without lemmatization, or its sister process stemming, will give an inaccurate metric because it will take the different forms of a word, say run, as entirely different words when in fact they are variations on the same word.
NLTK comes with an in-built lemmatizer, which can be accessed with the import statement from nltk.stem import WordNetLemmatizer and then instantiated with lemmatizer = WordNetLemmatizer(). To get the root form of a word, you simply pass that word to the lemmatize() function; for example, lemmatizer.lemmatize('flying', pos='v') returns the word ‘fly’.
The ‘pos’ argument stands for ‘Part of Speech’.
The result of the lemmatizer for a given word depends on that word’s part of speech, as different parts of speech may have different root forms.
In fact, if you leave out the ‘pos’ argument in the last example, it will assume that the word you’re giving it is a noun and it will actually return ‘flying’ as the root form.
To associate each word with its correct part-of-speech tag, I used nltk.pos_tag.
This function takes a list of words and creates a tuple of the following form for each word: (word, pos tag).
I then iterated through each of these tuples, and passed the pos tag to the lemmatizer to get the appropriate lemma.
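One wrinkle worth noting: nltk.pos_tag returns Penn Treebank tags ('VBG', 'NN', and so on), while lemmatize() expects WordNet's single-letter codes ('v', 'n', 'a', 'r'). A minimal mapping, assuming unknown tags default to nouns just as the lemmatizer itself does:

```python
# Map a Penn Treebank tag (as returned by nltk.pos_tag) to the single-letter
# part-of-speech code that WordNetLemmatizer.lemmatize() expects.
def penn_to_wordnet(tag):
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # default to noun, matching the lemmatizer's default

# e.g. the tagger labels 'flying' in "birds are flying" as VBG:
print(penn_to_wordnet("VBG"))  # v
print(penn_to_wordnet("NNS"))  # n
```

With this in place, each (word, tag) tuple can be lemmatized as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)).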
Finally, I calculated the lexical diversity by dividing the total number of lemmas by the number of distinct lemmas, giving me the average number of times that each unique lemma was found in that particular text. For example, the script of Die Hard had a lexical diversity score of 4.82, which means that on average, each unique word in Die Hard was used 4.82 times, or slightly fewer than five.
Note that under this definition, a higher score means each unique lemma is repeated more often; a lower score therefore indicates more linguistic variety in the text.
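The score itself reduces to a single division; here is a sketch over a made-up list of lemmas:

```python
# Lexical diversity as defined above: total lemmas / distinct lemmas,
# i.e. the average number of occurrences of each unique lemma.
def lexical_diversity(lemmas):
    return len(lemmas) / len(set(lemmas))

# Made-up lemma list standing in for a lemmatized screenplay.
lemmas = ["run", "run", "jump", "run", "jump", "fall"]
print(lexical_diversity(lemmas))  # 2.0  (6 lemmas, 3 unique)
```

For a real script, lemmas would be the full list produced by the pos-tag-and-lemmatize loop described above.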
Let’s see if there are any differences between the two genres.
My hypothesis is that dialogue is the driving factor behind a substantial fraction of the different words found in a script and as action movies are less dialogue-driven, they will have lower lexical diversity.
There is a slight difference between the lexical diversity of an average action script and that of an average drama script (5.32 for action vs 5.42 for drama).
That’s only about a 2% difference, and because of the small sample size that I’ve taken, I cannot say whether this difference will hold true across the genres as a whole.
It may be a function of the particular scripts I selected.
A final interesting point: drama movies seem to have a wider range of diversity than action movies.
The orange dots in the above graph signify the lowest and highest diversity scores for each genre.
We can see that dramas have a wider range than action movies.
Again, this difference is probably due to the sample size, but it’s interesting nonetheless.
I hope you found this interesting.
Thanks for reading! Feel free to leave a reply if you have any questions.