Detecting the Cast in the Works of Anderson, Tarantino, and Ozu using Face Recognition
Seong Kang
May 31

Motivation and Application

This project started as a fun side project.
I was originally inspired by this video essay: http://kogonada.
Combining my interest in film with my desire to learn more about machine learning, I thought that figuring out how to detect symmetrical scenes in a film would be an interesting challenge.
Ultimately, I’m interested in a system that can take a scene, detect and label specific film techniques, and identify the director or other directors with a similar style.
Since that would be a much larger project, I am hoping to use this smaller exercise to gain the experience and knowledge needed as a building block for a bigger system.
This post will focus specifically on facial recognition using machine learning and how we may utilize it to analyze a film.
We will take images from each scene in a movie and use the face_recognition Python package to detect the faces in each image.
We then use their feature encodings to create clusters, and use this to estimate the size of the cast in a movie.
Combining this with metadata from each scene, we relate this back to the original film and how it’s connected to the style of each director.
Dataset

We used films from 3 different directors to run our analysis:

Yasujiro Ozu: The Flavor of Green Tea over Rice (1952), Tokyo Story (1953), An Autumn Afternoon (1962)
Wes Anderson: The Royal Tenenbaums (2001), The Life Aquatic with Steve Zissou (2004), Moonrise Kingdom (2012), The Grand Budapest Hotel (2014)
Quentin Tarantino: Reservoir Dogs (1992), Pulp Fiction (1994), The Hateful Eight (2015)

Depending on the availability of each movie, the video files came in different formats and at different quality levels (this was also partly due to the challenge of limited disk space, as explained later).
The screenshots extracted from each film, on which we run the actual facial recognition, are all saved as .jpg files, with each film producing anywhere from 600 to 1,500 images.
Workflow

Face Embeddings

A package called face_recognition is used to detect the faces in each image.
The face_recognition package has already been trained on over 3 million images of faces to produce a 128-dimensional embedding of each face.
Therefore, there was no additional tuning necessary, since it’s already been trained to extract the most relevant features of a human being’s face.
Once all the face embeddings of the labeled images have been created, they are pickled and saved for later.
The process of encoding the faces from the test set is identical to that for the training set, except that the detected faces are typically smaller than those in the training data, since they are stills from a scene.
Once all the faces have been extracted and encoded, the embeddings are saved so they can be compared against one another.
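As a rough sketch of this step (not the actual script, which has not been published yet), the snippet below detects the faces in a list of stills and pickles their encodings. `load_image_file`, `face_locations`, and `face_encodings` are calls provided by the face_recognition package; the helper names `encode_stills`, `save_encodings`, and `load_encodings`, as well as the record layout, are assumptions made purely for illustration.

```python
import pickle


def encode_stills(image_paths, model="hog"):
    """Return one record per detected face: source image, box, 128-d encoding."""
    # Imported here so the pickling helpers below still work on machines
    # where face_recognition (and its dlib dependency) is not installed.
    import face_recognition

    records = []
    for path in image_paths:
        image = face_recognition.load_image_file(path)
        # model is either "hog" or "cnn"
        boxes = face_recognition.face_locations(image, model=model)
        encodings = face_recognition.face_encodings(
            image, known_face_locations=boxes
        )
        for box, encoding in zip(boxes, encodings):
            records.append({"image": path, "box": box, "encoding": encoding})
    return records


def save_encodings(records, out_path):
    """Pickle the list of face records for later clustering."""
    with open(out_path, "wb") as f:
        pickle.dump(records, f)


def load_encodings(in_path):
    """Load a list of face records written by save_encodings."""
    with open(in_path, "rb") as f:
        return pickle.load(f)
```

With hypothetical file names, this could be used as `save_encodings(encode_stills(["scene_001.jpg"]), "encodings.pickle")`. Keeping the source image path next to each encoding makes it possible to trace a cluster back to the scene it came from.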
Comparing Faces

The DBSCAN algorithm is used to cluster the same cast member's faces together across different scenes.
The DBSCAN implementation from scikit-learn treats the encodings as points in a 128-dimensional space and groups the closest points together.
A simplified version is demoed in the scikit-learn documentation, which produced the image below.
The same algorithm is extended here to a much higher-dimensional space.
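To make the clustering step concrete, here is a minimal sketch using scikit-learn's `DBSCAN`. The `eps` and `min_samples` values are simply scikit-learn's defaults, not the values used in this analysis, and the helper name `cluster_faces` and the synthetic "encodings" are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def cluster_faces(encodings, eps=0.5, min_samples=5):
    """Cluster 128-d face encodings; return (labels, number_of_clusters)."""
    X = np.asarray(encodings, dtype=float)
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="euclidean")
    labels = model.fit(X).labels_
    # DBSCAN gives the label -1 to points that belong to no cluster (noise)
    n_clusters = len(set(labels.tolist()) - {-1})
    return labels, n_clusters


if __name__ == "__main__":
    # Synthetic stand-in for real encodings: 3 "actors", 20 stills each
    rng = np.random.default_rng(0)
    centers = rng.normal(size=(3, 128))
    fake = np.vstack([c + 0.01 * rng.normal(size=(20, 128)) for c in centers])
    labels, n_clusters = cluster_faces(fake)
    print("estimated cast size:", n_clusters)
```

The number of clusters (ignoring noise) is what serves as the estimated cast size. Both parameters strongly affect that number, so they would likely need tuning per film.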
In the case of Moonrise Kingdom, there were 13 major clusters, with all the major cast members being successfully recognized and grouped.
Edward Norton (left) and Kara Hayward (right)

But there's still room for improvement.
Here's Bill Murray and Bruce Willis getting grouped together… along with a lightbulb(?)
I'm actually not sure what that image near the top-left is supposed to be…

Findings

Success Rate

The face_recognition package offers two different methods to detect and encode faces in an image: HOG (Histogram of Oriented Gradients) and CNN (Convolutional Neural Network).
In order to compare how the HOG and CNN methods performed against one another, I went through each image for a movie and manually counted the number of faces.
I only did this for a single movie because it was too tedious to do it for any more.
For Moonrise Kingdom, here's how the two methods compared:

Both methods failed to detect a considerable number of faces.
But the neural network still performed much better than the HOG method, although it also took a lot longer.
The chart below is a histogram of how many more faces the CNN method detected than the HOG method.
Negative values indicate instances where the HOG method actually performed better.
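The comparison above can be sketched in a few lines, assuming the per-image face counts for each method (and the manual counts) are available as equal-length lists in the same image order. The function names are hypothetical, and the numbers in the usage example are placeholders rather than results from this analysis.

```python
from collections import Counter


def difference_histogram(cnn_counts, hog_counts):
    """Map (CNN faces - HOG faces) to the number of images with that difference."""
    if len(cnn_counts) != len(hog_counts):
        raise ValueError("both lists must cover the same images")
    return Counter(c - h for c, h in zip(cnn_counts, hog_counts))


def detection_rate(detected_counts, manual_counts):
    """Total detected faces divided by total manually counted faces.

    This is a crude measure: it does not check that each detection is a
    real face, so false positives inflate it.
    """
    total_manual = sum(manual_counts)
    if total_manual == 0:
        raise ValueError("manual counts sum to zero")
    return sum(detected_counts) / total_manual


if __name__ == "__main__":
    # Placeholder counts for four images, for illustration only
    cnn = [2, 1, 0, 3]
    hog = [1, 1, 1, 1]
    for diff, n_images in sorted(difference_histogram(cnn, hog).items()):
        print(diff, n_images)
```

A negative key in the histogram corresponds to an image where HOG found more faces than CNN.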
Strangely, however, we observed that the CNN method was not as good at clustering the faces into unique groups.
When using the face encodings produced by the HOG detection method, the script identified 13 unique faces.
With the CNN method, however, it identified only 5.
The reason for this is unclear and would be a worthy topic to study in the future.

Unique Faces Detected per Movie

The chart above shows the total number of unique faces detected, broken down by movie and grouped by director.
The values for each director's films fall within a similar range, which is indicative of each director's style and a positive sign about the potential effectiveness of our face detection.
Director’s Style

Part of the reason why Wes Anderson’s numbers are so high is how his films are shot.
His shots are usually very flat.
Stills from The Royal Tenenbaums, The Life Aquatic with Steve Zissou, and Moonrise Kingdom

Like his narrative and dialogue style, his flat shots are direct, efficient, and playful.
This also just happens to lend itself well to our facial recognition algorithm because any time an actor is in a scene, they are almost always directly in front of the camera, looking straight ahead.
Initially, I expected something similar with Ozu’s movies.
His movies are known for scenes where characters are shot head-on, one at a time, during dialogue.
Stills from An Autumn Afternoon (Top) and Tokyo Story (Bottom)

These shots are primarily used during dialogue between two to four characters.
The two screenshots on the left are from a single exchange, as are the three images on the right.
These shots are meant to humanize and create an intimacy with the characters on screen, exposing and capturing all the emotional details throughout the course of an exchange.
Due to numerous scenes like this, I expected the number of unique faces detected for Ozu’s works to be pretty straightforward (although still smaller than Wes Anderson’s numbers, since Ozu typically uses a small cast).
But that wasn’t the case.
It’s clear that Ozu’s numbers are inaccurate and much smaller than the others.
This can be due to a number of reasons.
His movies are older so it may be due to the lack of quality in the images (Tokyo Story and “Green Tea” are also black and white).
Another potential reason is that the cast members are all Asian.
The face_recognition package that we use was trained on over 3 million images, but I’m not familiar with how that training set is broken down by race.
It is possible that the algorithm hasn’t been trained on enough Asian faces to be able to recognize them.
Challenges

Data retrieval and consistency

In order to run the analysis accurately, we’d ideally have consistent, high-quality sources of video.
This can be difficult depending on the film, particularly for some of the older, foreign-language ones.
Disk space

Finding and getting the films can be difficult as it is, and the files usually take up a lot of space.
In addition, we need extra space since these videos are then broken up into at least 1 image per scene.
Storing both the videos and the images can get pretty expensive.
Frontal vs Profile

The face detection package we used almost strictly requires the images to show frontal views of faces.
Beyond a small range of angles, it fails to detect the features that define a face.
Unfortunately, in movies, not all views of a character are full-frontal shots.
There are options for detecting side “profile” views of faces.
Potentially, these could be combined with the data from frontal shots, but that has not yet been done as part of this analysis.
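One way such a combination might work, sketched here as an untested idea rather than something done in this analysis, is to run a separate profile detector on the same still and keep only the profile boxes that do not overlap a box already found. The boxes are assumed to be in the `(top, right, bottom, left)` order that face_recognition's `face_locations` returns; the profile detector itself is hypothetical here (OpenCV, for example, ships a Haar cascade for profile faces, whose `(x, y, w, h)` boxes would need converting first), and the 0.5 overlap threshold is an arbitrary assumption.

```python
def iou(a, b):
    """Intersection over union of two (top, right, bottom, left) boxes."""
    top, right = max(a[0], b[0]), min(a[1], b[1])
    bottom, left = min(a[2], b[2]), max(a[3], b[3])
    inter = max(0, bottom - top) * max(0, right - left)
    area_a = (a[2] - a[0]) * (a[1] - a[3])
    area_b = (b[2] - b[0]) * (b[1] - b[3])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_detections(frontal, profile, threshold=0.5):
    """Keep all frontal boxes, plus profile boxes not overlapping a kept box."""
    merged = list(frontal)
    for box in profile:
        # A profile box that mostly overlaps a kept box is treated as the
        # same face seen by both detectors and is dropped.
        if all(iou(box, kept) < threshold for kept in merged):
            merged.append(box)
    return merged
```

Deduplicating by overlap keeps a face that both detectors pick up from being counted twice in the per-image totals.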
Issue of diversity and representation

This was briefly mentioned above in the discussion of the analysis of Ozu’s movies.
The lack of diversity in training data sets is an ongoing topic of discussion, and our skewed results here might be attributable to that same problem.
To check whether the lack of faces detected in Ozu’s films was related to their mainly Asian casts, I ran a small test.
An Elaboration on Diversity with Crazy Rich Asians

One theory as to why the cast number is so low for Ozu’s movies is that all the cast members are Asian.
If the training data of the face_recognition package did not have enough Asian faces, it’s possible that the package won’t be able to recognize those faces.
There is the other possibility that Ozu’s movies are older and thus, of lower quality.
A quick way to test this was to use a recent movie with plenty of Asian cast members to detect.
“Crazy Rich Asians” was a simple choice that immediately came to mind.
When we ran the same script for “Crazy Rich Asians”, it only detected 3 unique faces, which was clearly wrong.
Even funnier is who it recognized as unique faces:

Two of the three faces that were detected belonged to the two white actors at the very beginning of the movie, both of whom appear for about 30 seconds in the whole movie.
The evidence seemed pretty clear at this point and it’s here that I learned another important lesson: read the documentation thoroughly.
Found in one of the FAQ sections of the face_recognition GitHub: https://github.com/ageitgey/face_recognition/wiki/Face-Recognition-Accuracy-Problems

Question: Face recognition works well with European individuals, but overall accuracy is lower with Asian individuals.

“This is a real problem.
The face recognition model is only as good as the training data.
Since the face recognition model was trained using public datasets built pictures scraped from websites (celebrities, etc), it only works as well as the source data.
Those public datasets are not evenly distributed amongst all individuals from all countries.
I hope to improve this in the future, but it requires building a dataset of millions of pictures of millions of people from lots of different places.
If you have access to this kind of data and are willing to share it for model training, please email me.”

Next Steps

This was a fun exercise to familiarize myself with two new tools: PySceneDetect and face_recognition.
I was satisfied with how PySceneDetect was able to extract each scene, so I can see myself continuing to use this tool in the future.
face_recognition, on the other hand, did not perform up to expectations.
I will be searching for a better face detection tool in the future and would recommend that others do the same.
Some of the most helpful lessons were actually the challenges elaborated above.
It’s likely that we will consistently run into those challenges again no matter what film technique or data we try to extract.
There are several options for future exploration based on this study:

Use other face detection packages and compare their performance
Supplement frontal face detection with profile face detection for higher accuracy

One of the lessons from seeing how face detection works is that the way an algorithm defines the features of a human face is not necessarily how we, as people, would define a face.
Taking a similar approach, my next step will most likely be to extract features of scenes from different directors and group them together based on stylistic similarities.
Source Code

Coming soon… (I’m trying to clean up the code before sharing it on a public GitHub page.)