We now have a table called “movies_with_rating” that has all the IMDb movies with their ratings (and all the other information in the 2 tables we used).
Let’s take a look at the top row:We see some interesting information about the first movie in the table.
We don’t need all this data for our future models, however.
All we care about for this project is getting the top 10,000 movies by numVotes, and using movie keywords to find similarity between movies.
We can do that easily using the pandas’ .
We can now turn our attention to getting the plot information we want to use for our recommender models.
We can get such plot information using the wonderful IMDbPY API: https://imdbpy.
You will need to install the API using pip or conda.
Before using the API, we need to import it and instantiate an IMDb movie object.
We’ll do that and also import a couple other libraries that make running for-loops with an API a nicer experience:The syntax for pulling plot keywords from IMDb is as follows:In the above code, I first instantiate a dictionary to store the keywords I will get back from IMDb (in line 1).
I will use the movie IDs as keys and the list of keywords for each movie as values.
In the for loop beginning on line 2, I loop through the indexes of the top 10,000 movies.
On line 5, I use each movie index to get the corresponding IMDb movie object.
Note that on line 5 I subset the movie index from index 2 and on because the indexes start with ‘tt’ followed by a number and we only need the number.
I then subset the movie object by ‘plot outline’ to get the required plot outline, and then store the resulting keywords list in my keywords dictionary as a value corresponding to the movie index key.
Note my use of the tqdm and sleep libraries in the code above.
tqdm allows you to see where you are in the running of a for-loop.
When you are running a for-loop 10,000 times (as we are), it’s nice to know how far along you are.
The sleep library allows us to be courteous to IMDb.
Given that we are asking the website to give us information 10,000 times, it is nice to space our requests to it so that we do not overwhelm the server.
After finishing the for-loop, we have a dictionary where the keys are movie IDs, and the values are keywords lists.
It looks like this:We should convert this dictionary to a pandas DataFrame to take advantage of pandas’ abilities.
However, if we just do a simple pd.
DataFrame(keywords_dict), pandas will complain that our dictionaries aren’t the same length.
So we need a solution to get us what we want, as illustrated in the following code:In line 1, we create a new dictionary where we have the same keys but different values from our earlier dictionary.
Instead of the values being keywords lists, they are now pandas’ Series.
We can convert this dictionary to a DataFrame and then join all the keywords (which are now their own columns) with a simple lambda function in line 10.
After applying the lambda function we get back a Series, so we can convert this Series to a DataFrame, rename our columns, and save to a CSV.
We don’t need to save the plot outlines to a CSV, but anytime you obtain data from a process that takes a long time, it is very smart to save it so that you do not have to re-run the process that got you the data.
To use our plot outlines data, we can load it from the keywords.
csv file, rename the 2 columns that exist in the file (i.
the key and value) and join it to our existing table:And that’s it!We now have a dataset of plot text information that we can use to calculate similarity between movies.
EDABefore creating our models, let’s do some quick EDA.
Text data is not the easiest data to do exploratory data analysis with, so we’ll keep the EDA short.
The first thing we can do is take a look at the distribution for the number of characters in each keyword list.
Let’s first get rid of movies with no keywords:We can now find the length (in characters) of each keyword list:We see from the distribution and the average number of character per column that the distribution is quite right-tailed.
Let’s see if we get a similar-looking distribution if we plot the distribution for the number of words in each keyword list:Yep, also quite right-tailed.
So we looked at 2 different measures for our keyword lists — number of characters of the list and number of words in the list — and they showed the same pattern.
Which makes sense, as the measures are intuitively correlated.
What the right-tails tell us is that most movies do not have a lot of keywords or characters but a small few — presumably the most successful movies — have a lot of keywords and characters.
Creating ModelsWe can now create our models.
First, I would like to discuss a point about NLP and using text in general.
Ordinarily, when using text, you do not have nice, easy-to-use keyword lists.
This would have been the case for us if we had decided to use plot synopses instead of keyword lists (which would have happened if IMDb did not provide movie keywords).
In that case, there are a number of standard, simple steps you can take in order to essentially create your own keyword lists.
Removing stop words4.
Lemmatizing wordsThese can all be accomplished relatively simply using the nltk library.
Back to our models: We are going to create 3 different models, based both on different notions of similarity as well as different formulations of movie text ‘vectors’.
The first 2 models will use cosine similarity, and the last one will use Jaccard similarity.
Within the first 2 models, the first will use tf-idf to create movie vectors, and the second will use simple word counts.
We start by selecting the columns we will need.
Since we are creating keyword-based recommenders, we need the keywords column.
Because we also want to input a movie title and get back other movies, we will need the ‘primaryTitle’ column.
So we will select those 2 columns.
We also reset the index because our recommending function later will use a movie’s position in the DataFrame, not its IMDb ID, for recommendations:Next, we need to get a list of lists, where the inner lists are the lists of keywords for each movie.
What we have in our keywords column currently for each movie is a string with all the keywords in it and separated by commas.
We can get such a list of lists by using the nltk library:In line 2, we get the aforementioned list of strings.
In line 4, we import the work_tokenize module from the nltk library.
In line 5, we have a list comprehension where we loop through every keyword string and for each one we get the individual words with word_tokenize.
Note that word_tokenize(keyword.
lower()) returns a list of keywords, so we end up with a list of lists.
We are not quite done at this point since word_tokenize left us with many commas in our list of keyword lists.
We can easily get rid of them by defining a custom function, no_commas, and applying it to every keyword list in our list of keyword lists:Nice.
So we now have what we want in processed_keywords.
We can now create our first model, a tf-idf model that uses cosine similarity between word-vectors.
I’m not going to explain how those work here.
A quick Google search should return many helpful resources.
We start by creating a dictionary of words using gensim:The dictionary here just contains every word in our processed keywords list matched with an ID.
Next, we create a genism corpus, where corpus here just means ‘bag-of-words for each movie’:Next, we can convert these bags-of-words into tf-idf models using a genism tf-idf model:Finally, we create an index for our set of movie keywords that allows us to compute similarities between any given set of keywords and the keywords of every movie in our dataset:We are now at the point where we can write a function that accepts a movie, and returns the “n” most similar movies.
The way our function works will be as follows:1.
Accept a movie from a user2.
Accept “n” from the user where “n” is how many movies the user wants to be returned3.
Retrieve the movie’s keywords4.
Convert the movie’s keywords into a bag-of-words5.
Convert the bag-of-words representation into a tf-idf representation6.
Use the tf-idf representation as a query doc to query our similarity measure object7.
Get back the similarity results for every movie in our set, sort the set by decreasing similarity, and return the “n” most similar movies along with their similarity measures.
Do not return the most similar movie because every movie will be most similar to itselfIn code:Now let’s test it out.
Because the Avengers are so popular these days, let’s give the original movie to our function:All seem to be good matches.
Note that this kind of recommender system isn’t limited to just recommending movies based off of other movies.
We can query our similarity measure object using any given set of keywords.
It just so happens that most people will want to query it using movies they already know, but it is not limited to that.
Here’s the more general function for recommending based on provided keywords:We can move on to our second recommender, which also uses cosine similarity to calculate the similarity between word vectors.
It will differ from the first model by using simpler word counts instead of creating tf-idf word vectors.
This alternate implementation is performed with the CountVectorizer class in scikit-learn, which converts a collection of text documents (in this case, keyword lists) into a matrix of token counts.
We will also compute cosine similarity a little differently for simplicity.
Instead of using MatrixSimilarity, we will use the cosine_similarity function from the scikit-learn metrics.
We compute the word counts for each movie with the following code:Then, when given a movie, all we need to do is compute the cosine similarity between that movie’s word vector and every other word vector and return the most similar matches.
This is achieved with our cosine_recommender function:Let’s try it out with “The Avengers” as well:It works!.Similar matches to our first model.
For our third model, we use Jaccard similarity instead of cosine similarity.
As a result, we have no need for word vectors at all.
We simply calculate similarity as the intersection of the set of keywords divided by the union of the set of keywords.
The code for computing Jaccard Similarity between any two lists of keywords is straightforward:Creating a recommender model from this is correspondingly simple.
We find the keyword list for the given inputted movie, compute the Jaccard similarity between that list and every other movie keyword list and then rank the movies by their similarities and return the top “n” results.
In code:Testing this with our favourite movie, we get back good results:Next StepsThere are a number of ways we can improve the models as they currently exist.
At the top of the list is deploying the models with Flask so people can use them.
Usability is pretty important.
The next step would be to find some way of measuring performance for our models.
This is tricky because recommender systems often have no obvious evaluation criteria.
Still, we can come up with some (e.
using user feedback to rank models), and there are good resources out there for doing so.
A third improvement to consider is using deep-learning methods on plot summaries with embeddings: https://tfhub.
These methods should allow us to incorporate context into our recommendations, rather than simple individual words like our models currently use.
Lastly, the models could be improved by incorporating other non-text features such as genre or numeric features.
Check back in this space in a few weeks time.
I might do another blog post where I implement these improvements.
Please let me know what you think of the project and if you have any recommendations for improvement.
All constructive comments are much appreciated.
ReferencesFor inspiration for the use of the Gensim MatrixSimilarity class to compare documents, I used O’Rielly’s wonderful tutorial “How do I compare document similarity using Python”: https://www.
com/learning/how-do-i-compare-document-similarity-using-pythonFor inspiration for the use of Jaccard and cosine similarity recommenders, I am indebted to Sanket Gupta for his tutorial “Overview of Text Similarity Metrics in Python”: https://towardsdatascience.