Many US-based users, for instance, probably think of Hollywood when they think of movies. They may not want foreign films in their results (The Chaos Class is a Turkish comedy), and they might be confused if a silent movie from 1891 were recommended right after Avengers: Endgame or The Lion King.
Fortunately, this problem is easily solved through further filtering.
For instance, here’s another popularity chart, restricted to movies released in 1990 or later.
Note that because I’m changing the shape of my initial set, the thresholds for my qualified movies will change.
In this case, my threshold of total votes changed from 7,691 to 11,707 votes, and I looked at movies with an unweighted average rating of 6.13 or above.
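Here is a minimal sketch of that idea: filter first, then recompute both thresholds on whatever is left. The column names (`year`, `votes`, `rating`), the helper name `qualified_movies`, and the use of a vote-count quantile as the cutoff are my assumptions for illustration, not necessarily how the numbers above were produced.

```python
import pandas as pd

def qualified_movies(movies, min_year, vote_quantile=0.95):
    """Filter by release year, then recompute both thresholds on the subset.

    Assumptions (hypothetical): the column names, and a vote-count
    quantile as the minimum-votes cutoff.
    """
    subset = movies[movies["year"] >= min_year]
    # Both thresholds depend on the filtered set, so they shift with every filter
    min_votes = subset["votes"].quantile(vote_quantile)
    mean_rating = subset["rating"].mean()  # unweighted average rating
    keep = (subset["votes"] >= min_votes) & (subset["rating"] >= mean_rating)
    return subset[keep], min_votes, mean_rating

# Toy rows, invented purely for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "year": [1985, 1994, 1999, 2008, 2015],
    "votes": [9000, 12000, 500, 20000, 300],
    "rating": [8.0, 8.5, 7.0, 9.0, 4.0],
})
chart, min_votes, mean_rating = qualified_movies(toy, min_year=1990, vote_quantile=0.5)
print(chart["title"].tolist(), min_votes, mean_rating)
```

Dropping the 1985 movie changes both the vote cutoff and the average rating, which is why the thresholds have to be recalculated rather than reused.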
Top 12 ‘popular’ movies from 1990 on

We can also choose to filter further, creating popularity lists for specific countries, directors, actors, genres, languages, and more, making sure to recalculate the thresholds each time.
For these filters, in Python, a single movie will often have several values per feature in your dataframe (e.g., The Dark Knight has action, crime, drama, and thriller all listed as genres). One solution is pandas’ DataFrame.stack, which reshapes the data to one row per movie and value, so you can then include every movie that matches a single value of your target feature.
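As a rough sketch of the stacking approach, assuming genres are stored as a comma-separated string in a `genres` column (the column names, the helper `one_row_per_value`, and the toy rows are hypothetical):

```python
import pandas as pd

def one_row_per_value(df, column, sep=", "):
    """Return a long-format frame with one row per (movie, value) pair."""
    values = (
        df[column]
        .str.split(sep, expand=True)      # wide: one column per value slot
        .stack()                          # long: one row per (movie, slot)
        .dropna()                         # discard empty slots on any pandas version
        .reset_index(level=1, drop=True)  # keep only the original row index
        .rename(column)
    )
    return df.drop(columns=column).join(values)

# Toy rows; "Movie A" and "Movie B" are made up for illustration
movies = pd.DataFrame({
    "title": ["The Dark Knight", "Movie A", "Movie B"],
    "genres": ["Action, Crime, Drama, Thriller", "Comedy", "Comedy, Drama"],
})
long_format = one_row_per_value(movies, "genres")
comedies = long_format[long_format["genres"] == "Comedy"]
print(comedies["title"].tolist())
```

If your genres are already stored as Python lists, `DataFrame.explode` gets you to the same long format in one call.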
Top 10 movies for Japanese language and top 10 movies for Comedy

Overall, popularity or Top N lists are a great starting point for recommendation systems, whether you’re compiling them for a personal blog or movie recommendations, with some key points to keep in mind:

By default, Top N lists might not return the metrics you want.
For film buffs or critics looking for their next diamond in the rough, a popularity chart is not going to cut it.
It returns the popular, and the known.
Remember the kernel density estimate plot above, with the long tail? With our thresholds on the full set (7,691 total votes and an unweighted average rating of 6.14), the initial qualifying list contains 6,532 movies.
Recall that the entire dataset has 265,000 movies (more precisely, 265,417).
This means that only 2.46% of movies were even considered for these Top N charts (and remember, 1.5% of movies in my entire scraped dataset accounted for nearly 80% of the total votes).
Not exactly the metrics we were hoping for in terms of coverage.
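The coverage figure is just the share of the catalog that can ever show up in a chart. A quick check using the two counts quoted above (the function name `catalog_coverage` is mine):

```python
def catalog_coverage(n_recommendable, n_total):
    """Fraction of the catalog that could ever be recommended."""
    return n_recommendable / n_total

qualified = 6_532   # movies that pass both thresholds
total = 265_417     # movies in the entire dataset

coverage = catalog_coverage(qualified, total)
print(f"{coverage:.2%}")  # 2.46%
```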
Other metrics are a bit trickier to assess in this case: we get as much diversity as we look for.
Our very first chart has different countries, years, genres, and actors.
Filtering down to ‘Japanese comedies’ might give us less diversity, but that narrower list is exactly what the user is looking for.
Novelty, in a popularity chart, is also probably lacking.
Someone might look at a 250 Top Movies chart and think, ‘Oh yeah, I forgot about that movie.’ But they probably won’t be surprised to see Titanic or Pulp Fiction.
The second thing to consider with popularity charts is that they are 100% unpersonalized.
If you implement a popularity filter or chart on your blog or website, everyone who comes to that website will see the same results.
Compare that with Reddit: the top posts in the history of the website are all years old, but you aren’t presented with those years-old posts when you visit the ‘hot’ page. Its algorithm accounts for and weights by the age of each post, and generally only displays posts that are less than 6 hours old.
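To make the idea of weighting by age concrete, here is a toy time-decay score. To be clear, this is not Reddit’s actual formula: the exponential decay, the function name `decayed_score`, and the 6-hour half-life (borrowed loosely from the figure above) are all my own assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def decayed_score(votes, posted_at, now, half_life_hours=6.0):
    """Halve an item's score for every `half_life_hours` of age (hypothetical scheme)."""
    age_hours = (now - posted_at).total_seconds() / 3600
    return votes * 0.5 ** (age_hours / half_life_hours)

now = datetime(2020, 1, 1, 12, tzinfo=timezone.utc)

# An old favorite with many votes versus a fresh post with few (toy numbers)
old_hit = decayed_score(1000, now - timedelta(hours=48), now)
fresh_post = decayed_score(50, now - timedelta(hours=1), now)

print(fresh_post > old_hit)  # True: the fresh post outranks the old hit
```

A plain popularity chart has no such term, so the all-time favorites never move.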
The last thing to consider with straight popularity charts is that we’re making an assumption about what people like.
The entire idea behind a popularity recommender is that because a lot of people liked it, a person at random will also like it.
This obviously isn’t always the case.
Just because 1,000,000 people rave about The Lord of the Rings: The Fellowship of the Ring doesn’t mean a person who dislikes the franchise will suddenly like it.
I don’t like horror movies, and no metric of popularity is going to make me change my mind.
Additionally, while further-filtered charts do address this, I doubt very much a parent of a five-year-old would look at our very first chart and find anything suitable to watch with their child.
Overall, popularity charts are: simple, easy to implement, and good first steps into recommending products, pages, or other services to users.
Popularity charts are not: personalized, deep-diving, or going to recommend to you the movie you never knew you needed (unless that movie is Avatar).
Thanks for reading! Leave any questions in the comments below, and check out my Python notebook or my GitHub repo for this project if you’re so inclined.
My next post will deal with content-based recommenders: using movie metadata and tags such as genre, MPAA rating, plot keywords, cast & crew, languages, and more to recommend movies using NLP vectorization and distance functions, while considering scalability.