Evaluating A Real-Life Recommender System, Error-Based and Ranking-Based

Photo credit: Pixabay

A recommender system aims to find and suggest items of likely interest based on the users’ preferences.

Susan Li, Jan 16

Recommender systems are one of the most valuable applications of machine learning today.

Amazon attributes 35% of its revenue to its recommender system.

Evaluation is an integral part of researching and developing any recommender system.

Depending on your business and available data, there are many ways to evaluate a recommender system.

We will try a few today.

Rating Prediction

In my last post, Building and testing recommender systems with Surprise, we saw that Surprise centers on various machine-learning algorithms for predicting user ratings of items (i.e. ratings prediction).

It requires users to give explicit feedback, such as asking them to rate a book on a scale of 0 to 10 stars after they have bought it.

And then we use this data to build the profile of users’ interests.

The problem with this is that not everyone is willing to leave a rating, so the data tends to be sparse, like this Book-Crossing data set we have seen before:

Figure 1

Most recommender systems attempt to predict what the users would put in the missing entries if they rated the corresponding books.

With too many “NaN”s, the recommender won’t have enough data to understand what the user likes.

However, explicit rating is great if you can convince your users to give ratings to you.

Therefore, if you have the luxury of data and user ratings, then the evaluation metrics should be RMSE or MAE.
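As a library-independent illustration of these two error metrics, here is a minimal sketch in plain Python. The function names and the four sample ratings are made up for this example and are not taken from the original notebook.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more than MAE does."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical ratings for four user-item pairs (illustration only).
actual_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 4.0]

print("MAE: ", mae(actual_ratings, predicted_ratings))
print("RMSE:", rmse(actual_ratings, predicted_ratings))
```

For both metrics lower is better. Because RMSE squares each error before averaging, the single two-star miss in this sample weighs more heavily in RMSE than in MAE.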

Let’s show an example using the MovieLens dataset with the Surprise library.

Top-N

Top-N recommender systems are everywhere, from online shopping websites to video portals.

They provide users with a ranked list of N items they will likely be interested in, in order to encourage views and purchases.

One of Amazon’s recommender systems is a “Top-N” system that produces a list of top results for each individual, like so:

Figure 2

Amazon’s “Top-N” recommendations for me include 9 pages, and there are 6 items on the first page.

A good recommender system should be able to identify a set of N items that will be of interest to a certain user.

Because I seldom buy books at Amazon, my “Top-N” is way off.

In other words, I would probably click on or read only one of the books on my “Top-N” list.

The following scripts produced the top-10 recommendations for each user in the test set.
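One way to build such a list, sketched under the assumption that predictions are available as plain `(user_id, item_id, estimated_rating)` tuples. The helper name `get_top_n` and the sample data are illustrative, not the original script.

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Group predictions by user and keep the n highest estimated ratings."""
    top_n = defaultdict(list)
    for user_id, item_id, estimated in predictions:
        top_n[user_id].append((item_id, estimated))
    for user_id, items in top_n.items():
        # Sort each user's candidates from highest to lowest estimate.
        items.sort(key=lambda pair: pair[1], reverse=True)
        top_n[user_id] = items[:n]
    return dict(top_n)

# Hypothetical predictions (illustration only).
predictions = [
    (2, "A", 4.5), (2, "B", 3.0), (2, "C", 4.9),
    (3, "A", 2.0), (3, "D", 4.1),
]

top_n = get_top_n(predictions, n=2)
for user_id, items in top_n.items():
    print(user_id, items)
```

The result maps each user to a ranked list of `(item_id, estimated_rating)` pairs, which is the shape the ranking metrics in the rest of this post can work from.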


Here is the top-10 we predicted for userId 2 and userId 3.

Figure 3

Hit Rate

Let’s see how good our top-10 recommendations are.

To evaluate the top-10, we use hit rate: if a user rated one of the top-10 items we recommended, we consider it a “hit”.

The process of computing hit rate for a single user:

1. Find all items in this user’s history in the training data.
2. Intentionally remove one of these items (leave-one-out cross-validation).
3. Use all other items to feed the recommender and ask for the top 10 recommendations.
4. If the removed item appears in the top 10 recommendations, it is a hit. If not, it is not a hit.
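The hold-out step can be sketched as follows. This is a simplified stand-in rather than the original script: it assumes each user’s history is a list of unique item IDs, and the names `leave_one_out` and `is_hit` are hypothetical.

```python
import random

def leave_one_out(user_histories, seed=0):
    """Hold out one randomly chosen item per user.

    Returns (train, left_out): train maps user -> remaining items,
    left_out maps user -> the single removed item.
    """
    rng = random.Random(seed)
    train, left_out = {}, {}
    for user_id, items in user_histories.items():
        if len(items) < 2:
            # Assumption: users with a single item stay in training only,
            # since removing it would leave nothing to learn from.
            train[user_id] = list(items)
            continue
        held = rng.choice(items)
        left_out[user_id] = held
        train[user_id] = [item for item in items if item != held]
    return train, left_out

def is_hit(recommended_items, held_out_item):
    """A hit means the removed item shows up in the recommended list."""
    return held_out_item in recommended_items

# Hypothetical user histories (illustration only).
histories = {"u1": ["A", "B", "C"], "u2": ["B", "D"], "u3": ["E"]}
train, left_out = leave_one_out(histories)
print(train)
print(left_out)
```

The recommender is then trained on `train` only, and `is_hit` is applied to each user’s top-10 list and the item stored in `left_out`.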


The overall hit rate of the system is the count of hits divided by the number of test users.

It measures how often we are able to recommend a removed item; higher is better.
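A minimal sketch of that calculation, assuming `top_n` maps each user to a ranked list of `(item_id, estimated_rating)` pairs and `left_out` maps each test user to the held-out item. Both structures and the sample values are assumptions made for this example.

```python
def hit_rate(top_n, left_out):
    """Fraction of test users whose held-out item appears in their top-N."""
    hits = 0
    for user_id, held_out_item in left_out.items():
        recommended = [item_id for item_id, _ in top_n.get(user_id, [])]
        if held_out_item in recommended:
            hits += 1
    return hits / len(left_out)

# Hypothetical top-2 lists and held-out items (illustration only).
top_n = {
    "u1": [("A", 4.8), ("B", 4.5)],
    "u2": [("C", 4.9), ("D", 4.1)],
    "u3": [("E", 4.7), ("A", 4.2)],
    "u4": [("B", 4.6), ("C", 4.0)],
}
left_out = {"u1": "B", "u2": "X", "u3": "E", "u4": "Z"}

print("Hit rate:", hit_rate(top_n, left_out))
```

Here two of the four test users (u1 and u3) find their held-out item in the list, so the hit rate is 2 / 4.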

A very low hit rate often means we simply do not have enough data to work with, just as Amazon’s hit rate for me would be terribly low because it does not have enough of my book purchase data.

Hit Rate by Rating Value

We can also break down hit rate by predicted rating values.

Ideally, we want to predict the movies users like, so we care about high rating values, not low ones.
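A sketch of such a breakdown follows. It assumes each held-out record carries a rating alongside the user and item, here taken to be the rating attached to the held-out item; which rating is stored there is a design choice. The function name and data are illustrative.

```python
from collections import defaultdict

def hit_rate_by_rating(top_n, left_out):
    """Hit rate computed separately for each rating value.

    left_out is a list of (user_id, item_id, rating) records.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user_id, item_id, rating in left_out:
        totals[rating] += 1
        recommended = {i for i, _ in top_n.get(user_id, [])}
        if item_id in recommended:
            hits[rating] += 1
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}

# Hypothetical top-2 lists and held-out records (illustration only).
top_n = {
    "u1": [("A", 4.8), ("B", 4.5)],
    "u2": [("C", 4.9), ("D", 4.1)],
    "u3": [("E", 4.7), ("A", 4.2)],
    "u4": [("B", 4.6), ("C", 4.0)],
    "u5": [("A", 4.4), ("D", 4.3)],
}
left_out = [
    ("u1", "B", 5.0), ("u2", "X", 3.0), ("u3", "E", 5.0),
    ("u4", "Z", 4.0), ("u5", "Q", 5.0),
]

rates = hit_rate_by_rating(top_n, left_out)
for rating, rate in rates.items():
    print(rating, round(rate, 3))
```

In this toy data the hits are concentrated at rating 5 (two of three), while ratings 3 and 4 have none.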


Our hit rate breakdown is exactly what I’d hoped for: the hit rate for rating score 5 is much higher than for 4 or 3.

Higher is better.

Cumulative Hit Rate

Because we care about higher ratings, we can ignore the predicted ratings lower than 4 and compute the hit rate for ratings >= 4 only.
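A sketch of this variant is below, with the cutoff exposed as a parameter. As in the other examples, the record layout `(user_id, item_id, rating)` and the name `cumulative_hit_rate` are assumptions for illustration.

```python
def cumulative_hit_rate(top_n, left_out, rating_cutoff=4.0):
    """Hit rate over held-out records whose rating is at least rating_cutoff."""
    hits = 0
    total = 0
    for user_id, item_id, rating in left_out:
        if rating < rating_cutoff:
            continue  # ignore low-rated items entirely
        total += 1
        if item_id in {i for i, _ in top_n.get(user_id, [])}:
            hits += 1
    return hits / total if total else 0.0

# Hypothetical top-2 lists and held-out records (illustration only).
top_n = {
    "u1": [("A", 4.8), ("B", 4.5)],
    "u2": [("C", 4.9), ("D", 4.1)],
    "u3": [("E", 4.7), ("A", 4.2)],
    "u4": [("B", 4.6), ("C", 4.0)],
    "u5": [("A", 4.4), ("D", 4.3)],
}
left_out = [
    ("u1", "B", 5.0), ("u2", "X", 3.0), ("u3", "E", 5.0),
    ("u4", "Z", 4.0), ("u5", "Q", 5.0),
]

print("Cumulative hit rate:", cumulative_hit_rate(top_n, left_out))
```

The record rated 3.0 is dropped from both the numerator and the denominator, so the rate is computed over the four remaining records.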


Higher is better.

Average Reciprocal Hit Ranking (ARHR)

This is a commonly used metric for ranking evaluation of Top-N recommender systems that only takes into account where the first relevant result occurs.

We get more credit for recommending an item the user rated near the top of the ranking than near the bottom.

Higher is better.
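A sketch of ARHR under the same assumed data layout as before: `top_n` maps users to ranked `(item_id, estimated_rating)` lists and `left_out` maps users to their held-out item. A hit at rank k contributes 1/k, and a miss contributes 0.

```python
def average_reciprocal_hit_rank(top_n, left_out):
    """Average of 1/rank of the held-out item over all test users."""
    total = 0.0
    for user_id, held_out_item in left_out.items():
        ranked = top_n.get(user_id, [])
        for rank, (item_id, _) in enumerate(ranked, start=1):
            if item_id == held_out_item:
                total += 1.0 / rank
                break
    return total / len(left_out)

# Hypothetical top-3 lists and held-out items (illustration only).
top_n = {
    "u1": [("A", 4.9), ("B", 4.5), ("C", 4.1)],
    "u2": [("D", 4.8), ("E", 4.6), ("F", 4.2)],
    "u3": [("A", 4.7), ("G", 4.3), ("H", 4.0)],
}
left_out = {"u1": "A", "u2": "F", "u3": "Z"}

print("ARHR:", round(average_reciprocal_hit_rank(top_n, left_out), 4))
```

In this example u1 scores 1 (hit at rank 1), u2 scores 1/3 (hit at rank 3) and u3 scores 0, so a hit in the top slot counts three times as much as one in the third slot.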


Your first real-life recommender system will likely be of low quality, as will your mature recommender system for new users, but it is still much better than no recommender system at all.

One of the objectives of a recommender system is to learn the preferences of users, including new users, so that they can begin receiving accurate personalized recommendations from the system.

However, if you are just starting out and your website is completely new, a recommender system cannot serve anybody with personalized recommendations, because it has no ratings from anybody.

This then becomes a systemic bootstrapping problem.

The Jupyter notebook can be found on GitHub.

Enjoy the rest of the week.

Reference: Building Recommender Systems with Machine Learning and AI.
