Our model object is trained and we can now ask it to predict user ratings!

Performance evaluation

Asking the right questions

In this article, we limit performance evaluation to a simple visual check since we are just looking to see if we can reject our understanding of how recommendation engines work (we are not calculating performance metrics or looking for the best model).
We only want to verify that our model:
- Has learned the [product_id, rating] “checker” patterns of each user category.
- Makes sensible rating predictions when presented with combinations of [user_id, product_id] unseen during training.
This is done in 3 steps:
1. Find all the [user_id, product_id] combinations where product_id has not been rated by user_id but has been rated by other users in the user_id’s group.
2. For each valid [user_id, product_id] combination, use the recommendation model to get a predicted_rating.
3. Plot the [product_id, predicted_rating] combinations for each user category (A, B, C) and verify that the training pattern has indeed been learned and used to predict the ratings.
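Step 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `ratings` list, the `user_group` mapping and the `candidate_pairs` helper are hypothetical names and made-up data, not the code used for the experiment.

```python
from collections import defaultdict

def candidate_pairs(ratings, user_group):
    """Return the (user_id, product_id) pairs to ask the model about.

    ratings:    iterable of (user_id, product_id, rating) seen in training
    user_group: dict mapping user_id -> group label ("A", "B", "C")
    """
    rated_by_user = defaultdict(set)   # user_id -> products this user rated
    rated_by_group = defaultdict(set)  # group -> products rated by any member
    for user_id, product_id, _rating in ratings:
        rated_by_user[user_id].add(product_id)
        rated_by_group[user_group[user_id]].add(product_id)

    pairs = []
    for user_id, group in user_group.items():
        # Products known to the group but not yet rated by this user.
        unseen = rated_by_group[group] - rated_by_user[user_id]
        for product_id in sorted(unseen):
            pairs.append((user_id, product_id))
    return pairs

# Tiny made-up example: two users in group A, one user in group B.
ratings = [(1, 10, 9.0), (1, 11, 1.0), (2, 10, 8.5), (3, 12, 2.0)]
user_group = {1: "A", 2: "A", 3: "B"}
pairs = candidate_pairs(ratings, user_group)
print(pairs)
```

Only user 2 yields a pair here: product 11 was rated by another member of group A but not by user 2.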
To make a prediction with a scikit-surprise model:

predicted_rating = model.predict(str(user_id), str(product_id))

(In scikit-surprise, predict returns a Prediction object; the estimated rating itself is stored in its est attribute.)

Note: logically, the model should not be able to predict ratings for every possible [user_id, product_id] combination. In particular, it should not be able to predict ratings in these cases:
- Unknown user_id or product_id: the value was not included in the training data, so the model does not know what this user likes or who likes this product.
- Unknown [user_id, product_id] association: the training data did not include a rating for this product_id coming from one of the users of this user_id’s group.
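A small helper along these lines can collect the predictions for a list of pairs. It is a sketch, not the code from the experiment: `collect_predictions` is a hypothetical name, and the function only assumes that `model.predict` returns an object shaped like scikit-surprise's `Prediction`, with an `est` attribute and a `details` dictionary that carries a `was_impossible` entry.

```python
def collect_predictions(model, pairs):
    """Ask the model for a rating for each (user_id, product_id) pair.

    Assumes model.predict(uid, iid) returns an object with an `est`
    attribute and a `details` dict, as scikit-surprise's Prediction does.
    """
    rows = []
    for user_id, product_id in pairs:
        # The ids are passed as strings, as in the call shown in the article.
        prediction = model.predict(str(user_id), str(product_id))
        rows.append({
            "user_id": user_id,
            "product_id": product_id,
            "predicted_rating": prediction.est,
            # Default to False if the flag is missing from `details`.
            "was_impossible": prediction.details.get("was_impossible", False),
        })
    return rows
```

The returned list of dictionaries can be handed to `pandas.DataFrame` for inspection or plotting.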
Results

The dataframe below is a small extract of our predictions.
Note the was_impossible flag: I suspect it is set to TRUE if we asked one of these “impossible questions” as described above.
All of our flags were FALSE since we carefully selected the “possible” questions during our experiment.
[Table: Ratings predictions dataframe]

Finally, and as promised, our visual check: here are the visual representations of the [product_id, predicted_rating] mappings.
It is quite obvious that the model has learned the opposite “checker” patterns from the training data for categories A and B while it has found no pattern for user category C.
Visually, our expectations are met.
[Figure: Ratings predictions for users from group A]
[Figure: Ratings predictions for users from group B]
[Figure: Ratings predictions for users from group C]

We can also look at the averages of the actual vs. predicted ratings per product and for each user category.
The presence of a clear linear relationship between avg_rating and avg_predicted_rating (with a slope close to 1) for categories A and B shows that the model learned the association between user_id and rating.
There was no rating pattern to learn for user group C (since this group was designed to have unpredictable taste).
There is almost no association between avg_rating and avg_predicted_rating for this user group.
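The comparison of averages can be sketched as below, using only the standard library. The helper names (`per_product_averages`, `slope`) and the numbers are made up for illustration; the least-squares slope is one possible way to quantify "a slope close to 1", not necessarily how the plots in this article were produced.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def per_product_averages(rows):
    """rows: iterable of (product_id, rating). Returns {product_id: mean rating}."""
    buckets = defaultdict(list)
    for product_id, value in rows:
        buckets[product_id].append(value)
    return {pid: mean(vals) for pid, vals in buckets.items()}

def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Made-up (product_id, rating) rows for a single user category.
actual = [(10, 9.0), (10, 8.0), (11, 1.0), (11, 2.0), (12, 5.0)]
predicted = [(10, 8.0), (10, 8.0), (11, 2.0), (11, 2.0), (12, 5.0)]

avg_rating = per_product_averages(actual)
avg_predicted_rating = per_product_averages(predicted)

# Compare only the products present on both sides.
products = sorted(avg_rating.keys() & avg_predicted_rating.keys())
xs = [avg_rating[p] for p in products]
ys = [avg_predicted_rating[p] for p in products]
fitted_slope = slope(xs, ys)
print(fitted_slope)
```

A slope near 1 suggests the predicted averages track the actual ones; a slope near 0 is what one would expect for a group with no learnable pattern.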
[Figure: Average actual vs. predicted ratings for users from group A]
[Figure: Average actual vs. predicted ratings for users from group B]
[Figure: Average actual vs. predicted ratings for users from group C]

These are the results I would expect to see if my understanding (my «hypothesis») was correct.
So while I have not been able to reject my hypothesis – and am cognizant that it still could be wrong – the outcome of my experiment makes me more confident that my foundational understanding of recommendation engines is correct.
What would I do next?

First

Now that I have a basic foundational understanding of recommendation engines, I would go back to the theory and learn more about how the most popular algorithms work in detail.
I would form an opinion on which algorithm should work best under which circumstances and why.
Second

I would want to find the best way to solve this simple use case.
As mentioned earlier, all we did was to visually check that “things made sense”.
Additional effort is required to go from here to a situation where we have a “good model”.
For this new goal, having a numeric performance metric becomes a must-have to navigate improvement candidates and progress efficiently.
A possible approach to get a basic metric (to be minimized) is to:
1. Split the raw data between training and testing datasets.
2. Train using the training set (duh :D).
3. Predict ratings using the [user_id, product_id] combinations from the testing dataset.
4. Calculate the average or sum of the absolute values of the normalized errors (ANE or SNE) between the predicted and actual ratings.
By normalized, we mean divided by the potential maximum error at a particular rating value.
Examples:
a) For an actual rating of 0.0 or 10, the maximum absolute value of the prediction error is 10: normalize the error using a factor of 1/10.
b) For an actual rating of 5, the maximum absolute value of the prediction error is 5: normalize the error using a factor of 1/5.
This is to make sure that the errors at the edges count as much as those at the center of the prediction range.
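The normalization described above can be written as a minimal sketch. It assumes a rating scale from 0 to 10, as in the examples; the function names `normalized_error`, `ane` and `sne` and the sample pairs are hypothetical.

```python
def normalized_error(actual, predicted, low=0.0, high=10.0):
    """Signed prediction error divided by the largest error possible
    for this actual rating on the [low, high] scale."""
    max_error = max(actual - low, high - actual)
    return (predicted - actual) / max_error

def ane(pairs, low=0.0, high=10.0):
    """Average of the absolute normalized errors over (actual, predicted) pairs."""
    errors = [abs(normalized_error(a, p, low, high)) for a, p in pairs]
    return sum(errors) / len(errors)

def sne(pairs, low=0.0, high=10.0):
    """Sum of the absolute normalized errors over (actual, predicted) pairs."""
    return sum(abs(normalized_error(a, p, low, high)) for a, p in pairs)

# Made-up (actual, predicted) pairs: an error of 2 at the edges (factor 1/10)
# weighs the same as an error of 1 at the center (factor 1/5).
pairs = [(0.0, 2.0), (5.0, 6.0), (10.0, 8.0)]
print(ane(pairs), sne(pairs))
```

Both metrics are bounded per prediction by 1, which makes them comparable across actual rating values.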
The ANE or SNE can then be minimized to find the optimal set of hyperparameters for the recommendation model and/or to choose the best algorithm (we only evaluated SVD in this post; there are other options).
Note: the normalized errors are signed values. As such, I would also recommend keeping an eye on the normalized errors’ standard deviation or their density during the optimization process. (Is less error on average but more spread a real improvement?)
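To illustrate the question, here is a small sketch comparing two made-up sets of signed normalized errors. The `summarize` helper and both candidate lists are hypothetical; the population standard deviation is just one possible measure of spread.

```python
from statistics import mean, pstdev

def summarize(signed_errors):
    """Return the mean absolute error and the spread of signed normalized errors."""
    return {
        "mean_abs": mean([abs(e) for e in signed_errors]),
        "std": pstdev(signed_errors),
    }

# Candidate A: small errors everywhere. Candidate B: mostly exact, one big miss.
candidate_a = [0.2, -0.2, 0.2, -0.2]
candidate_b = [0.0, 0.0, 0.6, 0.0]

summary_a = summarize(candidate_a)
summary_b = summarize(candidate_b)
print(summary_a)
print(summary_b)
```

Candidate B has the lower average absolute error but the larger spread, which is exactly the kind of trade-off a single averaged metric would hide.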