Recommendation Systems — Models and EvaluationNeerja DoshiBlockedUnblockFollowFollowingJun 19, 2018I’ve been involved in building several different types of recommendation systems, and one thing I’ve noticed is that each use case is different from the next, as each aims to solve a different business problem.
Let’s consider a few examples:Movie/Book/News Recommendations — Suggest new content that increases user engagement.
The aim is to introduce users to new content that may interest them and encourage them to consume more content on our platform.
Stock Recommendations — Suggest stocks that are most profitable to the clients.
The recommendations may be stocks that they have traded in historically.
Novelty does not matter here; profitability of the stock does.
Product Recommendations — Suggest a mix of old and new products.
The old products from users’ historical transactions serve as a reminder of their frequent purchases.
Also, it is important to suggest new products that the users may like to try.
In all of these problems, the common thread is that they aim to increase customer satisfaction and in turn drive business in the form of increased commissions, greater sales, etc.
Whatever the use case may be, the data is typically in the following format:Customer ID, Product ID (Movie/Stock/Product), No: of Units/Rating, Transaction DateAny other feature like the details of the product or demographics of the customerGoing forward, here are the topics I will be covering:Methods used for building recommendation systems — Content-based, Collaborative Filtering, ClusteringEvaluation Metrics — Statistical accuracy metrics, Decision Support accuracy metricsThings to keep in mindhttp://cleverdata.
io/basket-analysis-machine-learning/MethodsThere are 2 major approaches for building recommendation systems — content-based and collaborative filtering.
In the following section, I will discuss each one of them and when they are suitable.
com/an-overview-of-recommendation-systems/Content basedThe gist of this approach is that we match users to the content or items they have liked or bought.
Here the attributes of the users and the products are important.
For example, for movie recommendations, we use features such as director, actors, movie length, genre, etc.
to find similarity between movies.
Furthermore, we can extract features like sentiment score and tf-idf scores from movie descriptions and reviews.
(The tf-idf score of a word reflects how important a word is to a document in a collection of documents).
The aim of content-based recommendation is to create a ‘profile’ for each user and each item.
Consider an example of recommending news articles to users.
Let’s say we have 100 articles and a vocabulary of size N.
We first compute the tf-idf score for each of the words for every article.
Then we construct 2 vectors:Item vector: This is a vector of length N.
It contains 1 for words that have a high tf-idf score in that article, otherwise 0.
User vector: Again a 1xN vector.
For every word, we store the probability of the word occurring (i.
having a high tf-idf score) in articles that the user has consumed.
Note here, that the user vector is based on the attributes of the item (tf-idf score of words in this case).
Once we have these profiles, we compute similarities between the users and the items.
The items that are recommended are the ones that 1) the user has the highest similarity with or 2) has the highest similarity with the other items the user has read.
There are multiple ways of doing this.
Let’s look at 2 common methods:Cosine Similarity: To compute similarity between the user and item, we simply take the cosine similarity between the user vector and the item vector.
This gives us user-item similarity.
To recommend items that are most similar to the items the user has bought, we compute cosine similarity between the articles the user has read and other articles.
The ones that are most similar are recommended.
Thus this is item-item similarity.
Cosine similarity is best suited when you have high dimensional features, especially in information retrieval and text mining.
Jaccard similarity: Also known as intersection over union, the formula is as follows:This is used for item-item similarity.
We compare item vectors with each other and return the items that are most similar.
Jaccard similarity is useful only when the vectors contain binary values.
If they have rankings or ratings that can take on multiple values, Jaccard similarity is not applicable.
In addition to the similarity methods, for content based recommendation, we can treat recommendation as a simple machine learning problem.
Here, regular machine learning algorithms like random forest, XGBoost, etc.
, come in handy.
This method is useful when we have a whole lot of ‘external’ features, like weather conditions, market factors, etc.
which are not a property of the user or the product and can be highly variable.
For example, the previous day’s opening and closing price play an important role in determining the profitability of investing in a particular stock.
This comes under the class of supervised problems where the label is whether the user liked/clicked on a product or not(0/1) or the rating the user gave that product or the number of units the user bought.
Collaborative FilteringThe underlying assumption of the collaborative filtering approach is that if A and B buy similar products, A is more likely to buy a product that B has bought than a product which a random person has bought.
Unlike content based, there are no features corresponding to users or items here.
All we have is the Utility Matrix.
This is what it looks like:A, B, C, D are the users, and the columns represent movies.
The values represent ratings (1–5) a user has given a movie.
In other cases, these values could be 0/1 depending on whether the user watched the movie or not.
There are a 2 broad categories that collaborative filtering can be split into:Memory based approachFor the memory based approach, the utility matrix is memorized and recommendations are made by querying the given user with the rest of the utility matrix.
Let’s consider an example of the same: If we have m movies and u users, we want to find out how much user i likes movie k.
This is the mean rating that user i has given all the movies she/he has rated.
Using this, we estimate his rating of movie k as follows:Similarity between users a and i can be computed using any methods like cosine similarity/Jaccard similarity/Pearson’s correlation coefficient, etc.
These results are very easy to create and interpret, but once the data becomes too sparse, performance becomes poor.
Model based approachOne of the more prevalent implementations of model based approach is Matrix Factorization.
In this, we create representations of the users and items from the utility matrix.
This is what it looks like:Thus, our utility matrix decomposes into U and V where U represents the users and V represents the movies in a low dimensional space.
This can be achieved by using matrix decomposition techniques like SVD or PCA or by learning the 2 embedding matrices using neural networks with the help of some optimizer like Adam, SGD etc.
For a user i and every movie j we just need to compute rating y to and recommend the movies with the highest predicted rating.
This approach is most useful when we have a ton of data and it has high sparsity.
Matrix factorization helps by reducing the dimensionality, hence making computation faster.
One disadvantage of this method is that we tend to lose interpretability as we do not know what exactly elements of the user/item vectors mean.
ClusteringClustering is typically used when your recommendation problem is going to be unsupervised.
If you are early into the business and have very less historical/labeled data you can cluster the observations based on the feature set and then assign recommendations to clusters based on the labels that exist in that cluster.
This solution of course does not give the best results right away but is a good starting point for such cases until enough data is acquired.
Clustering can also be used to generate meta features to the observations.
For example, after clustering, I can assign values from 1-k as a new feature ‘cluster’ to every observation and then train my main model on all the features.
This can be done at user level or product level.
Evaluation MetricsA major obstacle while designing recommendation systems is choosing what metrics to optimize.
This can be tricky because in a lot of cases, the goal is to NOT recommend all the same products that the user has bought before.
So how do you know if your model is doing a good job at suggesting products?Statistical accuracy metricsThey are used evaluate accuracy of a filtering technique by comparing the predicted ratings directly with the actual user rating.
Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Correlation are usually used as statistical accuracy metrics.
MAE is the most popular and commonly used; it is a measure of deviation of recommendation from user’s actual value.
MAE and RMSE are computed as follows:The lower the MAE and RMSE, the more accurately the recommendation engine predicts user ratings.
These metrics are good to use when the recommendations are based on predicting rating or number of transactions.
They give us a sense of how accurate our prediction ratings are, and in turn how accurate our recommendations are.
Decision support accuracy metricsThe popular ones among these are Precision and Recall.
They help users select items that are more similar among available set of items.
The metrics view prediction procedure as a binary operation which distinguishes good items from those items that are not good.
Let’s see them in more detail:Recall@k and Precision@k:These are the go-to metrics used for recommendation systems.
Let us begin with understanding what precision and recall mean for recommendation systems:But, precision and recall don’t seem to care about ordering.
So instead we use precision and recall at cutoff k.
Consider that we make N recommendations and consider only the first element, then only the first two, then only the first three, etc… these subsets can be indexed by k.
Precision and Recall at cutoff k, P@k, and r@k, are simply the precision and recall calculated by considering only the subset of your recommendations from rank 1 through k.
The rank of the recommendations is determined by the predicted value.
, the product with the highest predicted value is ranked 1, the product with the kth highest predicted value is ranked k.
Average Precision:If we have to recommend N items and there are m relevant items in the full space of items, Average Precision AP@N is defined as:where rel(k) is just an indicator (0/1) that tells us whether that kth item was relevant and P(k) is the precision@k.
If we would have recommended 2N items instead of N, the AP@N metric says we only care about the average precision up to the Nth item.
AP rewards you for giving correct recommendations,AP rewards you for front-loading the recommendations that are most likely to be correct,AP will never penalize you for adding additional recommendations to your list — just make sure you front-load the best ones.
Mean Average Precision:AP applies to single data points, like a single user.
MAP@N just goes a step further and averages the AP across all users.
Additional SuggestionsOne suggestion (this is a personal one that I am not too certain about) is to look for content in the ground truth that has not come from the users’ historical data.
This content is then compared with the non-historical content in the recommendations using any of the standard metrics explained above.
This gives us an idea of how good our model is in recommending products that do not directly come from historical transactions.
Things to keep in mindWhile developing a recommendation system, especially for content based recommendation, it is important to remember NOT to optimize only for a single metric.
That is to say, for news article recommendation, it is not advisable to get a very high recall, because this means that we are recommending the users the content they would have naturally consumed without our recommendation.
Thus, we won’t be driving business in any way.
We need to ensure a decent recall/precision as an indicator that our model is able to learn the users’ preferences, not try to make it as high as possible.
Also, we don’t want to lose user engagement in the long run by recommending the same types of things over and over again.
Thus is highly probable if we try to maximize precision@k/recall@k.
Thank you for reading!.I will be happy to receive any feedback or additional information in the comments!ReferencesYannet Interian’s lectures at USF on Recommendation SystemsThis awesome post that explains MAPMy notebook for an illustration of collaborative filtering (matrix factorization) in PyTorchMore on collaborative filtering here:https://towardsdatascience.
com/various-implementations-of-collaborative-filtering-100385c6dfe0About Me: Graduated with MS Data Science at USF and undergrad in Computer Science, I have 2 years of experience in building predictive and recommendation algorithms, and deriving business insights for finance and retail clients.
I am excited about opportunities for applying my machine learning and deep learning knowledge to real-world problems.
Do check out my other blogs here!LinkedIn: https://www.
com/neerjad/Discuss this post on Hacker News.
Editor’s Note: Ready to dive into some code?.Check out Fritz on GitHub.
You’ll find open source, mobile-friendly implementations of the popular machine and deep learning models along with training scripts, project templates, and tools for building your own ML-powered iOS and Android apps.
Join us on Slack for help with technical problems, to share what you’re working on, or just chat with us about mobile development and machine learning.
And follow us on Twitter and LinkedIn for the all the latest content, news, and more from the mobile machine learning world.