However, it should be noted that each of these matrices is randomly initialized.

Therefore, for these embeddings to accurately characterize both users and items, they must be trained on a set of known ratings so that the model's predictions generalize to ratings that are not yet known.

Such a model can be implemented with relative ease using the Embedding class in PyTorch, which creates a 2-dimensional embedding matrix.
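As a quick illustration, an embedding matrix can be created and indexed like this (the sizes here are arbitrary examples, not values from the article):

```python
import torch
import torch.nn as nn

# Embedding matrix: one 40-dimensional latent vector per user (1,000 users total)
user_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=40)

# Look up the latent vectors for users 3 and 17
ids = torch.tensor([3, 17])
vectors = user_embedding(ids)
print(vectors.shape)  # torch.Size([2, 40])
```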

Using two of these embeddings, the probabilistic matrix factorization model can be created in a PyTorch module as follows:

Figure 6: Simple matrix factorization implementation

The matrix factorization model contains two embedding matrices, which are initialized inside of the model's constructor and later trained to accurately predict unknown user-item ratings.
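The code in Figure 6 is not reproduced here, but a minimal sketch of such a module might look like the following (the class and variable names are my own, not necessarily those in the figure):

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Probabilistic matrix factorization without bias (sketch of Figure 6)."""
    def __init__(self, n_users, n_items, n_factors=40):
        super().__init__()
        # Randomly initialized user and item embedding matrices
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)

    def forward(self, cats):
        # cats is an Nx2 matrix of (user ID, item ID) index pairs
        users, items = cats[:, 0], cats[:, 1]
        # Predicted rating = inner product of the corresponding latent vectors
        return (self.user_emb(users) * self.item_emb(items)).sum(dim=1)

# Example: predict ratings for two user-item pairs
model = MatrixFactorization(n_users=100, n_items=50)
preds = model(torch.tensor([[0, 3], [7, 21]]))
print(preds.shape)  # torch.Size([2])
```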

In the forward function (used to predict a rating), the model is passed a mini-batch of index values, which represent the identification numbers (IDs) of different users and items.

Using these indices, passed in the “cats” parameter (an Nx2 matrix) as user-item index pairs, the predicted ratings are obtained by indexing the user and item embedding matrices and taking the inner product of the corresponding vectors.

Adding Bias

The model from Figure 6 is lacking an important feature that is needed for the creation of a powerful recommendation system: the bias term.

The bias term is a constant value that is assigned to each user and item in the system.

When the predicted rating for a user-item pairing is computed, the bias for both the user and the item is added to the predicted rating value.

This process can be observed below:

Figure 7: Adding bias to the Probabilistic Matrix Factorization Model

As can be seen above, the amount of change needed to add bias into the model is quite small.

An extra value for each user/item must be tracked and added to the result of each prediction made by the model.

These bias values will be trained alongside the actual embeddings to yield accurate predictions.

Adding bias to the model in PyTorch requires that two extra embeddings be created — one for the user bias and one for the item bias.

These embeddings will have a single column and each row will represent the bias value for the user or item with an ID corresponding to that row index.

These bias values, similarly to the embedding vectors, should then be found using the indices passed in the “cats” parameter in the forward method.

The implementation can be seen below:

Figure 8: Implementing a Probabilistic Matrix Factorization Model with bias

As seen in the above implementation, two new embeddings were added into the model for the user and item bias.
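A sketch of what the Figure 8 model might look like is shown below; as before, the class and variable names are my own assumptions:

```python
import torch
import torch.nn as nn

class BiasedMatrixFactorization(nn.Module):
    """Probabilistic matrix factorization with user and item bias (sketch of Figure 8)."""
    def __init__(self, n_users, n_items, n_factors=40):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        # Single-column embeddings: one bias value per user and per item
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, cats):
        users, items = cats[:, 0], cats[:, 1]
        dot = (self.user_emb(users) * self.item_emb(items)).sum(dim=1)
        # Add the user and item bias to each predicted rating
        return dot + self.user_bias(users).squeeze(1) + self.item_bias(items).squeeze(1)

model = BiasedMatrixFactorization(n_users=100, n_items=50)
preds = model(torch.tensor([[0, 3], [7, 21]]))
print(preds.shape)  # torch.Size([2])
```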

The values from these bias vectors were then added into the result of each prediction inside of the forward method.

Now, a complete probabilistic matrix factorization model has been implemented — this can be used to create very powerful recommendation systems that can add value to almost any business.

However, you might be thinking that this model looks pretty similar to the last one, so why do we bother adding bias into the model? Does it really make it any better? Why do we even need bias? We will first consider bias in terms of each user.

If a user has a very high bias value, what does this mean? Typically, users are assigned high bias values if they often give high ratings to items, while users that often assign low ratings are given lower bias values.

This makes sense, because a user that rates items very high on average should be biased towards higher ratings and vice versa.

In turn, learning a proper bias value allows the model to separate a user's tendencies in rating items from their actual item preferences.

Similarly, items that are given typically high ratings are assigned large biases, while items with low average ratings are given low values.

This allows items that are highly rated by users to be biased towards high ratings, as most users tend to enjoy such items, and vice versa.

This can be visualized in the following figure:

Figure 9: Bias values for different items. Highly-rated items are given high bias values and vice versa.

The average ratings of users or items create unneeded noise within the dataset.

By including bias, the model can learn to characterize users and items separately from such noise, allowing the user and item embeddings to more accurately reflect the properties and preferences of each user or item.

Creating a Music Recommendation System

Now that the model has been created, I am going to use it to create a music recommendation system.

For this project, I utilized a data set of users and musicians, where each user-musician pairing is assigned a value based on the amount of time a user has spent listening to an artist.

I then used the probabilistic matrix factorization model from Figure 8 to create a music recommender model from this dataset.

Simple EDA

I began this experiment by doing some simple exploratory data analysis (EDA) on the data set using pandas.

The first step in my analysis was examining a sample of the data and checking for any null values.

There were no null values present, and the data appeared as follows:

Figure 10: A sample from the original music data set

This data set appears to resemble a normal recommendation system dataset, which is (obviously) well suited for our model.

Each datum contains a userID, an artistID, and a value representing the user’s rating of that artist.

However, the weight values in this data set, representing the user-artist ratings, are quite large because they are determined by the number of minutes a user has spent listening to an artist.

A typical recommendation system dataset contains ratings between 1–5, binary ratings (i.e., 0 or 1), or something along these lines.

Therefore, the rating values in this data set should be normalized such that they are within a smaller range, which can be done as follows:

Figure 11: Normalizing weight values within the data set

After weight values were normalized, I began to examine the sparsity of the dataset, or the number of known ratings in comparison to the number of possible user-item combinations.
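The exact transformation used in Figure 11 is not shown here; as one illustrative possibility, the raw listening times could be min-max scaled into a smaller range such as [0, 60] (the data values below are made up for the example):

```python
import pandas as pd

# Hypothetical listening-time data with the same columns as the music data set
df = pd.DataFrame({
    "userID":   [2, 2, 3],
    "artistID": [51, 52, 51],
    "weight":   [13883.0, 11690.0, 260.0],
})

# Min-max scale the raw weights (minutes listened) into the range [0, 60]
lo, hi = df["weight"].min(), df["weight"].max()
df["weight"] = (df["weight"] - lo) / (hi - lo) * 60
print(df["weight"].min(), df["weight"].max())  # 0.0 60.0
```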

Originally, only 0.2% of possible ratings within the data set were known, which is quite sparse and could negatively impact the accuracy of the model.

Therefore, I chose to filter the data set to include only users that had rated at least five artists.

Figure 12: Eliminating users with fewer than five ratings from the data set

After the above operation was performed, the density of the data set increased to 1.3%, which, while still relatively sparse, will allow the matrix factorization model to make more accurate predictions.
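One way to perform this kind of filtering in pandas is sketched below (the exact code in Figure 12 may differ, and the data values are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "userID":   [1, 1, 1, 1, 1, 2],
    "artistID": [10, 11, 12, 13, 14, 10],
    "weight":   [1.0, 2.0, 3.0, 4.0, 5.0, 1.0],
})

# Count how many artists each user has rated, then keep users with at least five
counts = df.groupby("userID")["artistID"].transform("count")
df = df[counts >= 5]
print(sorted(df["userID"].unique()))  # [1]
```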

After these changes were made, there was one last change I wanted to make before fitting the model.

Within this data set, each user and artist is assigned a unique ID.

However, in order to make using the recommendation model easier, I wanted to make all of these IDs contiguous, such that they can be used to index into the embedding matrices.

This can be accomplished with the following code, which ensures that all users and artists are given a contiguous, unique ID:

Figure 13: Ensuring that user and item IDs are contiguous

This code creates a mapping, for both users and items, from the original ID to the new contiguous ID, such that all IDs fall within the range [0, total number of users/artists].

By performing this conversion, the IDs for both users and artists can be used as an index into embedding matrices for easy look up and quick predictions.
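One simple way to build such a mapping in pandas (not necessarily the approach in Figure 13) is to convert each ID column to categorical codes:

```python
import pandas as pd

df = pd.DataFrame({
    "userID":   [1007, 42, 1007],
    "artistID": [900, 17, 17],
})

# Map arbitrary IDs to contiguous integers starting at 0
df["userID"] = df["userID"].astype("category").cat.codes
df["artistID"] = df["artistID"].astype("category").cat.codes
print(df["userID"].tolist())    # [1, 0, 1]
print(df["artistID"].tolist())  # [1, 0, 0]
```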

Hyperparameter Selection/Tuning

Now that the data has been filtered and preprocessed, the recommendation model can actually be trained.

However, training the model requires setting a couple of hyperparameters properly: the learning rate and the latent dimensionality.

To determine the optimal learning rate, I utilized an adaptive learning rate selection technique described in the fast.ai deep learning course.

This technique trains the model for several iterations, increasing the learning rate used for updating the model’s parameters on every iteration.

The loss is then recorded for each iteration and displayed on a graph, as seen in the following figure:

Figure 14: Graph of the Loss (y-axis) vs. Learning Rate (x-axis)

In the above figure, the optimal initial learning rate is represented by the largest value for the learning rate before the loss begins to increase, which, in this case, was around 0.1.

Therefore, the learning rate was initially set to 0.1 when the model was trained.

Determining the optimal latent dimensionality was done through grid search using values of 20, 30, 40, 50, and 60.

After running three epochs of training with each of these embedding sizes, the results were as follows:

Figure 15: Grid Search for the Optimal Embedding Size

After performing the grid search, an embedding size of 40 was chosen for the music recommendation system, as it had the minimal validation loss after three epochs of training.
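The grid search itself amounts to a simple loop over candidate sizes. In the sketch below, `train_and_validate` is a placeholder for three epochs of training followed by validation, and the loss values are made-up numbers chosen only so the example runs (they are not the results from Figure 15):

```python
# Placeholder standing in for three epochs of training plus validation;
# the returned losses are fabricated for illustration only
def train_and_validate(n_factors):
    fake_losses = {20: 0.91, 30: 0.85, 40: 0.79, 50: 0.82, 60: 0.88}
    return fake_losses[n_factors]

# Try each candidate latent dimensionality and keep the best one
results = {k: train_and_validate(k) for k in (20, 30, 40, 50, 60)}
best = min(results, key=results.get)
print(best)  # 40
```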

Fitting the Model

Now that the hyper-parameters have been selected, the model can be trained.

Training was done three epochs at a time, and the learning rate was reduced by a factor of ~2 every three epochs until the model converged.

By gradually decreasing the learning rate, a simple learning rate scheduler was created that allowed the model to fine-tune its parameters and minimize loss as much as possible.
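This manual schedule can be sketched as below; the toy model and data are stand-ins for the actual recommender and its training set:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the recommender model and its training data
model = nn.Linear(4, 1)
x, y = torch.randn(32, 4), torch.randn(32, 1)
loss_fn = nn.MSELoss()

# Train three epochs at each learning rate, reducing it each round
schedule = [0.1, 0.05, 0.01, 0.005, 0.001]
for lr in schedule:
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(3):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
print(loss.item())  # final MSE on the toy data
```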

The loss for the model throughout the entire training process can be seen in the following figure:

Figure 16: Training the Recommender Model with Multiple Learning Rates

The model was trained for three epochs at each of the learning rates 0.1, 0.05, 0.01, 0.005, and 0.001, resulting in a final MSE loss of 0.75. In other words, all predictions made for a user-artist pair had an average error of about 0.86.
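The reported average error follows directly from the final loss: taking the square root of the MSE gives the typical size of a prediction error (the RMSE).

```python
import math

mse = 0.75
rmse = math.sqrt(mse)  # typical per-prediction error
print(round(rmse, 3))  # 0.866
```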

Given that all ratings in the training and testing datasets are within the range [0, ~60], an average error of 0.86 is relatively low, which hints that the model fit the data relatively well!

Extra Analysis

Although probabilistic matrix factorization works well for predicting user-item ratings, one of the most interesting aspects of the model is the embeddings that it creates to quantitatively describe each user and item.

These embeddings can be used to gain insights about users or products, as input into other machine learning models (such as a deep neural network), or even to determine which items users enjoy the most! Just for fun, I performed some extra analysis on the embeddings that were created through training this model, in order to see if they carried any interesting information.

More specifically, I examined the bias values produced for each artist in the dataset, which characterize, in general, all users’ preference for each artist.

After sorting the bias values along with their associated artists, the following result was obtained:

Figure 17: Final Bias Values for Artists

As can be seen, the artists with the highest bias values are well-known and successful musicians, such as Britney Spears and U2, thus demonstrating the usefulness of the information contained within the model's embeddings!

Conclusion

Thank you so much for reading, and I hope you now have a better understanding of recommendation systems, how they work, and how you can implement probabilistic matrix factorization yourself! If you are interested in exploring the extra details of this project, I encourage you to check out the GitHub repository that I created, which contains the full notebooks with all of the code used in implementing the recommendation system.

Additionally, feel free to follow me on LinkedIn or on Medium to stay updated with my future articles and work.
