Unsupervised Classification Project: Building a Movie Recommender with Clustering Analysis and K-Means

Victor Roman, Mar 19

Introduction

The goal of this project is to find similarities within groups of people in order to build a movie recommendation system for users.

We are going to analyze a movie rating dataset to explore the characteristics that people share in their taste in movies, based on how they rate them.

The data comes from the MovieLens user rating dataset.

Dataset Overview

This dataset has two files; we will import and work with both of them.

    # Import libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.sparse import csr_matrix
    from mpl_toolkits.axes_grid1 import make_axes_locatable
    from sklearn.cluster import KMeans
    from sklearn.metrics import mean_squared_error
    import itertools
    from sklearn.metrics import silhouette_samples, silhouette_score
    %matplotlib inline

    # Import the movies dataset
    movies = pd.read_csv('ml-latest-small/movies.csv')
    movies.head()

    # Import the ratings dataset
    ratings = pd.read_csv('ml-latest-small/ratings.csv')
    ratings.head()

We want to find out how the dataset is structured and how many records each of these tables has.

    # Print the number of records and the total number of movies
    print('The dataset contains: ', len(ratings), ' ratings of ', len(movies), ' movies.')

Romance versus Science Fiction

We will start by considering a subset of users and discovering what their favourite genres are.

We will do this by defining a function that will calculate each user’s average rating for all science fiction and romance movies.
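As a minimal illustration of that calculation, the sketch below runs the same filter-then-group steps on a tiny hand-made table. The `toy_movies` and `toy_ratings` frames, their values, and the function name `average_rating_for_genre` are invented for this example; only the column names (`movieId`, `genres`, `userId`, `rating`) follow the MovieLens files.

```python
import pandas as pd

# Toy stand-ins for the MovieLens tables (values are made up for illustration)
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Romance|Drama', 'Sci-Fi|Action', 'Romance|Sci-Fi'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2],
    'movieId': [1, 2, 3, 1, 2],
    'rating':  [5.0, 2.0, 4.0, 1.0, 4.5],
})

def average_rating_for_genre(ratings, movies, genre):
    """Return each user's mean rating over the movies tagged with `genre`."""
    # regex=False makes the genre name match literally
    genre_ids = movies.loc[movies['genres'].str.contains(genre, regex=False), 'movieId']
    in_genre = ratings[ratings['movieId'].isin(genre_ids)]
    return in_genre.groupby('userId')['rating'].mean().round(2)

toy_romance = average_rating_for_genre(toy_ratings, toy_movies, 'Romance')
toy_scifi = average_rating_for_genre(toy_ratings, toy_movies, 'Sci-Fi')
print(toy_romance)
print(toy_scifi)
```

A movie tagged with both genres (movie 3 here) counts towards both averages, which is also how the genre filter on the real data behaves.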

    # Function to get the genre ratings
    def get_genre_ratings(ratings, movies, genres, column_names):
        genre_ratings = pd.DataFrame()
        for genre in genres:
            genre_movies = movies[movies['genres'].str.contains(genre)]
            avg_genre_votes_per_user = ratings[ratings['movieId'].isin(genre_movies['movieId'])].loc[:, ['userId', 'rating']].groupby(['userId'])['rating'].mean().round(2)
            genre_ratings = pd.concat([genre_ratings, avg_genre_votes_per_user], axis=1)
        genre_ratings.columns = column_names
        return genre_ratings

    # Calculate the average rating of romance and scifi movies
    genre_ratings = get_genre_ratings(ratings, movies, ['Romance', 'Sci-Fi'], ['avg_romance_rating', 'avg_scifi_rating'])
    genre_ratings.head()

In order to have a more delimited subset of people to study, we are going to bias our grouping to only get ratings from those users who like either romance or science fiction movies.

    # Function to get the biased dataset
    def bias_genre_rating_dataset(genre_ratings, score_limit_1, score_limit_2):
        biased_dataset = genre_ratings[((genre_ratings['avg_romance_rating'] < score_limit_1 - 0.2) & (genre_ratings['avg_scifi_rating'] > score_limit_2)) | ((genre_ratings['avg_scifi_rating'] < score_limit_1) & (genre_ratings['avg_romance_rating'] > score_limit_2))]
        biased_dataset = pd.concat([biased_dataset[:300], genre_ratings[:2]])
        biased_dataset = pd.DataFrame(biased_dataset.to_records())
        return biased_dataset

    # Bias the dataset
    biased_dataset = bias_genre_rating_dataset(genre_ratings, 3.2, 2.5)

    # Print the resulting number of records and the head of the dataset
    print("Number of records: ", len(biased_dataset))
    biased_dataset.head()

We can see that there are 183 records, and each one holds a user's average rating for romance and for science fiction movies.

Next, we will visualize the biased dataset in order to get a good overview of its characteristics.

    # Defining the scatterplot drawing function
    def draw_scatterplot(x_data, x_label, y_data, y_label):
        fig = plt.figure(figsize=(8,8))
        ax = fig.add_subplot(111)
        plt.xlim(0, 5)
        plt.ylim(0, 5)
        ax.set_xlabel(x_label)
        ax.set_ylabel(y_label)
        ax.scatter(x_data, y_data, s=30)

    # Plot the scatterplot
    draw_scatterplot(biased_dataset['avg_scifi_rating'], 'Avg scifi rating', biased_dataset['avg_romance_rating'], 'Avg romance rating')

The bias that we created previously is perfectly clear now.

We will take it to the next level by applying K-Means to break down the sample into two distinct groups.
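Before handing the work to scikit-learn, it may help to see what K-Means does internally. The block below is a minimal sketch of the basic iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function names, the random initialization, and the toy points are assumptions made for illustration; scikit-learn's `KMeans` uses a more careful initialization and several restarts.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point (Euclidean distance)."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Basic K-Means loop: assign points, recompute centroids, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(points, centroids)
        # Move each centroid to the mean of its points (keep it if its cluster is empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(points, centroids), centroids

# Two well-separated toy groups of (sci-fi, romance) averages, invented for the example
toy_points = np.array([[1.0, 4.5], [1.2, 4.0], [0.8, 4.2],
                       [4.5, 1.0], [4.0, 1.5], [4.2, 0.8]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The label numbers themselves are arbitrary (which group is called 0 and which 1 depends on the starting centroids); what matters is which points end up sharing a label.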

    # Let's turn our dataset into a list
    X = biased_dataset[['avg_scifi_rating','avg_romance_rating']].values

    # Import KMeans
    from sklearn.cluster import KMeans

    # Create an instance of KMeans to find two clusters
    kmeans_1 = KMeans(n_clusters=2)

    # Use fit_predict to cluster the dataset
    predictions = kmeans_1.fit_predict(X)

    # Defining the cluster plotting function
    def draw_clusters(biased_dataset, predictions, cmap='viridis'):
        fig = plt.figure(figsize=(8,8))
        ax = fig.add_subplot(111)
        plt.xlim(0, 5)
        plt.ylim(0, 5)
        ax.set_xlabel('Avg scifi rating')
        ax.set_ylabel('Avg romance rating')
        clustered = pd.concat([biased_dataset.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
        plt.scatter(clustered['avg_scifi_rating'], clustered['avg_romance_rating'], c=clustered['group'], s=20, cmap=cmap)

    # Plot
    draw_clusters(biased_dataset, predictions)

It is evident that the grouping logic is based on how each person rated romance movies.

People that averaged a rating on romance movies of 3 or higher will belong to one group, and people who averaged a rating of less than 3 will belong to the other.

We will now see what happens if we divide the dataset into three groups.

    # Create an instance of KMeans to find three clusters
    kmeans_2 = KMeans(n_clusters=3)

    # Use fit_predict to cluster the dataset
    predictions_2 = kmeans_2.fit_predict(X)

    # Plot
    draw_clusters(biased_dataset, predictions_2)

It is evident now that the science-fiction rating has started to come into play:

People who like sci-fi and romance belong to the yellow group.

People who like sci-fi but not romance belong to the green group.

People who like romance but not sci-fi belong to the purple group.

Let us see what happens if we add another group.

    # Create an instance of KMeans to find four clusters
    kmeans_3 = KMeans(n_clusters=4)

    # Use fit_predict to cluster the dataset
    predictions_3 = kmeans_3.fit_predict(X)

    # Plot
    draw_clusters(biased_dataset, predictions_3)

From this analysis we can see that the more groups we split our dataset into, the more similar the preferences of the people that belong to each group are.

Choosing the Right K Number of Clusters

As we discussed in the article “Unsupervised Machine Learning: Clustering Analysis”, choosing the right number of clusters is one of the key points of the K-Means algorithm.

There are several methods for finding this number:

Field knowledge

Business decision

Elbow method

In keeping with the motivation and nature of data science, the elbow method is the preferred option, as it relies on an analytical method backed by data to make the decision.

Elbow Method

The elbow method is used for determining the correct number of clusters in a dataset.

It works by plotting ascending values of K against the total error obtained when using that K.

The goal is to find the K beyond which adding another cluster no longer reduces the variance significantly. In this case, we will choose K = 3, where the elbow is located.

To better understand this method: when we talk about variance, we are referring to the error.

One way to calculate this error is:

First, subtract the centroid of each cluster from every point that belongs to it.

Then, square these differences (to get rid of the negative terms).

Finally, add up all those values to obtain the total error.
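The three steps above can be sketched in a few lines of NumPy. The function name `total_squared_error` and the toy points, labels and centroids are invented for this example; a fitted scikit-learn `KMeans` object reports the same kind of quantity as its `inertia_` attribute.

```python
import numpy as np

def total_squared_error(points, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    differences = points - centroids[labels]   # step 1: subtract each point's centroid
    squared = differences ** 2                 # step 2: square the differences
    return float(squared.sum())                # step 3: add everything up

# Toy data, invented for the example: two clusters of two points each
sse_points = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]])
sse_labels = np.array([0, 0, 1, 1])
sse_centroids = np.array([[1.0, 2.0], [6.0, 5.0]])
error_two_clusters = total_squared_error(sse_points, sse_labels, sse_centroids)

# The same points treated as a single cluster around their overall mean
one_centroid = sse_points.mean(axis=0, keepdims=True)
one_label = np.zeros(len(sse_points), dtype=int)
error_one_cluster = total_squared_error(sse_points, one_label, one_centroid)

print(error_two_clusters)  # 4.0
print(error_one_cluster)   # 38.0
```

Going from one cluster to two drops the error from 38.0 to 4.0 on this toy data, which is the kind of steep drop the elbow plot shows before the curve flattens.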

So, now we want to find out the right number of clusters for our dataset.

To do so, we are going to perform the elbow method over the possible values of K, which range between 1 and the number of elements in our dataset.

That way we will consider every possibility between the extreme cases:

If K = 1, there is only one group, which all the points belong to.

If K = the number of data points, each data point is a separate group.

    # Selecting our dataset to study
    df = biased_dataset[['avg_scifi_rating','avg_romance_rating']]

    # Choose the range of k values to test.
    # We added a stride of 5 to improve performance. We don't need to calculate the error for every k value
    possible_k_values = range(2, len(X)+1, 5)

    # Define function to calculate the clustering errors
    # (as written, it returns the silhouette score; the squared-error version is commented out)
    def clustering_errors(k, data):
        kmeans = KMeans(n_clusters=k).fit(data)
        predictions = kmeans.predict(data)
        #cluster_centers = kmeans.cluster_centers_
        # errors = [mean_squared_error(row, cluster_centers[cluster]) for row, cluster in zip(data.values, predictions)]
        # return sum(errors)
        silhouette_avg = silhouette_score(data, predictions)
        return silhouette_avg

    # Calculate error values for all k values we're interested in
    errors_per_k = [clustering_errors(k, X) for k in possible_k_values]

    # Plot each value of K vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    plt.plot(possible_k_values, errors_per_k)

    # Ticks and grid
    xticks = np.arange(min(possible_k_values), max(possible_k_values)+1, 5.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(errors_per_k), 2), max(errors_per_k), .05)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')

Looking at the plot, we can see that the best choices for K are 7, 22, 27 and 31.

Increasing the number of clusters beyond that range results in worse clusters according to the silhouette score.

We will choose K = 7, as it is the one that yields the best score and will be easier to visualize.
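Since the choice of K here rests on the silhouette score, a small sketch of how that score is built may be useful. For each point it compares `a`, the mean distance to the other members of its own cluster, with `b`, the smallest mean distance to the members of any other cluster, and takes `(b - a) / max(a, b)`. The function name `silhouette_sketch` and the toy points are assumptions for illustration; in practice `sklearn.metrics.silhouette_score` does this job.

```python
import numpy as np

def silhouette_sketch(points, labels):
    """Mean silhouette value over all points (Euclidean distance, needs >= 2 clusters)."""
    # Pairwise distance matrix, shape (n, n)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a point alone in its cluster scores 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy points, invented for the example: two tight pairs far apart
sil_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_score = silhouette_sketch(sil_points, np.array([0, 0, 1, 1]))  # pairs kept together
bad_score = silhouette_sketch(sil_points, np.array([0, 1, 0, 1]))   # pairs split apart
print(good_score, bad_score)
```

Scores close to 1 mean points sit well inside their own cluster, while negative scores mean points are on average closer to another cluster than to their own.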

    # Create an instance of KMeans to find seven clusters
    kmeans_4 = KMeans(n_clusters=7)

    # Use fit_predict to cluster the dataset
    predictions_4 = kmeans_4.fit_predict(X)

    # Plot
    draw_clusters(biased_dataset, predictions_4, cmap='Accent')

Adding Action to Our Analysis

Up to now, we have only analyzed romance and science-fiction movies.

Let us see what happens when we add another genre, Action movies, to our analysis.

    # Select our biased dataset and add action genre
    biased_dataset_3_genres = get_genre_ratings(ratings, movies, ['Romance','Sci-Fi', 'Action'], ['avg_romance_rating', 'avg_scifi_rating', 'avg_action_rating'])

    # Drop null values
    biased_dataset_3_genres = bias_genre_rating_dataset(biased_dataset_3_genres, 3.2, 2.5).dropna()

    # Print the number of records and the head of our dataset
    print("Number of records: ", len(biased_dataset_3_genres))
    biased_dataset_3_genres.head()

    # Turn dataset into a list
    X_with_action = biased_dataset_3_genres[['avg_scifi_rating', 'avg_romance_rating', 'avg_action_rating']].values

    # Create an instance of KMeans to find seven clusters
    kmeans_5 = KMeans(n_clusters=7)

    # Use fit_predict to cluster the dataset
    predictions_5 = kmeans_5.fit_predict(X_with_action)

    # Define 3d plotting function
    def draw_clusters_3d(biased_dataset_3, predictions):
        fig = plt.figure(figsize=(8,8))
        ax = fig.add_subplot(111)
        plt.xlim(0, 5)
        plt.ylim(0, 5)
        ax.set_xlabel('Avg scifi rating')
        ax.set_ylabel('Avg romance rating')
        clustered = pd.concat([biased_dataset_3.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
        colors = itertools.cycle(plt.rcParams["axes.prop_cycle"].by_key()["color"])
        for g in clustered.group.unique():
            color = next(colors)
            for index, point in clustered[clustered.group == g].iterrows():
                # Bigger dots for users whose average action rating is above 3
                if point['avg_action_rating'].astype(float) > 3:
                    size = 50
                else:
                    size = 15
                plt.scatter(point['avg_scifi_rating'], point['avg_romance_rating'], s=size, color=color)

    # Plot
    draw_clusters_3d(biased_dataset_3_genres, predictions_5)

Here, we are still using the x and y axes for the sci-fi and romance ratings.

In addition, we are plotting the size of the dot to represent the ratings of the action movies (the bigger the dot the higher the action rating).

We can see that with the addition of the action genre, the clustering varies significantly.

The more data we add to our k-means model, the more similar the preferences within each group will be.

The drawback is that with this plotting method we start losing the ability to visualize the clusters correctly when analysing three or more dimensions.

So, in the next section we will study another plotting method that can correctly visualize clusters of up to five dimensions.

Higher-Level Clustering

Now that we have seen and understood how the K-Means algorithm groups users by their movie genre preferences, we are going to take a broader view of the dataset and explore how users rate individual movies.

To do so, we will reshape the dataset into a table of ‘userId’ versus movie title, with the user ratings as values, as follows.

    # Merge the two tables then pivot so we have Users X Movies dataframe
    ratings_title = pd.merge(ratings, movies[['movieId', 'title']], on='movieId')
    user_movie_ratings = pd.pivot_table(ratings_title, index='userId', columns='title', values='rating')

    # Print the number of dimensions and a subset of the dataset
    print('dataset dimensions: ', user_movie_ratings.shape, '\n\nSubset example:')
    user_movie_ratings.iloc[:6, :10]

Having a look at this subset of the dataset, it is evident that there are a lot of ‘NaN’ values, as most of the users have not rated most of the movies.

Datasets with such a high proportion of ‘null’ values are called ‘sparse’ or ‘low-density’ datasets.

In order to deal with this issue, we will sort the dataset by the most-rated movies and by the users that have rated the most movies.

This gives us a much more ‘dense’ region at the top of the dataset.
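The sorting step relies on two helper functions, `get_most_rated_movies` and `get_users_who_rate_the_most`, whose bodies are not listed in this article. The block below is one possible sketch of them, written from the description above (keep the most-rated movies, then the users who rated the most of those). It is an assumption about their behaviour, not the author's own code, and it keeps the `userId` index unchanged, which the original helpers may handle differently. The toy table is invented for the example.

```python
import numpy as np
import pandas as pd

def get_most_rated_movies(user_movie_ratings, max_number_of_movies):
    """Keep the columns (movies) with the most ratings, most-rated first."""
    counts = user_movie_ratings.count(axis=0)   # non-NaN ratings per movie
    top = counts.sort_values(ascending=False).index[:max_number_of_movies]
    return user_movie_ratings.loc[:, top]

def get_users_who_rate_the_most(most_rated_movies, max_number_of_users):
    """Keep the rows (users) with the most ratings among the selected movies."""
    counts = most_rated_movies.count(axis=1)    # non-NaN ratings per user
    top = counts.sort_values(ascending=False).index[:max_number_of_users]
    return most_rated_movies.loc[top, :]

# Toy users x movies table (values invented for the example)
toy_table = pd.DataFrame(
    {'Movie A': [5.0, np.nan, 3.0, 4.0],
     'Movie B': [np.nan, np.nan, 2.0, np.nan],
     'Movie C': [4.0, np.nan, 1.0, np.nan]},
    index=pd.Index([10, 11, 12, 13], name='userId'))

toy_dense = get_users_who_rate_the_most(get_most_rated_movies(toy_table, 2), 2)
print(toy_dense)
```

On the toy table, the least-rated movie and the least-active users are dropped, leaving a small block with no missing values.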

    # Define the sorting by rating function
    # (get_most_rated_movies and get_users_who_rate_the_most are helper functions that are not listed in this article)
    def sort_by_rating_density(user_movie_ratings, n_movies, n_users):
        most_rated_movies = get_most_rated_movies(user_movie_ratings, n_movies)
        most_rated_movies = get_users_who_rate_the_most(most_rated_movies, n_users)
        return most_rated_movies

    # Choose the number of movies and users and sort
    n_movies = 30
    n_users = 18
    most_rated_movies_users_selection = sort_by_rating_density(user_movie_ratings, n_movies, n_users)

    # Print the result
    print('dataset dimensions: ', most_rated_movies_users_selection.shape)
    most_rated_movies_users_selection.head()

Now, we will want to visualize it.

As we have a high number of dimensions and a lot of data to plot, the preferred method in this situation is the heatmap.

    # Define the plotting heatmap function
    def draw_movies_heatmap(most_rated_movies_users_selection, axis_labels=True):
        fig = plt.figure(figsize=(15,4))
        ax = plt.gca()

        # Draw heatmap
        heatmap = ax.imshow(most_rated_movies_users_selection, interpolation='nearest', vmin=0, vmax=5, aspect='auto')

        if axis_labels:
            ax.set_yticks(np.arange(most_rated_movies_users_selection.shape[0]), minor=False)
            ax.set_xticks(np.arange(most_rated_movies_users_selection.shape[1]), minor=False)
            ax.invert_yaxis()
            ax.xaxis.tick_top()
            labels = most_rated_movies_users_selection.columns.str[:40]
            ax.set_xticklabels(labels, minor=False)
            ax.set_yticklabels(most_rated_movies_users_selection.index, minor=False)
            plt.setp(ax.get_xticklabels(), rotation=90)
        else:
            ax.get_xaxis().set_visible(False)
            ax.get_yaxis().set_visible(False)

        ax.grid(False)
        ax.set_ylabel('User id')

        # Separate heatmap from color bar
        divider = make_axes_locatable(ax)
        cax = divider.append_axes("right", size="5%", pad=0.05)

        # Color bar
        cbar = fig.colorbar(heatmap, ticks=[5, 4, 3, 2, 1, 0], cax=cax)
        cbar.ax.set_yticklabels(['5 stars', '4 stars','3 stars','2 stars','1 stars','0 stars'])

        plt.show()

    # Print the heatmap
    draw_movies_heatmap(most_rated_movies_users_selection)

To understand this heatmap:

Each column is a different movie.

Each row is a different user.

The cell’s color is the rating that each user has given to each film.

The value for each color can be checked on the scale at the right.

The white values correspond to users that haven’t rated the movie.

In order to improve the performance of the model, we’ll only use ratings for 1000 movies.

    # Pivot the dataset and choose the first 1000 movies
    user_movie_ratings = pd.pivot_table(ratings_title, index='userId', columns='title', values='rating')
    most_rated_movies_1k = get_most_rated_movies(user_movie_ratings, 1000)

In addition, as the k-means algorithm does not deal well with sparse datasets, we will need to cast the data to the sparse csr matrix type defined in the SciPy library.

To do so, we first need to convert the dataset to a SparseDataFrame, and then use the to_coo() method in pandas to convert it to a sparse matrix.
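`pd.SparseDataFrame` was removed in pandas 1.0, so on current installations that conversion needs another route. The block below is a hedged alternative rather than the method used in this article: it fills the missing ratings with 0 and builds the csr matrix directly from the resulting array. It assumes that no real rating equals 0, so that 0 can safely stand for "not rated". The toy table is invented for the example.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy users x movies table with missing ratings (values invented for the example)
toy_ratings_table = pd.DataFrame(
    {'Movie A': [5.0, np.nan, 3.0],
     'Movie B': [np.nan, np.nan, 2.0],
     'Movie C': [4.0, 1.5, np.nan]})

# Assumption: no real rating equals 0, so 0 can stand in for "not rated"
toy_sparse = csr_matrix(toy_ratings_table.fillna(0).to_numpy())

print(toy_sparse.shape)  # (3, 3)
print(toy_sparse.nnz)    # 5 stored ratings
```

Only the five actual ratings are stored; the four missing cells take no space in the sparse matrix.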

    # Conversion to sparse csr matrix
    sparse_ratings = csr_matrix(pd.SparseDataFrame(most_rated_movies_1k).to_coo())

Massive Clustering

We will take an arbitrary number of clusters in order to analyze the results obtained and spot certain trends and commonalities within each group.

This number will be K = 20.

After that, we will plot each cluster as a heatmap.

    # 20 clusters
    # (newer scikit-learn releases name this algorithm 'lloyd' instead of 'full')
    predictions = KMeans(n_clusters=20, algorithm='full').fit_predict(sparse_ratings)

    # Select the max number of users and movies per heatmap cluster
    max_users = 70
    max_movies = 50

    # Cluster and print some of them
    # (draw_movie_clusters is a helper function that is not listed in this article)
    clustered = pd.concat([most_rated_movies_1k.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
    draw_movie_clusters(clustered, max_users, max_movies)

We can notice some things from these heatmaps:

The more vertical lines of the same color in a cluster, the more similar the ratings are in that cluster.

Some clusters are sparser than others, which shows that the algorithm also tends to group together people who watch and rate fewer movies.

Clusters tend to have a dominant color: yellowish if the users liked the movies they rated, and bluish if they didn’t.

Horizontal lines of the same color correspond to users with little variety in their ratings; they tend to like or dislike most of the movies.

Prediction

Now we will choose a cluster, analyze it, and try to make a prediction with it.

    # Pick a cluster ID from the clusters above
    cluster_number = 11

    # Let's filter to only see the region of the dataset with the most number of values
    n_users = 75
    n_movies = 300
    cluster = clustered[clustered.group == cluster_number].drop(['index', 'group'], axis=1)

    # Sort and print the cluster
    cluster = sort_by_rating_density(cluster, n_movies, n_users)
    draw_movies_heatmap(cluster, axis_labels=False)

And now we will show the ratings:

    # Print the ratings
    cluster.fillna('').head()

Now we will take one of the blank cells, which are movies that haven’t been rated by the users, and try to predict whether the user would have liked it or not.

Users are grouped in a cluster with other users who presumably have similar taste to theirs, so it is reasonable to assume that a user would have rated a blank movie with the average of the ratings given by the rest of the users in the cluster.

And that’s how we will proceed.

    # Fill in the name of the column/movie. e.g. 'Forrest Gump (1994)'
    movie_name = "Matrix, The (1999)"
    cluster[movie_name].mean()

Recommendations

Using the logic of the previous step, if we calculate the average score in the cluster for every movie, we will have an understanding of how the cluster feels about each movie in the dataset.

    # The average rating of 20 movies as rated by the users in the cluster
    cluster.mean().head(20)

This is really useful for us because we can use it as a recommendation engine that helps users discover movies they’re likely to enjoy.

When a user logs in to our app, we can now show them recommendations that are appropriate to their taste.

The formula for these recommendations is to select the cluster’s highest-rated movies that the user did not rate yet.
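That rule can be wrapped in a small reusable function. The sketch below is one way to do it under the assumption that a cluster is a users x movies DataFrame with NaN for unrated movies; the function name `recommend_from_cluster` and the toy cluster are invented for illustration.

```python
import numpy as np
import pandas as pd

def recommend_from_cluster(cluster_ratings, user_id, top_n=20):
    """Rank the movies `user_id` has not rated by the cluster's mean rating."""
    user_row = cluster_ratings.loc[user_id]
    unrated = user_row[user_row.isnull()].index   # movies with no rating from this user
    cluster_means = cluster_ratings.mean()        # NaN cells are skipped by default
    return cluster_means[unrated].dropna().sort_values(ascending=False).head(top_n)

# Toy cluster of three users (values invented for the example)
toy_cluster = pd.DataFrame(
    {'Movie A': [5.0, 4.0, np.nan],
     'Movie B': [np.nan, 2.0, 3.0],
     'Movie C': [3.5, np.nan, np.nan],
     'Movie D': [1.0, 5.0, 4.0]},
    index=[7, 19, 23])

toy_recommendations = recommend_from_cluster(toy_cluster, user_id=23)
print(toy_recommendations)
```

User 23 has not rated Movie A or Movie C, so those are the only candidates, ranked by how the rest of the cluster rated them.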

    # Pick a user ID from the dataset
    user_id = 19

    # Get all this user's ratings
    user_2_ratings = cluster.loc[user_id, :]

    # Which movies did they not rate?
    user_2_unrated_movies = user_2_ratings[user_2_ratings.isnull()]

    # What are the ratings of these movies the user did not rate?
    avg_ratings = pd.concat([user_2_unrated_movies, cluster.mean()], axis=1, join='inner').loc[:,0]

    # Let's sort by rating so the highest rated movies are presented first
    avg_ratings.sort_values(ascending=False)[:20]

These would be our top 20 recommendations for that user.
