Introduction to product recommender (with Apple’s Turi Create)

The previous approaches use only the existing ratings for items but, of course, not everyone has rated every movie.

To predict the missing ratings, we use latent features: the model learns how to predict ratings with the smallest error using a matrix factorisation approach (a lot of matrix calculus is involved here; anyway, it is math, not magic).
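To make the idea of latent features a bit more concrete, here is a minimal sketch of matrix factorisation trained with plain stochastic gradient descent. It is not what Turi Create does internally; the function names (`train_mf`, `predict`, `training_rmse`), the toy ratings and the hyperparameters are all assumptions made up for this illustration.

```python
import math
import random


def predict(p, q, user, item):
    """Predicted rating = dot product of the user and item latent vectors."""
    return sum(pu * qi for pu, qi in zip(p[user], q[item]))


def train_mf(ratings, num_factors=2, epochs=1000, lr=0.01, reg=0.02, seed=0):
    """Learn latent vectors from (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every latent feature.
    p = {u: [rng.uniform(0.0, 0.5) for _ in range(num_factors)] for u in users}
    q = {i: [rng.uniform(0.0, 0.5) for _ in range(num_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(num_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error plus L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def training_rmse(p, q, ratings):
    """Root mean squared error of the predictions on the given ratings."""
    squared = [(r - predict(p, q, u, i)) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: bob has not rated "aliens" yet.
toy_ratings = [
    ("ann", "alien", 5.0), ("ann", "aliens", 5.0), ("ann", "grease", 1.0),
    ("bob", "alien", 4.0), ("bob", "grease", 1.0),
    ("cat", "alien", 1.0), ("cat", "aliens", 1.0), ("cat", "grease", 5.0),
]

p, q = train_mf(toy_ratings)
print("training RMSE:", training_rmse(p, q, toy_ratings))
print("predicted rating for bob / aliens:", predict(p, q, "bob", "aliens"))
```

The point is that a rating nobody gave (bob on "aliens") still gets a prediction, because it is computed from the learned user and item vectors rather than looked up.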

So, now that we have some common ground, let’s see some code!

The Turi Create documentation is the starting point.

As you can see there, there is a lot of tooling for images and text, and there are “essential” tools like regression, classification and clustering too.

Of course, now we focus on recommender models.

First, the data:

    import turicreate as tc

    movies = tc.SFrame.read_csv("ml-latest-small/movies.csv", header=True, delimiter=',')
    movies

    Finished parsing file /home/antonello/Documents/py-notebooks/movielens_recommender/ml-latest-small/movies.csv
    Parsing completed. Parsed 100 lines in 0.036318 secs.
    ------------------------------------------------------
    Inferred types from first 100 line(s) of file as
    column_type_hints=[int,str,str]
    If parsing fails due to incorrect types, you can correct
    the inferred type list above and pass it to read_csv in
    the column_type_hints argument
    ------------------------------------------------------
    Finished parsing file /home/antonello/Documents/py-notebooks/movielens_recommender/ml-latest-small/movies.csv
    Parsing completed. Parsed 9742 lines in 0.03129 secs

Turi Create uses its own data type, called SFrame, similar to a Pandas DataFrame, and it gives some verbose output too.

And it’s possible to do some EDA with a single command!

    movies.show()

The summary is totally useless for movieId and it shows that a bunch of movies are duplicated (no big deal), but the count per genre is interesting.

Now the users' ratings:

    ratings = tc.SFrame.read_csv("ml-latest-small/ratings.csv", header=True, delimiter=',')
    ratings

    Finished parsing file /home/antonello/Documents/py-notebooks/movielens_recommender/ml-latest-small/ratings.csv
    Parsing completed. Parsed 100 lines in 0.049596 secs.
    ------------------------------------------------------
    Inferred types from first 100 line(s) of file as
    column_type_hints=[int,int,int,int]
    If parsing fails due to incorrect types, you can correct
    the inferred type list above and pass it to read_csv in
    the column_type_hints argument
    ------------------------------------------------------
    Finished parsing file /home/antonello/Documents/py-notebooks/movielens_recommender/ml-latest-small/ratings.csv
    Parsing completed. Parsed 100836 lines in 0.051093 secs.

    ratings['rating'].show()

OK, time to see the recommender in action.

Let’s start with the popularity recommender (not personalized: the same ranking is valid for everyone). Let’s see the top three movies (k=3) for the first five users, joining the title and genres too, for clarity.

    model = tc.recommender.popularity_recommender.create(ratings, user_id='userId', item_id='movieId', target='rating')
    most_popular = model.recommend(users=[1,2,3,4,5], k=3)
    most_popular = most_popular.join(right=movies, on={'movieId':'movieId'}, how='inner').sort(['userId','rank'], ascending=True)
    most_popular.print_rows(num_rows=15)

    Recsys training: model = popularity
    Warning: Ignoring columns timestamp;
        To use these columns in scoring predictions, use a model that allows the use of additional features.
    Preparing data set.
        Data has 100836 observations with 610 users and 9724 items.
        Data prepared in: 0.099973s
    100836 observations to process; with 9724 unique items.
    +--------+---------+-------+------+--------------------------------+
    | userId | movieId | score | rank | title                          |
    +--------+---------+-------+------+--------------------------------+
    | 1      | 6835    | 5.0   | 1    | Alien Contamination (1980)     |
    | 1      | 5746    | 5.0   | 2    | Galaxy of Terror (Quest) (...  |
    | 1      | 131724  | 5.0   | 3    | The Jinx: The Life and Dea...  |
    | 2      | 3851    | 5.0   | 1    | I'm the One That I Want (2000) |
    | 2      | 6835    | 5.0   | 2    | Alien Contamination (1980)     |
    | 2      | 5746    | 5.0   | 3    | Galaxy of Terror (Quest) (...  |
    | 3      | 1151    | 5.0   | 1    | Lesson Faust (1994)            |
    | 3      | 3851    | 5.0   | 2    | I'm the One That I Want (2000) |
    | 3      | 131724  | 5.0   | 3    | The Jinx: The Life and Dea...  |
    | 4      | 6835    | 5.0   | 1    | Alien Contamination (1980)     |
    | 4      | 5746    | 5.0   | 2    | Galaxy of Terror (Quest) (...  |
    | 4      | 131724  | 5.0   | 3    | The Jinx: The Life and Dea...  |
    | 5      | 6835    | 5.0   | 1    | Alien Contamination (1980)     |
    | 5      | 5746    | 5.0   | 2    | Galaxy of Terror (Quest) (...  |
    | 5      | 131724  | 5.0   | 3    | The Jinx: The Life and Dea...  |
    +--------+---------+-------+------+--------------------------------+
    +--------------------------------+
    | genres                         |
    +--------------------------------+
    | Action|Horror|Sci-Fi           |
    | Action|Horror|Mystery|Sci-Fi   |
    | Documentary                    |
    | Comedy                         |
    | Action|Horror|Sci-Fi           |
    | Action|Horror|Mystery|Sci-Fi   |
    | Animation|Comedy|Drama|Fantasy |
    | Comedy                         |
    | Documentary                    |
    | Action|Horror|Sci-Fi           |
    | Action|Horror|Mystery|Sci-Fi   |
    | Documentary                    |
    | Action|Horror|Sci-Fi           |
    | Action|Horror|Mystery|Sci-Fi   |
    | Documentary                    |
    +--------------------------------+
    [15 rows x 6 columns]

The results are slightly different for some users because, if someone has already rated a movie, it’s not proposed again.
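The behaviour just described can be sketched in a few lines of plain Python. This is only an illustration, assuming the popularity score is the mean rating of each item (which would match the 5.0 scores above); `popularity_recommend` and the toy ratings are hypothetical names and data, not part of Turi Create.

```python
from collections import defaultdict


def popularity_recommend(ratings, user, k=3):
    """Rank items by mean rating, skipping items the user already rated."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    seen = set()
    for u, item, rating in ratings:
        totals[item] += rating
        counts[item] += 1
        if u == user:
            seen.add(item)
    means = {item: totals[item] / counts[item] for item in totals}
    candidates = [item for item in means if item not in seen]
    # Highest mean first; ties are broken by item id for a stable order.
    candidates.sort(key=lambda item: (-means[item], item))
    return candidates[:k]


# Hypothetical (userId, movieId, rating) triples.
toy_ratings = [
    (1, 10, 5.0),
    (2, 10, 5.0), (2, 20, 4.0),
    (3, 20, 3.0), (3, 30, 4.5),
]

for user in (1, 2, 3):
    print(user, popularity_recommend(toy_ratings, user, k=2))
```

The ranking is the same for everyone; only the filter on already rated items makes the lists differ from user to user.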

Smart!

Let’s now try item-item similarity. This time we split the ratings into training and validation data, so we’ll be able to evaluate the model's performance.

    training_data, validation_data = tc.recommender.util.random_split_by_user(ratings, 'userId', 'movieId', item_test_proportion=0.2)
    model = tc.recommender.item_similarity_recommender.create(training_data, user_id='userId', item_id='movieId', target='rating')
    items_similarity = model.get_similar_items()

    Recsys training: model = item_similarity
    Warning: Ignoring columns timestamp;
        To use these columns in scoring predictions, use a model that allows the use of additional features.
    Preparing data set.
        Data has 80673 observations with 610 users and 8972 items.
        Data prepared in: 0.105496s
    Training model from provided data.
    Gathering per-item and per-user statistics.
    +--------------------------------+------------+
    | Elapsed Time (Item Statistics) | % Complete |
    +--------------------------------+------------+
    | 2.441ms                        | 100        |
    +--------------------------------+------------+
    Setting up lookup tables.
    Processing data in one pass using dense lookup tables.
    +-------------------------------------+------------------+-----------------+
    | Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
    +-------------------------------------+------------------+-----------------+
    | 311.493ms                           | 0                | 3               |
    | 1.47s                               | 100              | 8972            |
    +-------------------------------------+------------------+-----------------+
    Finalizing lookup tables.
    Generating candidate set for working with new users.
    Finished training in 1.51143s

Before evaluating the model with some serious KPIs, let’s test it empirically with a movie, “Alien” (movieId 1214):

    (items_similarity[(items_similarity['movieId'] == 1214)]).join(right=movies, on={'similar':'movieId'}, how='inner').sort('rank', ascending=True).print_rows()

    +---------+---------+---------------------+------+
    | movieId | similar | score               | rank |
    +---------+---------+---------------------+------+
    | 1214    | 1200    | 0.517241358757019   | 1    |
    | 1214    | 1097    | 0.3395061492919922  | 2    |
    | 1214    | 1089    | 0.33529412746429443 | 3    |
    | 1214    | 1210    | 0.32692307233810425 | 4    |
    | 1214    | 1198    | 0.3051643371582031  | 5    |
    | 1214    | 1136    | 0.2971428632736206  | 6    |
    | 1214    | 1387    | 0.28767120838165283 | 7    |
    | 1214    | 1653    | 0.2789115905761719  | 8    |
    | 1214    | 260     | 0.273809552192688   | 9    |
    | 1214    | 1213    | 0.273809552192688   | 10   |
    +---------+---------+---------------------+------+
    +-------------------------------+--------------------------------+
    | title                         | genres                         |
    +-------------------------------+--------------------------------+
    | Aliens (1986)                 | Action|Adventure|Horror|Sci-Fi |
    | E.T. the Extra-Terrestrial... | Children|Drama|Sci-Fi          |
    | Reservoir Dogs (1992)         | Crime|Mystery|Thriller         |
    | Star Wars: Episode VI - Re... | Action|Adventure|Sci-Fi        |
    | Raiders of the Lost Ark (I... | Action|Adventure               |
    | Monty Python and the Holy ... | Adventure|Comedy|Fantasy       |
    | Jaws (1975)                   | Action|Horror                  |
    | Gattaca (1997)                | Drama|Sci-Fi|Thriller          |
    | Star Wars: Episode IV - A ... | Action|Adventure|Sci-Fi        |
    | Goodfellas (1990)             | Crime|Drama                    |
    +-------------------------------+--------------------------------+
    [10 rows x 6 columns]

Awesome! Aliens ranks first, and other really good sci-fi and thriller movies are proposed (except Monty Python, which is a bit odd).
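To get a feeling for where such similarity scores can come from, here is a small sketch of item-item similarity in plain Python. It assumes Jaccard similarity (shared raters divided by all raters of either movie), which is only my assumption about the measure behind the scores above; `jaccard_similar_items` and the toy data are hypothetical.

```python
from collections import defaultdict


def jaccard_similar_items(ratings, item, k=3):
    """Return the k items whose sets of raters overlap most with `item`."""
    users_by_item = defaultdict(set)
    for user, rated_item, _rating in ratings:
        users_by_item[rated_item].add(user)
    target = users_by_item.get(item, set())
    scores = []
    for other, users in users_by_item.items():
        if other == item:
            continue
        union = len(target | users)
        # Jaccard: size of the intersection over size of the union.
        score = len(target & users) / union if union else 0.0
        scores.append((other, score))
    # Best score first; ties are broken by item id for a stable order.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Hypothetical (userId, movieId, rating) triples.
toy_ratings = [
    (1, 1214, 5.0), (1, 1200, 4.5),
    (2, 1214, 4.0), (2, 1200, 4.0),
    (3, 1214, 3.5), (3, 1, 4.0),
    (4, 1, 5.0),
]

print(jaccard_similar_items(toy_ratings, 1214))
```

Note that this measure looks only at who rated what, not at the rating values, which is one reason why two movies watched by the same crowd end up close even if their genres differ.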

To evaluate the model using RMSE (Root Mean Squared Error), there is just a single command to launch:

    model.evaluate(validation_data)

    Overall RMSE: 3.5057222562438963

Let’s now try the factorization approach and see how it performs:

    model = tc.recommender.ranking_factorization_recommender.create(training_data, user_id='userId', item_id='movieId', target='rating')
    results = model.recommend(k=3)

    Recsys training: model = ranking_factorization_recommender
    Preparing data set.
        Data has 80673 observations with 610 users and 8972 items.
        Data prepared in: 0.133147s
    Training ranking_factorization_recommender for recommendations.
    +------------------------+------------------------------------------+---------+
    | Parameter              | Description                              | Value   |
    +------------------------+------------------------------------------+---------+
    | num_factors            | Factor Dimension                         | 32      |
    | regularization         | L2 Regularization on Factors             | 1e-09   |
    | solver                 | Solver used for training                 | adagrad |
    | linear_regularization  | L2 Regularization on Linear Coefficients | 1e-09   |
    | ranking_regularization | Rank-based Regularization Weight         | 0.25    |
    | max_iterations         | Maximum Number of Iterations             | 25      |
    +------------------------+------------------------------------------+---------+
      Optimizing model using SGD; tuning step size.
      Using 10084 / 80673 points for tuning the step size.
    +---------+-------------------+---------------------------+
    | Attempt | Initial Step Size | Estimated Objective Value |
    +---------+-------------------+---------------------------+
    | 0       | 16.6667           | Not Viable                |
    | 1       | 4.16667           | Not Viable                |
    | 2       | 1.04167           | Not Viable                |
    | 3       | 0.260417          | Not Viable                |
    | 4       | 0.0651042         | 1.10129                   |
    | 5       | 0.0325521         | 1.53943                   |
    | 6       | 0.016276          | 1.89349                   |
    | 7       | 0.00813802        | 1.97929                   |
    +---------+-------------------+---------------------------+
    | Final   | 0.0651042         | 1.10129                   |
    +---------+-------------------+---------------------------+
    Starting Optimization.
    +---------+--------------+-------------------+-----------------------+-----------+
    | Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size |
    +---------+--------------+-------------------+-----------------------+-----------+
    | Initial | 226us        | 2.32822           | 1.08971               |           |
    +---------+--------------+-------------------+-----------------------+-----------+
    | 1       | 451.511ms    | 2.1881            | 1.16773               | 0.0651042 |
    | 2       | 939.443ms    | 1.90281           | 1.08491               | 0.0651042 |
    | 3       | 1.36s        | 1.76298           | 1.02327               | 0.0651042 |
    | 4       | 1.84s        | 1.65405           | 0.987049              | 0.0651042 |
    | 5       | 2.26s        | 1.5937            | 0.965333              | 0.0651042 |
    | 10      | 4.35s        | 1.43729           | 0.899279              | 0.0651042 |
    | 20      | 8.51s        | 1.1761            | 0.8083                | 0.0651042 |
    | 25      | 10.58s       | 1.07333           | 0.766321              | 0.0651042 |
    +---------+--------------+-------------------+-----------------------+-----------+
    Optimization Complete: Maximum number of passes through the data reached.
    Computing final objective value and training RMSE.
           Final objective value: 1.04304
           Final training RMSE: 0.734723

A lot of things are happening behind the curtains… The algorithm is trying to learn the latent features and minimize the error (RMSE), using stochastic gradient descent while tuning the learning rate (the step size).

We can see the final training RMSE is 0.73.

Let’s see some data…

    def join_titles(sframe, on):
        return sframe.join(right=movies, on=on, how='inner')

    results = join_titles(results, 'movieId')
    results.sort(['userId','rank'], ascending=True).print_rows(20)

    +--------+---------+--------------------+------+-------------------------------+
    | userId | movieId | score              | rank | title                         |
    +--------+---------+--------------------+------+-------------------------------+
    | 1      | 296     | 5.4481021343133005 | 1    | Pulp Fiction (1994)           |
    | 1      | 318     | 5.38932802065821   | 2    | Shawshank Redemption, The ... |
    | 1      | 858     | 5.369748759116844  | 3    | Godfather, The (1972)         |
    | 2      | 1198    | 4.917704129159317  | 1    | Raiders of the Lost Ark (I... |
    | 2      | 356     | 4.897511178195343  | 2    | Forrest Gump (1994)           |
    | 2      | 260     | 4.891662919461593  | 3    | Star Wars: Episode IV - A ... |
    | 3      | 541     | 5.270915883626655  | 1    | Blade Runner (1982)           |
    | 3      | 1394    | 5.122784393872932  | 2    | Raising Arizona (1987)        |
    | 3      | 50      | 4.789787578429893  | 3    | Usual Suspects, The (1995)    |
    | 4      | 1193    | 5.191378074731544  | 1    | One Flew Over the Cuckoo's... |
    | 4      | 318     | 5.121037292327598  | 2    | Shawshank Redemption, The ... |
    | 4      | 1247    | 5.0619253991505655 | 3    | Graduate, The (1967)          |
    | 5      | 2959    | 4.786239120211318  | 1    | Fight Club (1999)             |
    | 5      | 7361    | 4.614468338932708  | 2    | Eternal Sunshine of the Sp... |
    | 5      | 1193    | 4.590772422995284  | 3    | One Flew Over the Cuckoo's... |
    | 6      | 1197    | 4.813408214050211  | 1    | Princess Bride, The (1987)    |
    | 6      | 50      | 4.795824933248438  | 2    | Usual Suspects, The (1995)    |
    | 6      | 2858    | 4.793457645374216  | 3    | American Beauty (1999)        |
    | 7      | 1198    | 5.20988603219004   | 1    | Raiders of the Lost Ark (I... |
    | 7      | 2571    | 5.106401997651771  | 2    | Matrix, The (1999)            |
    +--------+---------+--------------------+------+-------------------------------+
    +-------------------------------+
    | genres                        |
    +-------------------------------+
    | Comedy|Crime|Drama|Thriller   |
    | Crime|Drama                   |
    | Crime|Drama                   |
    | Action|Adventure              |
    | Comedy|Drama|Romance|War      |
    | Action|Adventure|Sci-Fi       |
    | Action|Sci-Fi|Thriller        |
    | Comedy                        |
    | Crime|Mystery|Thriller        |
    | Drama                         |
    | Crime|Drama                   |
    | Comedy|Drama|Romance          |
    | Action|Crime|Drama|Thriller   |
    | Drama|Romance|Sci-Fi          |
    | Drama                         |
    | Action|Adventure|Comedy|Fa... |
    | Crime|Mystery|Thriller        |
    | Drama|Romance                 |
    | Action|Adventure              |
    | Action|Sci-Fi|Thriller        |
    +-------------------------------+
    [1830 rows x 6 columns]

Now there is a specific recommendation for every user.

Let’s evaluate the model on the validation data:

    model.evaluate(validation_data)

    'rmse_overall': 1.0967441224583008}

The value is much lower (so it’s better) than the 3.5 given by the item-item model.
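For reference, the RMSE itself is easy to compute by hand: it is the square root of the mean of the squared differences between the real and the predicted ratings. A minimal sketch, where the `rmse` helper and the sample numbers are made up for illustration:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical held-out ratings and model predictions.
actual = [5.0, 3.0, 4.0]
predicted = [4.0, 5.0, 4.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few badly wrong predictions weigh much more than many slightly wrong ones.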

Of course, this is just a baseline; there are more ways to improve it, for example by combining content-based and collaborative filtering models.

Turi Create is not the only library to try: there are Surprise (Simple Python RecommendatIon System Engine) and Apache PredictionIO (and probably many more).

Knowing the basics, it will be easier to test and compare different solutions.

And that’s all for this basic intro to product recommenders… see you next time!
