Book Recommender Engines

Jen Hill · May 19

I decided to build a couple of recommender engines so I could better explore, behind the scenes, how they operate. I built two engines that each recommend books. One is collaborative and the other is content-based.
PART I: COLLABORATIVE ENGINE

With collaborative engines, recommendations are based on how users interact with products. This includes purchasing or rating an item, as well as watching or listening to media. I chose Book Crossing’s data set for this engine because it includes user ratings.
EXPLORATORY DATA ANALYSIS

Once I cleaned the data, I wanted to take a look at the ratings to see what the range of scoring looked like. The rating scores run along a scale of 1–10, and of the 123k books that were rated, these users appeared to have loved most of what they read. Eight is the most popular rating, and very few ratings fall below five.

Next, I wanted to take a look at what the most popular books were for these users. The Lovely Bones rated well above all the rest. This data set was created in 2004 and The Lovely Bones was released in 2002, so the timing fits with how popular the book was at the time.
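To make this EDA step concrete, here is a minimal pandas sketch of the kind of tallies described above. It uses a tiny invented ratings table; the column names (`user_id`, `title`, `rating`) and all values are illustrative stand-ins, not the real Book-Crossing schema or data.

```python
import pandas as pd

# Tiny made-up stand-in for the cleaned Book-Crossing ratings.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "title": ["The Lovely Bones", "Candide", "The Lovely Bones",
              "Interview with the Vampire", "The Lovely Bones",
              "Candide", "Interview with the Vampire"],
    "rating": [8, 7, 9, 8, 10, 5, 8],
})

# Distribution of scores across the 1-10 scale
score_counts = ratings["rating"].value_counts().sort_index()

# Most-rated titles
popular = ratings["title"].value_counts()
print(popular.head())
```

On the real data set, the same two `value_counts` calls produce the rating histogram and the most-popular-books chart.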
RECOMMENDERNext, I pulled a sample of my data based on how many books each user rated and ran it through the recommender.
I started with a sample set of users who had each rated more than 100 books.
Then I tested putting a variety of title names into the system to see what books would be recommended.
Here’s an example: I asked for books to be recommended based on Interview with the Vampire.
On the right are the recommendations.
I showcase three covers at the top of the list to call out that these three titles are also by Anne Rice, so it’s fitting for them to be recommended here.
Two are actually from the same series as Interview with the Vampire.
There are books in here that are unexpected to me, like the Seinfeld book or Candide.
However, my user set is actually pretty small, so it’s recommending books based on a very select audience.
I tried lowering the bar for my sample set from users who had rated more than 100 books to those who had rated more than 50, and my scores got worse. So I tried raising the threshold to more than 200 books reviewed, and my scores got better. This was unexpected, because intuitively a larger sample of users should give more balanced recommendations. It is likely that with this pool of users, going with fewer of them who each rate more books yields a more like-minded subset.
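The approach above can be sketched with a toy item-based example: pivot the ratings into a user-item matrix, then Pearson-correlate the query title's column against every other title's. This is a minimal illustration of the technique, not my exact pipeline, and every name and rating below is made up.

```python
import pandas as pd

# Toy ratings standing in for my sample of heavy raters.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "title": ["Interview with the Vampire", "The Vampire Lestat", "Candide",
              "Interview with the Vampire", "The Vampire Lestat", "The Lovely Bones",
              "Interview with the Vampire", "The Vampire Lestat", "Candide",
              "The Lovely Bones", "Candide"],
    "rating": [9, 8, 7, 8, 7, 5, 10, 10, 6, 6, 4],
})

# Rows = users, columns = titles, NaN where a user hasn't rated a title.
matrix = ratings.pivot_table(index="user_id", columns="title", values="rating")

# Correlate each title's column with the query title's column;
# titles rated similarly by the same users score highest.
query = matrix["Interview with the Vampire"]
similar = (matrix.corrwith(query)
                 .drop("Interview with the Vampire")
                 .sort_values(ascending=False))
print(similar)
```

Titles with too few overlapping raters come back as NaN, which is exactly why a small or unbalanced user sample skews the rankings.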
PART II: CONTENT ENGINE

With content-based engines, recommendations are made based on the features of the product.
With this in mind, I chose a Goodreads data set from Kaggle because it includes feature information for each title, such as genre and average rating.
EXPLORATORY DATA ANALYSIS

Once I cleaned the data, I wanted to take a look at the genre split. Romance and fantasy appear to dwarf most of the other genres, which was something I kept in mind when moving on to the model.
RECOMMENDER

I pulled a sample based on year published; in this case, all books published from 1900 onward.
As with the last engine, I tested putting a variety of title names into the system to see what books would be recommended.
However, I also wanted to take a look at some of the same books I tested before, such as Interview with the Vampire. This time, it didn’t recommend any other Anne Rice books. In fact, there’s a picture book listed, If You Give a Mouse a Cookie, which isn’t exactly age appropriate; I would not recommend a picture book to someone reading Interview with the Vampire.
Seeing that, I backed up and adjusted what columns of data I was pulling into the model.
I tested a variety of fields from the number of ratings to pages to whether or not I included author or genre.
While I noticed some interesting changes in the results, such as more children’s picture books showing up when I removed page count as a feature, my best results continued to come from including all the book-specific data columns.
Still, this direction appears to be flawed based on what titles are being recommended.
The scores look great, but on closer inspection the results are not. The issue is that content-based engines need more data than this data set provides.
Having book descriptions would be a good start.
Fuller content paints a better picture of what the book is actually about.
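To illustrate why thin features mislead here, below is a minimal sketch of content-based similarity over columns like the ones I used (genre, average rating, page count). All titles, genre labels, and numbers are invented, and this is a simplified stand-in for the model I actually ran: it one-hot encodes genre, min-max scales the numeric columns, and ranks by cosine similarity.

```python
import numpy as np
import pandas as pd

# Invented feature table standing in for the Goodreads columns.
books = pd.DataFrame({
    "title": ["Interview with the Vampire", "The Vampire Lestat",
              "If You Give a Mouse a Cookie", "Candide"],
    "genre": ["horror", "horror", "childrens", "satire"],
    "avg_rating": [3.9, 4.0, 4.2, 3.8],
    "pages": [371, 481, 40, 144],
})

# One-hot the genre; min-max scale numerics so no feature dominates.
features = pd.get_dummies(books["genre"])
for col in ["avg_rating", "pages"]:
    vals = books[col]
    features[col] = (vals - vals.min()) / (vals.max() - vals.min())

X = features.to_numpy(dtype=float)

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every book against the first title by cosine similarity.
query = X[0]  # Interview with the Vampire
scores = [cosine_sim(query, row) for row in X]
ranked = books["title"].iloc[np.argsort(scores)[::-1]].tolist()
print(ranked)
```

With only a handful of shallow features, two very different books can still land close together in this space, which is the failure mode the picture-book recommendation exposed.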
CONCLUSION

Of these two engines, the collaborative engine gave me slightly better recommendations based on book subject matter than the content engine did.
The latter gave better scores, but looking closer I can see the subject matter doesn’t match up as well.
This does not mean I would rule out using a content-based engine though.
For this kind of engine, I would recommend pulling in more descriptive content and using Natural Language Processing to assess the relationship between the words and make recommendations based on that.
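As a sketch of that NLP direction, here is how TF-IDF over book descriptions could drive the similarity scoring, using scikit-learn. The blurbs below are invented for illustration, since the Goodreads set I used had no descriptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical blurbs standing in for scraped or API-sourced descriptions.
descriptions = {
    "Interview with the Vampire": "A vampire recounts his immortal life and dark hunger.",
    "The Vampire Lestat": "The vampire Lestat tells his own immortal story of dark power.",
    "If You Give a Mouse a Cookie": "A boy gives a mouse a cookie and chaos follows.",
}
titles = list(descriptions)

# TF-IDF turns each description into a weighted word vector;
# cosine similarity then scores how much vocabulary the blurbs share.
tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(descriptions.values())
sims = cosine_similarity(vectors[0], vectors).ravel()

# Best match for Interview with the Vampire, excluding itself.
best = max((s, t) for s, t in zip(sims[1:], titles[1:]))[1]
print(best)
```

Because the picture book shares no vocabulary with the vampire blurbs, its similarity drops to zero, which is exactly the kind of subject-matter signal the feature-only model was missing.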
NEXT STEPS

I do see great potential in these engines. Had I more time for this project, the next steps I would have taken and recommend considering are the following:

- Collect a larger master data set to test the models on. My collaborative engine needed more user ratings to be balanced, while my content engine needed more descriptive content, so a bigger, more complete data set is needed. One can be collected via web scraping or an API that includes product descriptions and user reviews.
- Look into importing foreign language packages, which I also ran out of time to do. How foreign titles get addressed in the data will need to be considered: either import a package that can handle foreign characters, or remove those titles from the data set.
- Test the author-related data in the Goodreads set that I didn’t get a chance to use, such as author genre and average rating, to see whether it helps produce better recommendations.
- Explore options for adding a web-based front end to the engine with a consumer-friendly interface.
Full code for this project is available on GitHub.
Cheers!