What Book Should I Read Next?

The platforms I use should have plenty of data to help me solve this problem.

However, Libby — through the Boston Public Library — doesn’t provide any recommendations and the ones from Goodreads, an Amazon company where I store my reading history, aren’t very good.

I want an algorithm that presents me with targeted and relevant options and, as I’m not getting it through my current book providers, I figured I’d try to build it myself.

For the full project code in Python, please click here.

Data

The best data set that I was able to find contains the top 10,000 books from Goodreads along with ratings by user, the books that a user wants to read, and any tags associated with the book by the reader.

The data has its limitations.

We only have access to the 10,000 most popular books and can’t layer in additional information like genre or book description.

The data is as of 2017, so we miss out on newly published books.

That said, there’s certainly enough to build the framework for the algorithm.

I was also able to export my own Goodreads data. Not every book that I've read was in the top 10,000, but about 60 matched after restricting to the books that I wanted the algorithm to consider.

Approach and Critical Thinking

Based on the information available to us, I felt the best approach was to find the “most similar” users to me, then look for the most popular books (by average rating or number of times read) that they’ve read and I haven’t.

This method — looking for patterns among users and applying those patterns to make recommendations — is called collaborative filtering.

It’s essentially a simplified version of what Amazon would use when you see the “Customers who bought this item also bought” set of recommendations under a product.

Example product recommendations when searching for a cold brew pitcher.
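To make the collaborative filtering idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the function names, the `min_shared` threshold, and the toy ratings are assumptions for illustration only.

```python
from collections import Counter

def similar_users(my_books, ratings, min_shared=2):
    """Count how many of my books each user has rated; keep users at or above min_shared."""
    shared = Counter()
    for user_id, book_id, _rating in ratings:
        if book_id in my_books:
            shared[user_id] += 1
    return {user: n for user, n in shared.items() if n >= min_shared}

def candidate_books(my_books, ratings, neighbours):
    """Summarise the neighbours' ratings for books that I have not read."""
    totals = {}
    for user_id, book_id, rating in ratings:
        if user_id in neighbours and book_id not in my_books:
            count, total = totals.get(book_id, (0, 0))
            totals[book_id] = (count + 1, total + rating)
    return {
        book: {"times_read": count, "avg_rating": total / count}
        for book, (count, total) in totals.items()
    }

# Toy data (hypothetical): (user_id, book_id, rating)
ratings = [
    (1, "A", 5), (1, "B", 4), (1, "C", 5),
    (2, "A", 4), (2, "B", 5), (2, "C", 3), (2, "D", 4),
    (3, "A", 2), (3, "E", 5),
]
my_books = {"A", "B"}

neighbours = similar_users(my_books, ratings, min_shared=2)
recs = candidate_books(my_books, ratings, neighbours)
```

Users 1 and 2 share two books with me, so only their unread books (C and D) become candidates. User 3 shares just one book, so book E never enters the pool.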

I initially framed out the article to focus on the steps of collaborative filtering, but after I finished the first set of code and looked at the results, I realized there was so much beyond the simple filtering and coding to take into consideration.

Instead, the below will focus on how to use intuition and critical thinking to improve on initial results.

Weighting and Standardization of the Inputs

I realized that there needs to be a way to standardize the popularity of a book overall and within the relevant sample of similar readers.

Otherwise, the most read book by similar users could be unduly influenced by the popularity of the book overall.

This isn’t necessarily bad, but I wanted a way to adjust for sample vs.


To overcome this, I standardized the results: for each book, I divided the number of times similar users rated (read) it or marked it as “to read” by the total number of reviews for the book.

This ratio isn’t necessarily significant in any capacity other than helping us order the results and offset total popularity of a book.
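As a rough sketch of that ratio, assuming the numerator counts come from the similar-user sample and the denominator is the book's overall review count (the function name, field names, and numbers below are hypothetical):

```python
def read_ratio(sample_reads, sample_to_read, total_reviews):
    """Interest within the similar-user sample, relative to the book's overall popularity."""
    if total_reviews <= 0:
        return 0.0
    return (sample_reads + sample_to_read) / total_reviews

# Hypothetical counts: a blockbuster versus a niche book favoured by similar users.
books = {
    "Blockbuster": {"sample_reads": 300, "sample_to_read": 100, "total_reviews": 2_000_000},
    "Niche favorite": {"sample_reads": 80, "sample_to_read": 40, "total_reviews": 60_000},
}

ratios = {title: read_ratio(**counts) for title, counts in books.items()}
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

The blockbuster has more raw reads in the sample, but the niche book ranks first once overall popularity is divided out, which is the offset the ratio is meant to provide.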

Then, when choosing which book to read, do I want the most popular book (by this ratio) or the one with the highest rating? Should I give more weight to books by authors that I’ve already read and enjoyed? Or should the book with the most 5-star ratings be preferred? The short answer is that I chose to create a weighting of variables that matched my preferences.
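One way to picture such a weighting is a simple weighted sum. The sketch below reuses the parameter names and default values quoted in this section, but the scoring itself is an assumption: the article does not say how the inputs are scaled, so each one is min-max scaled to the 0-1 range here before the weights are applied.

```python
FEATURES = ["avg_rating", "read_ratio", "perc_4_5", "perc_1_2", "author_previous"]

def min_max(values):
    """Scale values to the 0-1 range; a constant column becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def score_books(books,
                rec_weight_avg_rating=0.5,
                rec_weight_read_ratio=0.4,
                rec_weight_perc_4_5=0.1,
                rec_weight_perc_1_2=-0.1,
                rec_weight_author_previous=0.1):
    """Return {title: weighted score}; scaling each feature first is an assumption."""
    weights = {
        "avg_rating": rec_weight_avg_rating,
        "read_ratio": rec_weight_read_ratio,
        "perc_4_5": rec_weight_perc_4_5,
        "perc_1_2": rec_weight_perc_1_2,
        "author_previous": rec_weight_author_previous,
    }
    scaled = {f: min_max([b[f] for b in books]) for f in FEATURES}
    return {
        b["title"]: sum(weights[f] * scaled[f][i] for f in FEATURES)
        for i, b in enumerate(books)
    }

# Hypothetical candidates
books = [
    {"title": "X", "avg_rating": 4.5, "read_ratio": 0.002,
     "perc_4_5": 0.9, "perc_1_2": 0.02, "author_previous": 1},
    {"title": "Y", "avg_rating": 4.0, "read_ratio": 0.001,
     "perc_4_5": 0.7, "perc_1_2": 0.10, "author_previous": 0},
]

scores = score_books(books)
best = max(scores, key=scores.get)
```

The negative weight on the share of 1- and 2-star ratings acts as a penalty, so a book that divides readers drops in the ranking even if its average rating is respectable.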

These could be changed in the call of the function if my preferences change:

rec_weight_avg_rating = 0.5,
rec_weight_read_ratio = 0.4,
rec_weight_perc_4_5 = 0.1,
rec_weight_perc_1_2 = -0.1,
rec_weight_author_previous = 0.1

Series, New Authors, and My Ratings

I noticed as I ran the algorithm that books that were part of a series dominated the results, even as I played around with the inputs.

It’s logical — when someone reads and enjoys book 1 in a series, they’re likely to read book 2.

If I’ve read book 1 too, those users are more likely to match with me.

I noticed a pattern for books in a series where the format is typically: Title, (Series Name, #[1, 2, 3, etc.]).


Therefore, a book in a series is easily identifiable by searching within the title for the “#” signal.

It’s not perfect, but it captures most.

So, in the call to the function, I added a “series” option that filters out all series books when the toggle is set to “No”.
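A minimal sketch of that check and toggle, assuming titles are plain strings. The helper names are hypothetical; only the “#” test and the 'Yes'/'No' values come from the article.

```python
def is_series(title):
    """Treat any title containing '#' as part of a series (imperfect, as noted above)."""
    return "#" in title

def filter_series(titles, series="Yes"):
    """Keep everything when series == 'Yes'; drop series books when series == 'No'."""
    if series == "No":
        return [t for t in titles if not is_series(t)]
    return list(titles)

titles = [
    "The Hunger Games (The Hunger Games, #1)",
    "The Martian",
    "Catching Fire (The Hunger Games, #2)",
]

standalone_only = filter_series(titles, series="No")
everything = filter_series(titles, series="Yes")
```

A standalone title that happens to contain “#” would be misclassified, which is the kind of edge case the “not perfect” caveat covers.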

Likewise, the early recommendations were largely from the same authors as I’ve already read — again, understandable given the weighting.

But what if I want to read a new author? This is an easy fix: I added a “new_author_only” toggle to the function call so that I can control whether I only see new authors.

Finally, after applying some of these factors, I noticed that my top recommendation was from a series that I had already read but didn’t particularly enjoy; I had rated most of its books a 3/5.

So, to offset this, I took the data that was used for the recommendation, searched each series within my own books, and if the average rating was less than a 4, I removed all instances from the data set.
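That clean-up step could look roughly like the following. The regular expression, helper names, and sample titles are assumptions; the only rule taken from the article is "drop a series when my average rating for it is below 4".

```python
import re
from collections import defaultdict

# Captures the series name from titles shaped like "Title (Series Name, #1)".
SERIES_PATTERN = re.compile(r"\(([^()]+?),?\s*#\d")

def series_name(title):
    """Return the series name embedded in a title, or None for standalone books."""
    match = SERIES_PATTERN.search(title)
    return match.group(1).strip() if match else None

def low_rated_series(my_ratings, min_avg=4):
    """Names of series whose books I rated below min_avg on average."""
    by_series = defaultdict(list)
    for title, rating in my_ratings.items():
        name = series_name(title)
        if name is not None:
            by_series[name].append(rating)
    return {name for name, rs in by_series.items() if sum(rs) / len(rs) < min_avg}

def drop_low_rated_series(candidates, my_ratings, min_avg=4):
    """Remove candidate titles that belong to a series I rated poorly."""
    blocked = low_rated_series(my_ratings, min_avg)
    return [t for t in candidates if series_name(t) not in blocked]

# Hypothetical reading history and candidate list
my_ratings = {
    "Divergent (Divergent, #1)": 3,
    "Insurgent (Divergent, #2)": 3,
    "The Fellowship of the Ring (The Lord of the Rings, #1)": 5,
}
candidates = [
    "Allegiant (Divergent, #3)",
    "The Two Towers (The Lord of the Rings, #2)",
    "The Martian",
]

kept = drop_low_rated_series(candidates, my_ratings)
```

Standalone books have no series name and always pass through, so the filter only ever removes titles tied to a series with a low personal average.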

Books that I Want to Read

I read a variety of types of books (fiction, non-fiction, business, etc.), but that doesn’t mean I want every kind of book I rated highly to be recommended to me.

If we had genre information available, I could specify the genre type of the recommendation that I want, but we don’t have that, so I had to make a workaround.

For the purposes of this exercise, I decided to create a new column and tag the books that I wanted to be considered within the scope of recommendation.

Only these books are used for matching, and the results are then adjusted for the other books that I’ve read but don’t want to drive recommendations.

Defining “Similar”

How many books do I have to have in common with a user to be considered similar? Five? Ten? 20? How many readers should be included? The results of the algorithm change depending on how we define similar.

I chose to use the 99th percentile of similar readers, which, given the data set, was 600+ people who collectively rated nearly 75,000 books.

Again, this percentile can be changed in the function call.
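To illustrate how a percentile cutoff turns "similar" into a concrete rule, here is a small sketch. It assumes each user already has a count of books in common with me and uses a nearest-rank quantile; the project may compute its quantile differently, and the names below are hypothetical.

```python
import math

def quantile_cutoff(counts, q):
    """Nearest-rank quantile: the smallest value with at least a share q of the data at or below it."""
    ordered = sorted(counts)
    # round() guards against float noise such as 0.29 * 100 == 28.999999999999996
    rank = max(1, math.ceil(round(q * len(ordered), 9)))
    return ordered[rank - 1]

def most_similar(shared_counts, q=0.99):
    """Keep users whose shared-book count is at or above the q-th quantile."""
    cutoff = quantile_cutoff(shared_counts.values(), q)
    return {user: n for user, n in shared_counts.items() if n >= cutoff}

# Toy data: 100 users sharing 1..100 books with me
shared_counts = {f"user_{i}": i for i in range(1, 101)}

top_1_percent = most_similar(shared_counts, q=0.99)
top_10_percent = most_similar(shared_counts, q=0.90)
```

Lowering the quantile widens the bucket of similar users, trading a tighter match for a larger pool of candidate books.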

Recommendation Outputs

Having layered in the above and more, what were the results? Let’s try a couple of calls to the function:

# Default call to the function
book_recommendation_system(my_books = my_books, all_books = all_books, ratings = ratings,
    to_read = to_read, series = 'Yes', new_author_only = 'No',
    number_of_similar_quantile = 0.99, english_only = 'Yes',
    rec_weight_avg_rating = 0.5, rec_weight_read_ratio = 0.4,
    rec_weight_perc_4_5 = 0.1, rec_weight_perc_1_2 = -0.1,
    rec_weight_author_previous = 0.1, return_dataset = 'No', num_similar_ratings = 50)

Basic call to the function

Test the algorithm for new authors only

Widen the bucket of similar users

Change the weights and widen the audience plus new authors only

Conclusion

My biggest takeaway from the project was:

Collaborative filtering, or any recommendation system, can’t be viewed in a box.

Layered on top must be the thinking and intuition that a computer can’t understand.

Flexibility in the system and algorithm is far more valuable than simply getting the code to run.

I thought that the project would start and end with the collaborative filtering code.

Instead, I spent far more time on follow-up work refining the system than on creating it in the first place.

Recommendations can be a powerful way to drive sales and satisfaction in a business.

That said, the topic has to be approached carefully and thoughtfully as bad recommendations can be frustrating and deter future sales.

Ultimately, helping someone discover a product or service that they were looking for — or didn’t even know they might want — can be a win for everyone involved and a fun challenge to solve as a data enthusiast.

