A “full-stack” data science project

As I mulled over this idea, I realized that other information about a book, such as ratings, reviews, and description, may also be valuable for running a few experiments.

The objective was to gain experience building a model from scratch.

I will try to follow the standard workflow for a data science project, as illustrated below.

After a bit of research, I decided to go with Goodreads as the source of my dataset, as the amount of information they had on their book pages was very comprehensive and in a mostly standard format.

I decided to ‘scrape’ the following pieces of information:

- Title
- Description
- Authors
- Edition
- Format
- ISBN
- No. of pages
- Rating
- No. of ratings
- No. of reviews
- Genres
- Book cover image

I have given an example of a book webpage from Goodreads, with the various pieces of information annotated below.

[Figure: An annotated image of a Goodreads page, highlighting the fields that will be extracted]

In the next section, we will take a detailed look at the steps involved in data collection.


1. Data collection

Since the objective is to build a collection of data for a number of books, I started by looking at the lists of books available here.

The “best books ever” list had more than 50,000 books, and I decided to use it as my source for training data.

In order to make the data collection easier, I broke up the process into three parts:

1.1 URL collection

The first step is to collect the URLs of the individual book pages by scraping the list of books.

[Figure: An example of a book “list” to be scraped]

One of the most popular and easy-to-use packages in Python to collect static data from web pages is BeautifulSoup.

It provides access to the various HTML elements of a page in a Pythonic way.

If the page content is dynamically generated using JavaScript though, Selenium will be a better, albeit slower, choice.

Fortunately for us, all of the content we need for our analysis is static on the Goodreads web pages.

The two most popular and powerful methods in the BeautifulSoup package are find and find_all.

As long as we can identify an element using the tag and the class, we can extract almost any information from a web page using these two methods.
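To make the tag-and-class idea concrete, here is a minimal sketch that pulls book links out of a small piece of HTML. It deliberately uses Python's built-in html.parser as a dependency-free stand-in rather than BeautifulSoup itself; with BeautifulSoup, the equivalent lookup would be along the lines of soup.find_all("a", class_="bookTitle"). The sample markup and the names BookLinkParser and extract_book_links are assumptions made for illustration, and the live Goodreads pages may be structured differently.

```python
from html.parser import HTMLParser

# Markup shaped like one row of a book list page (an assumption for this
# sketch; the real page may differ).
SAMPLE_ROW = """
<table>
  <tr>
    <td width="100%" valign="top">
      <a class="bookTitle" itemprop="url" href="/book/show/7613.Animal_Farm">
        <span itemprop="name">Animal Farm</span>
      </a>
    </td>
  </tr>
</table>
"""


class BookLinkParser(HTMLParser):
    """Collects the href of every <a> tag that carries a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes and attributes.get("href"):
            self.links.append(attributes["href"])


def extract_book_links(html, css_class="bookTitle"):
    """Return the relative URLs of all links tagged with css_class."""
    parser = BookLinkParser(css_class)
    parser.feed(html)
    return parser.links


book_links = extract_book_links(SAMPLE_ROW)
print(book_links)
```

The same pattern extends to other fields: identify the tag and class by inspecting the page, then filter on them.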

The approach I follow to collect data from web pages is to start “inspecting” the element that has the data to find out how it is structured in the HTML format.

For example, right-clicking the “Animal Farm” title in the list shown above and clicking Inspect shows that the book data on the page is arranged in a table format, and it is available in a table data tag, <td>:

    <td width="100%" valign="top">
      <a class="bookTitle" itemprop="url" href="/book/show/7613.Animal_Farm">
        <span itemprop="name">Animal Farm</span></a>

Thus, by accessing the <td> tag in each <tr> row of the table, we can extract the URLs using just a few lines of code.

1.2 Data collection

Now that we have the URLs collected in the books.csv file, the next step is to collect data on each book in the file.

This will be a long process, as scraping each book web page for data can consume a lot of time.

We should also keep in mind that the server should not be overloaded with our requests, so we should give enough time between requests to the web server.
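One way to space out requests and protect against losing work is sketched below. This is not the code used for the project; the function names, the two-second delay, and the save interval are all assumptions, and fetch is a hypothetical callable that takes a URL and returns a dict of fields (or None when a page cannot be parsed).

```python
import csv
import time


def save_rows(rows, out_path):
    """Write all rows collected so far to a CSV file (overwrites the file)."""
    if not rows:
        return
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def collect_with_delay(urls, fetch, out_path, delay_seconds=2.0, save_every=50):
    """Fetch each URL in turn, pausing between requests and saving periodically."""
    rows = []
    for index, url in enumerate(urls, start=1):
        record = fetch(url)  # hypothetical scraper for a single page
        if record is not None:
            rows.append({"url": url, **record})
        if index % save_every == 0:
            save_rows(rows, out_path)  # checkpoint in case of a crash
        if index < len(urls):
            time.sleep(delay_seconds)  # give the server room between requests
    save_rows(rows, out_path)
    return rows
```

A suitable delay depends on the site; the value here is only a placeholder.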

Unlike URL collection, each data element in the data collection process is declared differently in the HTML of the web page.

Thus, for each element, we need to figure out how it is structured, which tag and class to access and how to extract the data for our analysis.

Instead of walking through these elements one by one, I’ve given the HTML elements that were used to access them in the table below with an example.

[Table: Data attributes from the book web page and their corresponding HTML tags]

The code used to extract these data is available here.

1.3 Image collection

The next step is to collect the book covers.

This is relatively easy since we have already collected the image URLs of the book covers — all that is required is to simply download them.

Given below is the code I used to do the same, using the wget package.

I followed the same steps to collect data on the best books of 2018, which are found in the list here.

I used this as my test dataset, to evaluate any models I train on the data I collected on the best books ever.

A few best practices to note:

- Always make sure you leave enough time between your requests to the server, in order to avoid overloading it and getting kicked out.
- Save your work every few iterations.
- Not all data elements may be available on all web pages, so include checks in your code to account for this.

In order to make it easy to load and work with these two datasets, I have uploaded them to Kaggle. You can find them here and here. Also, the code used for data collection is on GitHub here.

In the next section we can explore the data to see the different characteristics, before starting to work on a specific classification or prediction task.


2. Data exploration

The notebook exploring the data is available on GitHub here.

Regardless of the data analysis you’re performing, or how well you think you know your data, it is always a good idea to take a look at it and be aware of the various characteristics before starting to work on a specific prediction or classification task.

Let us start by taking a look at the data that we have collected.

[Figure: The first 10 rows of the training dataset (Best books ever, from Goodreads)]

There are a few things that stand out from an initial view of the dataset:

- English is not the only language. There are other languages in the dataset, and this needs to be explored.
- A book can belong to multiple genres.

Let us start by looking at the genres first.

2.1 Genres

We should keep in mind that the genres on the Goodreads website are supplied by the users, and are not the “official” genres of the book given by the author or publisher.

This can lead to multiple genres which could be duplicates or trivial.

Let us first see the distribution of the number of genres tagged to each book.

We can see that the average number of genres per book is around 5 to 6, and the distribution is slightly right-skewed.

The next question: how many unique genres are there, and which ones are the most frequently occurring?

There are a whopping 866 unique genres in the dataset! I think this is due to the fact that they are supplied by the users, and a lot of genres may be misspelled repetitions.

Regardless, let us look at the top genres in the dataset.

There seems to be quite a long tail of genres even in the top 50 list. Interestingly, the number of nonfiction books is quite low (~7.5K) compared to fiction (~26K).

A majority of these genres fall under the fiction category (Fantasy, Romance, Young Adult, and different types of fiction, such as Science, Historical, Womens, Realistic, etc).
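The genre statistics discussed above (genres per book, number of unique genres, most frequent genres) can be computed with a few lines of standard-library Python. The records below are toy data, and the "|" separator and field names are assumptions; the actual dataset may store genres differently.

```python
from collections import Counter

# Toy records; the "|" separator and the sample genres are assumptions.
books = [
    {"title": "Book A", "genres": "Fantasy|Young Adult|Fiction"},
    {"title": "Book B", "genres": "Nonfiction|History"},
    {"title": "Book C", "genres": "Fiction|Romance"},
    {"title": "Book D", "genres": ""},
]


def split_genres(raw, separator="|"):
    """Turn a delimited genre string into a list, skipping empty entries."""
    return [genre.strip() for genre in raw.split(separator) if genre.strip()]


genre_lists = [split_genres(book["genres"]) for book in books]

# How many genres is each book tagged with?
genres_per_book = [len(genres) for genres in genre_lists]

# How often does each genre occur across the whole collection?
genre_counts = Counter(genre for genres in genre_lists for genre in genres)

print("Genres per book:", genres_per_book)
print("Unique genres:", len(genre_counts))
print("Most common:", genre_counts.most_common(3))
```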

Do the book covers reflect the genres that a book is tagged to? Let us examine.

2.2 Book covers

I took a random sample of book covers and examined the genres tagged to them.

As far as I can tell, it is quite tough to categorize the genre(s) of a book just from the book covers.

Of course, in some cases, we may know the author’s prior works (e.g., Stephen King and horror) and could guess the genre, but the book cover in isolation does not seem to give enough information to determine the genres.

Upon further research, I found that this has been the subject of a research paper, titled “Judging a Book By its Cover”, available here.

The authors of the paper did have some success in their task, but noted that:

“Many books have cover images with few visual features or ambiguous features causing for many incorrect predictions. While uncovering some of the design rules found by the CNN, we found that books can have also misleading covers. In addition, because books can be part of multiple genres, the CNN had a poor Top 1 performance.”

Knowing what we have gathered about the book covers so far, they probably won’t be part of our analysis.

Let us move on to the next parameter: language.


2.3 Language

The initial glimpse we had into the dataset told us that there were some non-English books included in our list.

Let us take a look at all the different languages available in our dataset.

The langdetect package helps figure out the language of a given piece of text.

This was highly useful in determining the different languages of books available in our dataset by using the description of the book (code below).

Plotting the no. of books for each language, we can see that almost 90% of the books are in English.

For our analysis purposes, we can probably remove the records that correspond to non-English books. Yet, for the sake of curiosity, let us examine the no. of non-English books in the dataset.

Quite a few European languages feature in this list. I am also intrigued by the fact that there are quite a few Indian languages in here too. Being from Chennai, India, I want to know which Tamil books are part of the best books ever.

Ponniyin Selvan, regarded as the greatest Tamil novel ever written, features on this list! Good to know :)

Next steps

Exploring the data helped us figure out that trying to predict the genres from the book covers may not be a very fruitful exercise (I may be wrong too!).

I will try to predict the genres using the description of the book.

This is probably quite a trivial task, but it would certainly help in understanding how to build a recurrent neural network from scratch and preparing data for the same.

The other point to note about the genres is that the high number of unique genres makes it a very unwieldy target parameter to predict.

It would make sense to classify a book as either fiction or nonfiction by preprocessing the data accordingly.

I will walk through these steps in the next sections.


3. Data cleaning

Although the data available on Goodreads is well structured and formatted for the most part, there were some issues with the dataset that needed to be fixed. Since the data of interest for our analysis is the book’s description and the associated genres, these were the issues that I encountered:

- Some of the records did not have any genres.
- Some of the records did not have valid descriptions.
- Some records did not have either ‘fiction’ or ‘nonfiction’ in the list of tagged genres.
- As we saw earlier, some records have a description in a language other than English.
- Some book descriptions had non-printable characters and formatting issues.

Uncovering these issues took a lot of trial and error, especially the formatting issues, which needed a close examination of the descriptions themselves. Nonetheless, I wrote helper functions for each of the steps mentioned above and ran them through the train and test datasets.

It seems like a significant number of records were dropped because they were tagged as neither fiction nor nonfiction.

When I examined such records I found that it was either because the tag was just missing because users did not tag them, or because they were of a neutral category, like poetry, among others.
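Two of the cleaning steps described above lend themselves to small helper functions. The versions below are illustrative sketches, not the helpers from the project: the function names are hypothetical, the genre match is an exact, case-insensitive comparison, and treating a book tagged as both fiction and nonfiction as ambiguous is an assumption of this sketch.

```python
def label_from_genres(genres):
    """Return 'fiction', 'nonfiction', or None when the tags are ambiguous."""
    normalized = {genre.strip().lower() for genre in genres}
    has_fiction = "fiction" in normalized
    has_nonfiction = "nonfiction" in normalized
    if has_fiction == has_nonfiction:
        return None  # tagged as neither, or as both (assumed ambiguous here)
    return "fiction" if has_fiction else "nonfiction"


def clean_description(text):
    """Replace non-printable characters with spaces and collapse whitespace."""
    printable = "".join(ch if ch.isprintable() else " " for ch in text)
    return " ".join(printable.split())


print(label_from_genres(["Fantasy", "Fiction"]))
print(label_from_genres(["Poetry"]))
print(clean_description("A\x00 tale\n\tof  two cities"))
```

Records for which label_from_genres returns None would then be dropped before training.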


4. Data preprocessing

Now that we have taken care of as many issues as we can in our data, we should get the data ready to be fed into a model.

The first step in accomplishing that is to make sure that all the data to be fed to the model is of the same format, and of the same length.

Naturally, all the descriptions of the books have varying lengths.

How do we determine the optimal length of a description to be fed to the neural network model?

4.1 Clipping and Padding

I am going to assume that if we are able to accommodate at least 80% of the book descriptions within a certain length, i.e., no. of words, then our model should perform reasonably well.

In order to determine this length, let us plot a cumulative histogram of the description lengths in our training dataset.

If you hover over the bars in the chart, you can observe that about 80% of the records fall below a word count of around 207. Let us take the maximum threshold word count as 200.

What does this mean? This means that for records where the description is shorter than 200 words, we will pad them with empty values, whereas for records where the description is longer than 200 words, we will clip them to include just the first 200.

In addition to the maximum threshold, we also need a minimum threshold to make sure that the descriptions have at least a few words to actually predict the genre.
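The 80% rule of thumb for the maximum length can also be computed directly, without reading it off a chart. The sketch below assumes that words are separated by whitespace; coverage_length is a hypothetical helper name, not part of the original code.

```python
import math


def coverage_length(descriptions, coverage=0.8):
    """Smallest word count that covers at least `coverage` of the descriptions."""
    lengths = sorted(len(text.split()) for text in descriptions)
    if not lengths:
        raise ValueError("need at least one description")
    # 1-based rank in the sorted list; the small tolerance guards against
    # floating-point noise in coverage * len(lengths).
    rank = max(1, math.ceil(coverage * len(lengths) - 1e-9))
    return lengths[rank - 1]


# Toy descriptions with 1, 2, ..., 10 words.
toy_descriptions = [" ".join(["word"] * n) for n in range(1, 11)]
print(coverage_length(toy_descriptions))
```

On the real training data, the text above reports a value of around 207 words for 80% coverage, which was rounded to a threshold of 200.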


4.2 Tokenization

Tokenization, as it pertains to recurrent neural networks, refers to the process of converting sequences of things, be it characters or words, into sequences of integers.

This simply means that for each word in our corpus of descriptions across the training and test datasets, we assign an integer that we will refer to as a ‘token’.

Thus, the input to the neural network will be a sequence of tokens representing the words which form the description.

Keras, the awesome deep learning library by François Chollet, has a predefined method for tokenizing sequences.

However, I wanted to build the tokenizer myself just for fun.

The first step in tokenization is to build a vocabulary of all words available in the descriptions.

Once this vocabulary is available, assigning tokens is a simple task of referring to the indices of the words in the vocabulary.

During the process of tokenizing we can also do the clipping and padding that we saw earlier.

A couple of things to note about padding:

- Since we will use the integer zero for padding, it is important to not assign any word to this token in the vocabulary.
- The recurrent neural network will read the token sequence left-to-right and will output a single prediction for whether the book is fiction or nonfiction. The memory of these tokens is passed on one by one to the final token, and thus, it is important to pre-pad the sequence instead of post-padding it. This means that the zeros are added BEFORE the token sequence and not after. There are situations where post-padding may be more effective, for example, in bi-directional networks.

Given below is the code I used to tokenize the descriptions into integer sequences of a fixed length.
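The original code is not reproduced in this text, so what follows is only a minimal sketch of the approach described above: build a vocabulary, reserve zero for padding, clip long descriptions, and pre-pad short ones. Lowercasing, splitting on whitespace, ordering the vocabulary by frequency, and skipping unknown words are all assumptions of this sketch.

```python
from collections import Counter

PAD_TOKEN = 0  # reserved for padding, never assigned to a word


def build_vocabulary(descriptions):
    """Map each word to an integer, starting at 1 so that 0 stays free."""
    counts = Counter(
        word for text in descriptions for word in text.lower().split()
    )
    return {
        word: index
        for index, (word, _) in enumerate(counts.most_common(), start=1)
    }


def tokenize(text, vocabulary, max_length=200):
    """Convert text to a fixed-length token list: clip long, pre-pad short."""
    tokens = [
        vocabulary[word] for word in text.lower().split() if word in vocabulary
    ]
    tokens = tokens[:max_length]  # keep only the first max_length words
    padding = [PAD_TOKEN] * (max_length - len(tokens))
    return padding + tokens  # zeros go BEFORE the tokens


vocabulary = build_vocabulary(["the cat sat", "the dog"])
print(tokenize("the dog sat", vocabulary, max_length=5))
print(tokenize("the cat sat the dog the cat", vocabulary, max_length=3))
```

Every returned sequence has exactly max_length entries, which is what a fixed-size model input requires.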


4.3 Training and validation data sets

Although creating training and validation sets is a fairly standard process, when the dataset is imbalanced, i.e., the distribution of the target variable is not uniform, we should make sure that the training-validation split is stratified.

This ensures that the distribution of the target variable is preserved in both the training and validation datasets.
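To show what stratification does, here is a small standard-library sketch that splits record indices class by class. The 20% validation fraction, the seed, and the function name are assumptions for illustration; in practice, scikit-learn's train_test_split with its stratify argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, validation_fraction=0.2, seed=42):
    """Return (train_indices, validation_indices) with class proportions kept."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, validation = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class
        cut = round(len(indices) * validation_fraction)
        validation.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(validation)


# Imbalanced toy labels: 80 fiction, 20 nonfiction.
toy_labels = ["fiction"] * 80 + ["nonfiction"] * 20
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

Because each class is cut separately, the validation set keeps the same fiction-to-nonfiction ratio as the full dataset.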


5. Model development

The next step is to build a recurrent neural network to process the tokenized descriptions and classify them as fiction or nonfiction.

I used the Keras library to build a sequential model for this purpose.

Keras is one of the best APIs available today for neural network development, noted especially for its user-friendliness and easy-to-understand structure.

The layers that I used in the sequential model are as follows (in that order):

5.1 Embedding

Before we go into details of an embedding layer, let us see why it is useful in this situation.

We saw earlier that we built a vocabulary of all available words in the descriptions and we converted the descriptions into sequences of tokens (integers).

Now, these integers cannot be passed to the neural network in their raw form, because the magnitudes of these integers don’t really have any meaning.

For example, the word corresponding to token 1 is ‘gremlin’ and the word for token 2 is ‘collaborated’.

This does not mean that ‘collaborated’ is twice the word that ‘gremlin’ is.

So, how do we handle this?

Such variables, called categorical variables, are usually passed into a machine learning model through one-hot encoding. For example, say we have 5 words in our vocabulary and we have tokenized them like so:

    [outbound, hearse, select, dogged, rowboats]
    [1, 2, 3, 4, 5]

where outbound corresponds to 1, hearse to 2 and so on. The one-hot encoded version of this tokenized representation would be:

    outbound: [1, 0, 0, 0, 0]
    hearse:   [0, 1, 0, 0, 0]
    select:   [0, 0, 1, 0, 0]
    dogged:   [0, 0, 0, 1, 0]
    rowboats: [0, 0, 0, 0, 1]

Thus, each word gets represented by a vector whose length is the total no. of words in the vocabulary.

This representation of a categorical variable works well if the no. of categories is low. In our case, the no. of categories is actually the no. of unique words in our vocabulary, which is more than 90000. And if I have 200 words in each record in the dataset, this means that my dataset becomes an n*90000*200 tensor, where n is the total no. of records.

The other problem with one-hot encoding when there are too many categories is that the representation becomes too sparse, i.e., there are too many zeroes compared to ones in the matrix.

Some algorithms may not work very well with sparse representations.
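The one-hot example and the size estimate above can be reproduced in a few lines. This is just a worked illustration using the five-word vocabulary from the text; the helper name one_hot is an assumption.

```python
vocabulary = ["outbound", "hearse", "select", "dogged", "rowboats"]
token_of = {word: index for index, word in enumerate(vocabulary, start=1)}


def one_hot(token, vocabulary_size):
    """Vector of zeros with a single 1 at the position of the 1-based token."""
    vector = [0] * vocabulary_size
    vector[token - 1] = 1
    return vector


for word in vocabulary:
    print(word, one_hot(token_of[word], len(vocabulary)))

# Values needed per record at the scale quoted in the text:
# 90,000 vocabulary entries for each of the 200 word positions.
vocabulary_size, sequence_length = 90_000, 200
cells_per_record = vocabulary_size * sequence_length
print(cells_per_record)  # 18,000,000 values, of which only 200 are ones
```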

So, we need an alternate way of representing our input.

This is where embeddings come in.

Simply put, an embedding layer ‘learns’ a fixed-length numerical representation of each input category.

The length of the embedding vector can be decided by the user.

In general, embeddings of higher length are able to learn more complex representations.

In the example we saw above, if we train an embedding layer of length 10, each of the five words is represented after training by a vector of 10 numbers.

Thus, if we train an embedding layer of length 200 to represent our vocabulary, the layer effectively reduces the dimension of our input from n*90000*200 to n*200*200.

This makes computation far easier, and our representations denser.
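Mechanically, an embedding layer is a lookup table: each integer token selects one row of a matrix, and the rows are adjusted during training. The sketch below shows only the lookup, with random starting values standing in for learned ones; the function names and the value range are assumptions, and no training happens here.

```python
import random


def make_embedding_table(vocabulary_size, embedding_length, seed=0):
    """Random starting vectors; row 0 is kept for the padding token."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-0.05, 0.05) for _ in range(embedding_length)]
        for _ in range(vocabulary_size + 1)
    ]


def embed(tokens, table):
    """Replace each integer token by its row in the embedding table."""
    return [table[token] for token in tokens]


table = make_embedding_table(vocabulary_size=5, embedding_length=10)
sequence = [0, 0, 1, 4, 3]  # a pre-padded token sequence
embedded = embed(sequence, table)
print(len(embedded), len(embedded[0]))

# Per-record size with length-200 embeddings versus one-hot vectors.
print(200 * 200, "values instead of", 90_000 * 200)
```

A sequence of 200 tokens therefore becomes a 200 x 200 block of dense numbers, rather than a 90000 x 200 block that is almost entirely zeros.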


5.2 Stacked LSTM

The next layer of our model is the meat of the recurrent neural network: the LSTM (long short-term memory) layer.

Simply put, an LSTM layer typically retains memory, regardless of how long the sequence may be.

How much it remembers is something it learns based on the relationship between the input sequences and the target variable.

In our case, we pass sequences of length 200 to the LSTM layer.

At each word in the sequence, the LSTM layer produces an output state, which is passed on to the next word, and optionally a hidden state that is passed on to another LSTM layer, if need be.

I found the concept of LSTM very complex initially, but this post by Christopher Olah explains it really well: “Understanding LSTM Networks” on colah's blog (colah.io).

In our example, I am using a 2-layer LSTM, where the first layer produces a hidden state sequence that is passed on to the second layer.


5.3 Fully connected

The fully connected (aka ‘Dense’) layer takes the output of the LSTM layer and maps it to a single target variable, through a sigmoid activation that compresses the input to a number between 0 and 1.
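The arithmetic behind such a layer is small enough to write out by hand: a weighted sum of the inputs plus a bias, passed through a sigmoid. The numbers below are made up for illustration, and mapping values of 0.5 and above to "fiction" is an assumption, since the text does not say which class is encoded as 1.

```python
import math


def sigmoid(value):
    """Squash any real number into the interval between 0 and 1."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)  # avoids overflow for very negative inputs
    return exp_value / (1.0 + exp_value)


def dense_sigmoid(features, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    total = sum(f * w for f, w in zip(features, weights)) + bias
    return sigmoid(total)


lstm_output = [0.2, -0.4, 0.7]  # made-up values for illustration
weights = [1.5, -2.0, 0.5]      # made-up values for illustration
probability = dense_sigmoid(lstm_output, weights, bias=0.1)

# Assumed convention: 1 means fiction, 0 means nonfiction.
label = "fiction" if probability >= 0.5 else "nonfiction"
print(round(probability, 3), label)
```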

A visual representation of the network architecture is given below for a better understanding.

The code to create the sequential model is given below.

The summary of the model, as given by Keras:

6. Model evaluation

The final step is training our model, evaluating against the validation set, and testing it with our test dataset.

The model was able to achieve more than 90% accuracy on the validation set with 2 epochs of training, and around 95% on the test dataset.

    Train on 23393 samples, validate on 5848 samples
    Epoch 1/2
    23393/23393 [==============================] - 162s 7ms/step - loss: 0.3532 - acc: 0.8542 - val_loss: 0.3033 - val_acc: 0.9020
    Epoch 2/2
    23393/23393 [==============================] - 160s 7ms/step - loss: 0.1660 - acc: 0.9393 - val_loss: 0.2347 - val_acc: 0.9152
    657/657 [==============================] - 13s 19ms/step
    [0.14756220990516913, 0.9482496194824962]

Conclusions

All the code used for data collection, exploration and model development is available on GitHub here: meetnaren/Goodreads-book-analysis (github.com).

I have also tried replicating the model development in PyTorch here.

PyTorch gives greater control over how the data is fed into the network and lets us define the operations that constitute a forward pass through the network.

For simple sequential models like the one described in this post though, Keras is an excellent option.
