The chart on the left shows that the vast majority (>98%) of the reviews left in our dataset (after earlier removing entries with blank text) have more than 10 words.
We’ll use that as our lower cutoff.
Sorting what’s left by length in characters, we see the shortest reviews with more than 10 ‘words’.
Shortest reviews with >10 ‘words’.
Note that ‘words’ are everything between spaces.
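A minimal sketch of that filtering step, assuming the reviews are held in a plain list of strings (the names `word_count` and `filter_and_sort` are ours, not from the original code):

```python
def word_count(text):
    # 'Words' here are simply whatever sits between spaces.
    return len(text.split(" "))

def filter_and_sort(reviews, min_words=10):
    """Keep reviews with more than `min_words` words, shortest (in characters) first."""
    kept = [r for r in reviews if word_count(r) > min_words]
    return sorted(kept, key=len)

reviews = [
    "good",
    "a b c d e f g h i j k",
    "this one is a much longer review with quite a few more words in it",
]
shortest_first = filter_and_sort(reviews)
```

Because we split on single spaces, punctuation stays attached to words and repeated spaces produce empty 'words'; that is the same loose definition used above.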
Our next task is to construct a vocabulary from all of our reviews.
We’ll use a counter dictionary to count the frequency of each word in our dataset, and then we’ll take the most frequent 10,000 words to build our vocabulary, which we’ll use to ‘unk’ our dataset, i.e. replace every word not in our vocabulary with the word ‘unk’.
We also want to convert our dataset to an array of numbers, i.e. ‘tokens’, and we can do both of these tasks in the same function.
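The original function isn't reproduced here, so the following is only a sketch of how the two steps might fit together. It assumes token ids are frequency ranks computed after rare words have been folded into ‘unk’; `build_vocab` and the explicit `vocab` argument are our own hypothetical choices.

```python
from collections import Counter

UNK = "unk"

def build_vocab(texts, max_words=10_000):
    """Map each kept word to an integer id; 0 is the most frequent token."""
    counts = Counter(w for t in texts for w in t.split(" "))
    keep = {w for w, _ in counts.most_common(max_words)}
    # Re-count with rare words folded into 'unk' so it gets a rank of its own.
    unked = Counter(w if w in keep else UNK for t in texts for w in t.split(" "))
    vocab = {w: i for i, (w, _) in enumerate(unked.most_common())}
    vocab.setdefault(UNK, len(vocab))  # in case nothing was rare enough to 'unk'
    return vocab

def tokenize_text(text, vocab):
    """Replace out-of-vocabulary words with 'unk' and return integer ids."""
    return [vocab.get(w, vocab[UNK]) for w in text.split(" ")]

texts = ["the book was great", "the book was bad", "the narrator was great"]
vocab = build_vocab(texts, max_words=3)
tokens = tokenize_text("the book was dfalkjf", vocab)
```

In this toy corpus ‘unk’ ends up as the most frequent token; in a real dataset it lands wherever its frequency puts it.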
Testing our function on some sample text:

sample = 'i really loved dfalkjf especially the introduction'
print(tokenize_text(sample))

Unk ID: 24
[4, 56, 79, 24, 301, 1, 1190]

Notice that the unrecognized ‘dfalkjf’ was given the ‘unk’ token of 24.
The rest of the tokens correspond to the rank of the word in our vocabulary.
After padding, truncating and tokenizing, here’s what our data look like:

array([[ 24,    0,    0, ...,  24,  24,  24],
       [ 24,    9,   11, ...,  24,  24,  24],
       [149,  149,  149, ...,  24,  24,  24],
       ...,
       [131,   32,  873, ...,  24,  24,  24],
       [  5, 3312,  368, ...,  24,  24,  24],
       [172,  195,    1, ...,  24,  24,  24]])

Notice the columns of 24 at the end of each row; this is padding with the ‘unk’ token.
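As a rough illustration of the padding and truncating step, here is a sketch that pads at the end of each row with the ‘unk’ id, as in the output above. The helper name and the fixed `length` argument are assumptions; the actual sequence length used isn't stated here.

```python
def pad_or_truncate(tokens, length, pad_id):
    """Clip a token list to `length`, then pad the end with `pad_id` if it is short."""
    clipped = tokens[:length]
    return clipped + [pad_id] * (length - len(clipped))

UNK_ID = 24  # the 'unk' id reported above
short_row = pad_or_truncate([4, 56, 79], 5, UNK_ID)
long_row = pad_or_truncate([1, 2, 3, 4, 5, 6], 4, UNK_ID)
```

Every row comes out the same length, which is what lets us stack the dataset into a single 2-D array.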
Finally we create our binary labels from the ‘overall’ rating, and find that ~79% of reviews receive either a 4 or 5 star rating.
This number is important to note because it means that even the simplest model (one that always predicts 1) will earn a 79% accuracy.
That’s the number to beat.
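To make the baseline concrete, here is a small sketch that assumes a rating of 4 or 5 counts as positive. The ratings below are made up for illustration and are not drawn from our dataset.

```python
def binarize(ratings, positive_from=4):
    """Turn star ratings into binary labels: 1 for 4-5 stars, 0 otherwise."""
    return [1 if r >= positive_from else 0 for r in ratings]

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the more common class."""
    positive_rate = sum(labels) / len(labels)
    return max(positive_rate, 1 - positive_rate)

ratings = [5, 4, 3, 1, 5]          # toy data
labels = binarize(ratings)          # [1, 1, 0, 0, 1]
baseline = majority_baseline_accuracy(labels)
```

On our data the positive rate is ~79%, so this function would return roughly 0.79.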
Training the Model(s)

Now that the preparation step is out of the way, we can train our model.
Using TensorFlow and Keras layers we can try a number of different architectures with different numbers of parameters.
All of our models will have an embedding as the first layer, which turns each word into a vector of some length, a hyperparameter.
All our models will also have at least one RNN layer (specifically a long short-term memory or LSTM layer).
This layer will process each sequence in both the forward and backward directions.
In each case the LSTM layer will feed into a dense layer with a relu activation function and an output layer with a sigmoid activation function, which will produce a value between 0 and 1 that will be thresholded to deliver a class prediction.
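The final step, squashing the output to a value between 0 and 1 and thresholding it, can be sketched in plain Python. The 0.5 threshold is the conventional default and the function names are ours.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z, threshold=0.5):
    """Return 1 (positive) if the sigmoid output exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

positive = predict_class(2.0)
negative = predict_class(-2.0)
```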
Other layers we’ll add in will be dropout layers to reduce overfitting, including a special type of dropout layer after our embedding layer that drops entire 1-D feature maps (i.e. whole embedding dimensions across the sequence) rather than individual elements, and a 1-D convolutional layer that will learn a set of filters that pull out features from relationships between neighboring words.
We’ll also try stacking two layers of LSTMs.
We’ll make use of the binary crossentropy loss, the Adam optimizer, and we’ll employ the early stopping callback, which will stop training when the validation loss starts to increase.
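The early stopping rule itself is simple enough to sketch without any framework. This is a plain-Python illustration of the idea, not the Keras callback, and the loss values are invented for the example.

```python
def early_stop(val_losses):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops at the first epoch whose validation loss fails to
    improve on the best loss seen so far.
    """
    best_epoch = 1
    for epoch in range(2, len(val_losses) + 1):
        if val_losses[epoch - 1] < val_losses[best_epoch - 1]:
            best_epoch = epoch
        else:
            return best_epoch, epoch  # loss stopped improving
    return best_epoch, len(val_losses)

# Made-up validation losses that bottom out on the 5th epoch.
losses = [0.30, 0.22, 0.19, 0.18, 0.17, 0.175, 0.19]
best_epoch, stop_epoch = early_stop(losses)
```

In practice a callback usually also lets you wait a few epochs before giving up and restore the best weights afterwards.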
The training set and validation set accuracy for one of our best models.
The validation loss bottomed out on the 5th epoch and training was stopped early.
Results

The best-performing model was indeed the most complex of the bunch, with 93.8% accuracy on an unseen test set of ~155,000 reviews.
It’s worth noting, however, that the least complex model achieved a 93.
Our least complex model consisted of only three hidden layers: an embedding layer with embedding length of just 8, an LSTM with only 8 units, and a fully connected layer with just 16 units.
It had a total of 81k parameters, and took 53 minutes to train.
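As a sanity check on the 81k figure, we can count parameters by hand using the standard formulas for embedding, LSTM and dense layers. Two things are assumptions here: that the vocabulary size is exactly 10,000, and whether the LSTM is wrapped in a bidirectional layer, so the sketch computes both variants.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size = 10_000  # assumption: one embedding row per vocabulary word

# Single-direction LSTM: its 8 outputs feed the 16-unit dense layer.
one_way = (embedding_params(vocab_size, 8) + lstm_params(8, 8)
           + dense_params(8, 16) + dense_params(16, 1))

# Bidirectional LSTM: two LSTMs, and 16 concatenated outputs feed the dense layer.
two_way = (embedding_params(vocab_size, 8) + 2 * lstm_params(8, 8)
           + dense_params(16, 16) + dense_params(16, 1))

print(one_way, two_way)
```

Either way the total rounds to 81k, and almost all of it (80,000 parameters) sits in the embedding layer.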
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 8),
    ...
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')])

Our most complex model, and the winner overall, had 7 hidden layers including dropout, convolutional and pooling layers in addition to the LSTM layer:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    ...
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'),
    ...
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')])

For binary classification, the receiver operating characteristic (ROC) curve gives a good idea of the discriminatory power of a model.
It reflects the fact that as you decrease the threshold on your final probability outputs, you capture more true positives, but also more false positives.
A perfect model will assign higher probabilities to positive than to negative examples, and so decreasing the threshold will capture more positives without capturing more negatives.
Thus the curve will hew closely to the upper left corner.
The AUC summarizes this in a single number: the area under the ROC curve (the closer to 1 the better).
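To see how the curve and its area come about, here is a small pure-Python sketch that sweeps the threshold over a handful of made-up scores. In practice you would likely reach for a library routine; the function names here are ours.

```python
def roc_points(labels, scores):
    """ROC curve as (false positive rate, true positive rate) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]               # toy ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # toy model outputs
toy_auc = auc(roc_points(labels, scores))
perfect_auc = auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

The toy model ranks one negative above one positive, so its AUC falls short of 1; the perfectly separated example reaches exactly 1.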
Here our model has an AUC on test data of 0.
Let’s test our model by example and feed it a few pieces of text.
We’ll give it four sequences.
Less than 0.5, the model predicts negative; greater than 0.5, positive.
“This dreadful book was terrible. Really, it was just awful.”

Our model correctly scores this a 0.01, i.e. about as close to 0 as possible.
“I loved this book. It was outstanding, hilarious and amazing.”

This gets a 0.
How about something ambiguous:

“This book was okay, but it made me feel sad.”
Our model is torn.
How about combining both positive and negative terms:

“The story is really great, but the narrator is horrible.”
The model weighs the ‘really great’ section more heavily than the ‘narrator is horrible’ section.
Perhaps if we retrained our model on the ‘performance’ rating, we’d get a different result here.
Finally, let’s look at the word embeddings our model has learned.
The learned vector for each word in our vocabulary should reflect some useful information about that word for predicting positive or negative reviews.
In order to visualize the spatial relationships between these representations, we’ll need to reduce the word vectors to a more human-digestible number of dimensions.
Principal component analysis (PCA) is a method of transforming the data such that its most information-rich dimension (i.e. the dimension that contains the greatest variance) becomes axis-aligned (i.e. the first dimension of your result).
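To make the idea concrete, here is a dependency-free sketch that finds the first principal component by power iteration on the covariance matrix. This is a stand-in technique chosen for illustration; a library PCA routine would normally be used, and the toy data below is invented.

```python
import math

def first_principal_component(rows, iters=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the direction of greatest variance,
    # provided the starting vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto that direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy 3-D points that mostly vary along the line y = 2x.
rows = [[-2, -4, 0.1], [-1, -2, -0.1], [0, 0, 0.0], [1, 2, 0.1], [2, 4, -0.1]]
direction, scores = first_principal_component(rows)
```

The recovered direction points along y = 2x, and each point's score is its position along that line, which is the 'first dimension of your result' described above.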
We’ll use PCA to reduce our 128-dimensional embedding vectors to 2 dimensions so that we may visualize the relationships between words:

A 2D representation (the first two principal components) of our learned word embeddings for 61 key words.
The first two principal components of 61 common words in our vocabulary generate the striking chart above.
Here generally favorable words, e.g. ‘fascinating’ and ‘excellent’, are rendered in blue; unfavorable words, e.g. ‘terrible’ and ‘boring’, are rendered in red; and neutral words, e.g. ‘acting’ and ‘book’, are rendered in black.
Our embeddings clearly reflect the polarity of these terms.
Even finer relationships seem to be represented.
For example, we intuit the relationships between close pairs ‘monotonous’ and ‘monotone’, ‘worse’ and ‘worst’, ‘tedious’ and ‘tiresome’, ‘sound’ and ‘narrator’, ‘audio’ and ‘quality’, and so on.
By repeating the process over the whole vocabulary and looking at just the first principal component, we can identify the words with the most positive and negative salience.
The word with the highest value is ‘wellspent’ [sic].
The word with the lowest value? ‘Refund’.
Finally, let’s generate predictions on the entire training set to find the ‘worst’ and ‘best’ reviews.
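Picking out the extremes is just a matter of finding the lowest and highest predicted scores. A sketch, assuming the reviews and their predictions sit in two parallel lists (the data here is invented):

```python
def worst_and_best(reviews, scores):
    """Return the reviews with the lowest and highest predicted scores."""
    worst = min(range(len(scores)), key=lambda i: scores[i])
    best = max(range(len(scores)), key=lambda i: scores[i])
    return reviews[worst], reviews[best]

toy_reviews = ["it was fine", "a waste of a credit", "loved every minute"]
toy_scores = [0.55, 0.02, 0.98]   # made-up model outputs
worst, best = worst_and_best(toy_reviews, toy_scores)
```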
The ‘worst’ review:

WASTE, WASTE, WASTE. SAVE YOUR CREDIT.

What would have made <title> better?
There is nothing that could make this book better. This book is not even as funny as some say. Why buy a zombie horror book that is funny?

What do you think your next listen will be?
Not sure, other than to say something with zombies and or an apocalypse, but definitely nothing by the author or narrator of this book.

What didn’t you like about <narrator>’s performance?
The sound of her voice. That way she made all the characters sound alike.

You didn’t love this book… but did it have any redeeming qualities?
Zero… Even worse, since it’s digital media and not a physical book I can’t even burn it to warm up. I have read terms of agreement that would have been a much better listenread [sic] than this book.

Any additional comments?
I’m a true fan of zombie and apocalypse stories in any format. This book though should not have been written let alone

The ‘best’ review:

AWESOME

Awesomeness awesomeness awesomeness awesomeness awesomeness <unk> awesomeness <unk> <unk> awesomeness awesomeness <unk> awesomeness <unk> awesomeness <unk> <unk> awesomeness awesomeness <unk> awesomeness <unk> awesomeness <unk> awesomeness awesomeness.