This is a common problem in NLP but thankfully it has an easy fix: smoothing.

This technique consists of adding a constant to each count in the P(w_i|c) formula. The most basic variant is add-one (Laplace) smoothing, where the constant is just 1.
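For reference, the add-one estimate is usually written as below. The notation is an assumption borrowed from the standard textbook formulation: count(w_i, c) is the number of times word w_i occurs in documents of class c, and V is the vocabulary.

```latex
\hat{P}(w_i \mid c)
  = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \bigl(\mathrm{count}(w, c) + 1\bigr)}
  = \frac{\mathrm{count}(w_i, c) + 1}{\Bigl(\sum_{w \in V} \mathrm{count}(w, c)\Bigr) + |V|}
```

With a general smoothing constant alpha, the 1 in the numerator becomes alpha and the |V| in the denominator becomes alpha times |V|.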

Add-one/Laplace smoothing

This solves the zero probabilities problem, and we will see later just how much it impacts the accuracy of our model.

Implementation

We will implement our classifier in the form of a NaiveBayesClassifier class.

We will split the algorithm into two essential parts: training and classifying.

Training

In this phase we provide our classifier with a (preferably) large corpus of text, denoted as D, from which it computes all the counts needed for the two terms of the reformulated equation.

Pseudocode for Naive Bayes training

When implementing, although the pseudocode starts with a loop over all classes, we will begin by computing everything that doesn't depend on class c before the loop.

This is the case for N_doc, the vocabulary and the set of all classes.

Helper functions

Since the bigdoc (the concatenation of all documents belonging to a class) is required when computing the word counts, we also calculate it before the loop.
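A minimal sketch of these pre-loop computations could look as follows. The function name `compute_globals` and the whitespace tokenization are assumptions made for illustration, not the author's actual helpers.

```python
def compute_globals(documents, labels):
    """Compute the quantities that can be prepared before the class loop."""
    n_doc = len(documents)            # N_doc: total number of documents
    classes = set(labels)             # set of all classes
    # Vocabulary: every distinct token in the corpus (whitespace tokenization is assumed)
    vocabulary = {word for doc in documents for word in doc.split()}
    # bigdoc[c]: all tokens from all documents labeled with class c
    bigdoc = {c: [] for c in classes}
    for doc, label in zip(documents, labels):
        bigdoc[label].extend(doc.split())
    return n_doc, classes, vocabulary, bigdoc


n_doc, classes, vocabulary, bigdoc = compute_globals(
    ["good movie", "bad movie"], ["pos", "neg"]
)
```

Storing bigdoc as one token list per class keeps the later word counting a simple pass over a list.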

Train function

Within the loop we just follow the order given in the pseudocode.

First, we count the number of documents from D in class c.

Then we calculate the logprior for that particular class.

Next, we loop over our vocabulary to get the total number of words within class c.

Finally, we compute the log-likelihood of each word for class c, using smoothing to avoid zero probabilities (whose logarithm is undefined).
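The training steps described above can be sketched as a single function. This is an illustrative version under stated assumptions (whitespace tokenization, a free function named `train`, and `alpha > 0`), not the original implementation.

```python
import math
from collections import Counter


def train(documents, labels, alpha=1.0):
    """Return (logprior, loglikelihood, vocabulary). Assumes alpha > 0."""
    n_doc = len(documents)
    classes = set(labels)
    tokenized = [doc.split() for doc in documents]
    vocabulary = {word for tokens in tokenized for word in tokens}

    logprior = {}
    loglikelihood = {c: {} for c in classes}
    for c in classes:
        # Number of documents in class c, then the log prior log(N_c / N_doc)
        n_c = sum(1 for label in labels if label == c)
        logprior[c] = math.log(n_c / n_doc)

        # Word counts over all documents of class c (the bigdoc of c)
        counts = Counter(
            word
            for tokens, label in zip(tokenized, labels)
            if label == c
            for word in tokens
        )
        total = sum(counts[word] for word in vocabulary)

        # Smoothed log-likelihood of every vocabulary word given class c
        denominator = total + alpha * len(vocabulary)
        for word in vocabulary:
            loglikelihood[c][word] = math.log((counts[word] + alpha) / denominator)
    return logprior, loglikelihood, vocabulary


logprior, loglikelihood, vocabulary = train(
    ["good good movie", "bad movie"], ["pos", "neg"]
)
```

In this toy corpus the vocabulary has 3 words and class "pos" contains 3 tokens, so the smoothed estimate for "good" given "pos" is (2 + 1) / (3 + 3) = 0.5.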

Classifying

When the training is done we have all the necessary values to make a prediction.

This simply consists of taking a new (unseen) document and computing the probabilities for each class that has been observed during training.

Pseudocode for the classification part

We initialize the sums dictionary where we will store the probabilities for each class.

We always compute the probabilities for all classes so naturally the function starts by making a loop over them.

For each class c we first add the logprior, the first term of our probability equation.

The second term requires us to loop over all words in the document and increment the current probability by the log-likelihood of each.

Prediction implementation

Once this is done, we can just get the key with the maximum value in our dictionary and voilà, we have a prediction.
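A minimal sketch of the prediction step is shown below. The function name `predict`, the whitespace tokenization, and the toy log-probabilities are all illustrative assumptions. Skipping words that are not in the vocabulary follows the standard formulation of the algorithm and is likewise an assumption about the original code.

```python
import math


def predict(document, logprior, loglikelihood, vocabulary):
    """Return the class with the highest summed log-probability."""
    sums = {}
    for c in logprior:
        # First term: the log prior of class c
        sums[c] = logprior[c]
        # Second term: add the log-likelihood of every known word
        for word in document.split():
            if word in vocabulary:  # words never seen in training are skipped
                sums[c] += loglikelihood[c][word]
    # Key with the maximum value
    return max(sums, key=sums.get)


# Hand-made toy values standing in for the output of training
logprior = {"pos": math.log(0.5), "neg": math.log(0.5)}
loglikelihood = {
    "pos": {"good": math.log(0.6), "bad": math.log(0.4)},
    "neg": {"good": math.log(0.2), "bad": math.log(0.8)},
}
vocabulary = {"good", "bad"}

prediction = predict("good good film", logprior, loglikelihood, vocabulary)
```

Summing logs instead of multiplying probabilities avoids numerical underflow on long documents.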

We are now ready to see Naive Bayes in action!

Data

We will test our model on a dataset with 1000 positive and 1000 negative movie reviews.

Each document is a review and consists of one or more sentences.

We split the data into a training set containing 90% of the reviews and a test set with the remaining 10%.

As the name implies, the former is used for training the model with our train function, while the latter will give us an idea of how well the model generalizes to unseen data.
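Such a split could be produced along the following lines. The helper `split_dataset`, its parameters, and the toy data are hypothetical, since the original helper is not shown.

```python
import random


def split_dataset(documents, labels, test_fraction=0.1, seed=0):
    """Shuffle (document, label) pairs and hold out a fraction for testing."""
    pairs = list(zip(documents, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible
    n_test = round(len(pairs) * test_fraction)
    test_pairs, train_pairs = pairs[:n_test], pairs[n_test:]
    train_docs = [doc for doc, _ in train_pairs]
    train_labels = [label for _, label in train_pairs]
    test_docs = [doc for doc, _ in test_pairs]
    test_labels = [label for _, label in test_pairs]
    return train_docs, train_labels, test_docs, test_labels


# Toy stand-in for the review corpus
documents = [f"review {i}" for i in range(20)]
labels = ["pos" if i % 2 == 0 else "neg" for i in range(20)]
train_docs, train_labels, test_docs, test_labels = split_dataset(documents, labels)
```

A plain shuffle like this does not guarantee that both classes are equally represented in the test set; a stratified split would be needed for that.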

Once that is done, we need some sort of baseline to compare the accuracy of our model with, otherwise we can’t really tell how well it is doing.

Since this is a binary classification task, we at least know that random guessing should net us an accuracy of around 50%, on average.

Anything close to this number is essentially random guessing.
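To make the baseline concrete, here is a small sketch that measures accuracy and simulates random guessing on a balanced label set. The function names `accuracy` and `random_baseline` are illustrative assumptions.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def random_baseline(labels, classes, seed=0):
    """Accuracy obtained by guessing a class uniformly at random."""
    rng = random.Random(seed)
    guesses = [rng.choice(classes) for _ in labels]
    return accuracy(guesses, labels)


# Balanced toy labels mirroring the 1000 positive / 1000 negative setup
labels = ["pos"] * 1000 + ["neg"] * 1000
baseline = random_baseline(labels, ["pos", "neg"])
print(f"Random baseline accuracy: {baseline:.3f}")
```

The exact number depends on the seed, but on a balanced dataset it stays close to 0.5.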

Results

Let’s take a look at the full implementation of the algorithm, from beginning to end.
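The original listing is not reproduced here. As a rough, self-contained sketch of what such an end-to-end run could look like, assuming a `NaiveBayesClassifier` class with `train` and `predict` methods, an `alpha` constructor argument, whitespace tokenization, and a tiny made-up dataset in place of the movie reviews:

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; this sketch assumes alpha > 0

    def train(self, documents, labels):
        # Per-class word counts (equivalent to counting over each class's bigdoc)
        word_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            word_counts[label].update(doc.split())
        self.vocabulary = {w for counts in word_counts.values() for w in counts}
        class_counts = Counter(labels)
        self.logprior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
        self.loglikelihood = {}
        for c, counts in word_counts.items():
            denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
            self.loglikelihood[c] = {
                w: math.log((counts[w] + self.alpha) / denominator)
                for w in self.vocabulary
            }

    def predict(self, document):
        known = [w for w in document.split() if w in self.vocabulary]
        sums = {
            c: self.logprior[c] + sum(self.loglikelihood[c][w] for w in known)
            for c in self.logprior
        }
        return max(sums, key=sums.get)


# Made-up toy data standing in for the review corpus
train_docs = ["a great and fun film", "great acting", "a dull and boring film", "boring plot"]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["great fun", "dull plot"]
test_labels = ["pos", "neg"]

classifier = NaiveBayesClassifier(alpha=1.0)
classifier.train(train_docs, train_labels)
correct = sum(classifier.predict(d) == y for d, y in zip(test_docs, test_labels))
print(f"Predicted correctly {correct} out of {len(test_docs)}")
```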

Yes, that’s it! All we had to do was create the classifier, train it and use the test set to check its accuracy.

I omitted the helper function to create the sets and labels used for training and validation.

Let’s see how our model does without smoothing, by setting alpha to 0 and running it:

Predicted correctly 101 out of 202 (50.0%)
Ran in 1.016 seconds

Eugh... that’s disappointing.

One would expect to do at the very least slightly better than chance even without smoothing.
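One plausible explanation for landing on exactly 50% (an assumption, since the source does not show what happens internally) is that without smoothing almost every review contains at least one word unseen in each class. That word's probability is zero, its log is negative infinity, and every class score collapses to the same value, so the same class gets predicted every time. The toy numbers below are made up to illustrate the effect.

```python
import math


def safe_log(p):
    # math.log(0) raises ValueError, so zero probability is mapped to -inf explicitly
    return math.log(p) if p > 0 else float("-inf")


# Hypothetical unsmoothed likelihoods: each class has never seen one of the words
likelihood = {
    "pos": {"great": 0.5, "awful": 0.0},
    "neg": {"great": 0.0, "awful": 0.5},
}
review = ["great", "awful"]

sums = {
    c: math.log(0.5) + sum(safe_log(likelihood[c][w]) for w in review)
    for c in ("pos", "neg")
}
# Both scores are -inf, so max() simply returns the first class it encounters
prediction = max(sums, key=sums.get)
```

On a balanced test set, always predicting the same class yields exactly half of the answers right.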

Let’s add smoothing:

Predicted correctly 167 out of 202 (82.67327%)
Ran in 0.961 seconds

Now that is some accuracy! Smoothing makes our model good enough to correctly classify at least 4 out of 5 reviews, a very nice result.

We also see that training and predicting together take about 1 second, which is a relatively low runtime for a dataset with 2000 reviews.

Conclusion

As we have seen, even a very basic implementation of the Naive Bayes algorithm can lead to surprisingly good results for the task of sentiment analysis.

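Notice that, as used here, the model is a binary classifier, meaning that it can be applied to any dataset in which we have two categories. Nothing in the algorithm is specific to two classes, though: since both training and prediction loop over all observed classes, it works the same way with more.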

There are all kinds of applications for it, ranging from spam detection to bitcoin trading based on sentiment.

With an accuracy of 82% there is really a lot that you could do. All you need is a labeled dataset and, of course, the larger it is, the better!

If you are interested in AI, feel free to check out my github: https://github.com/filipkny/MediumRare.

I’ll be putting the source code together with the data there so that you can test it out for yourself.

Thank you for reading :).