Naive Bayes Algorithm: Intuition and Implementation

Naive Bayes Algorithm: Intuition and ImplementationVictor RomanBlockedUnblockFollowFollowingFeb 7Introduction: What Are Naive Bayes Models?In a broad sense, Naive Bayes models are a special kind of classification machine learning algorithms.

They are based on a statistical classification technique called ‘Bayes Theorem’.

Naive Bayes model are called ‘naive’ algorithms becaused they make an assumption that the predictor variables are independent from each other.

In other words, that the presence of a certain feature ina dataset is completely unrelated to the presence of any other feature.

They provide aneasy way to build accurate models with very good performance given their simplicity.

They do this by providing a way to calculate the ‘posterior’ probability of a certain event A to occur, given some probabilities of ‘prior’ events.

ExampleWe will introduce the main concepts regarding Navive Bayes algorithm, by studying an example:Let’s consider the case of two colleagues that work in the same office: Alice and Bruno.

And we know that:Alice comes to the office 3 days a week.

Bruno comes to the office 1days a week.

This will be our ‘prior’ information.

We are at the office and we see passing across us someone very fast, so fast that we don’t know who the person is: Alice or Bruno.

Given the information that we have until know and assuming that they only work 4 days a week, the probabilities of the person seen to be either Alice or Bruno are:P(Alice) = 3/4 = 0.

75P(Bruno) = 1/4 = 0.

25When we saw the person passing by, we saw that he/she was wearing a red jacket.

We also know the following:Alice wears red 2 times a week.

Bruno wears red 3 times a week.

So for every workweek, that has 5 days, we can infer the following:The probability of Alice to wear red is → P(Red|Alice) = 2/5 = 0.

4The probability of Bruno to wear red is → P(Red|Bruno) = 3/5 = 0.

6This new probabilities will be the ‘posterior’ information.

So, with this information, who did we see passing by?Initially, we knew the probability of P(Alice) and P(Bruno), and later we infered the probabilties of P(Red|Alice) and P(Red|Bruno).

So, the real probabilities are:Formally, the previous graphic would be:Supervised Naive Bayes AlgorithmThe steps to perform in order to be able to use the Naive Bayes Algorithm to solve classification problems like the previous problem is:Convert dataset into a frequency tableCreates a likelihood table by finding the probabilities of the events to occur.

The Naive Bayes equation is used to compute the posterior probability of each class.

The class with the higher posterior probability is the outcome of the prediction.

Strengths and Weaknesses of Naive BayesThe main strengths are:Easy and quick way to predict classes, both in binary and multiclass classification problems.

In the cases that the independence assumption fits, the algorithm performs better compared to other classification models, even with less training data.

The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution.

This helps with problems derived from the curse of dimensionality and improve the performance.

+Whereas the main disadvantages of using this method are:Although they are pretty good classifiers, naive bayes are know to be poor estimators.

So the probability that outputs from it shouldn’t be taken very seriously.

The naive assumption of independence is very unlikely to match real-world data.

When the test data set has a feature that has not been observed in the training se, the model will assign a 0 probability to it and will be useless to make predictions.

One of the main methods to avoid this, is the smoothing technique, being the Laplace estimation one of the most popular ones.

Implementation Project: Spam DetectorCurrently, one of the principal applications of Machine Learning is spam detection.

Almost all of the most important email services provide a spam detector that automatically classifies spam and sends it to the “Junk Mail Box”.

In this project, we will develop a Naive Bayes model that classify SMS messages as spam or not.

It will be based on training data provided by us.

By doing previous research we find out that usually spam messages:Contain words like: ‘free’, ‘win’, ‘winner’, ‘cash’ and ‘prize.

In addition, they tend to have words written in all capitals and also tend to use a lot of exclamation marks.

This is a supervised binary classification problem, as the messages are either ‘Spam’ or ‘Not spam’ and we will feed a labelled dataset to train the model.

OverviewWe will perform the following steps:Understand the datasetData preprocessingIntroduction to Bag of Words (BoW) and Sci-kit implementationSplitting Dataset in training and testing setApply BoW to process our dataset.

Naive Bayes Implementation with sci-kit learnModel evaluationConclussionUnderstand the DatasetWe will be using a dataset form the UCI Machine Learning repository.

You can find the link here.

A preview of the data:The columns are not named, but as we can guess from reading them:The first column determines the class of the message: either ‘spam’ or ‘ham’.

The second column corresponds to the message’s content.

We will first import the dataset and rename the column names.

By doing some previous exploration wwe also find out that the dataset is separated by tab, so theactual separator is ‘ ’.

# Import Pandas libraryimport pandas as pd# Dataset from https://archive.

ics.

uci.

edu/ml/datasets/SMS+Spam+Collectiondf = pd.

read_table('smsspamcollection/SMSSpamCollection', sep=' ', names=['label','sms_message'])# Output printing out first 5 rowsdf.

head()Data PreprocessingNow, as sci-kit learn only handles numerical values as inputs, we will convert our labels to binary variables.

0 will represent ‘ham’ and 1 will represent ‘spam’.

To perform the conversion:# Conversiondf['label'] = df.

label.

map({'ham':0, 'spam':1})# Print dataset shapedf.

shape()Introduction to Bag of Words (BoW) and Sci-kit implementationOur dataset is a large collection fo text data (5572 rows).

As our model will only accept numerical data as input, we should process the text messages.

Here is where the Bag of Words comes into play.

Bag of Words is a term used to specify the problems that have a collection of text data that needs to be processed.

The idea is to take a piece of the text and count the frequency of the words in the text.

BoW treats each word indepently and the order isn’t relevant.

We can convert a collection of documents to a matrix, with each document being a row and each word(token) being the column, and the corresponding (row,column) values being the frequency of occurrence of each word or token in that document.

As ann example if we have the following 4 documents:['Hello, how are you!', 'Win money, win from home.

', 'Call me now', 'Hello, Call you tomorrow?']We will convert the text to a frequency distribution matrix as the following:The documents are numbered in the rows, and each word is a column name, with the corresponding value being the frequency of that word in the document.

We will use the sklearns CountVectorizer method which does the following:It tokenizes the string(separates the string into individual words) and gives an integer ID to each token.

It counts the occurrence of each of those tokens.

It automatically converts all tokenized words to their lower case form so that it does not treat words like ‘He’ and ‘he’ differently.

It also ignores all punctuation so that words followed by a punctuation mark (for example: ‘hello!’) are not treated differently than the same words not prefixed or suffixed by a punctuation mark (for example: ‘hello’).

The third parameter to take note of is the stop_words parameter.

Stop words refer to the most commonly used words in a language.

They include words like 'am', 'an', 'and', 'the' etc.

By setting this parameter value to english, CountVectorizer will automatically ignore all words(from our input text) that are found in the built in list of english stop words in scikit-learn.

The sci-kit learn implementation would be the following:# Define the documentsdocuments = ['Hello, how are you!', 'Win money, win from home.

', 'Call me now.

', 'Hello, Call hello you tomorrow?']# Import the count vectorizer and initialize itfrom sklearn.

feature_extraction.

text import CountVectorizercount_vector = CountVectorizer()# Print the 'count_vector' object which is an instance of 'CountVectorizer()'print(count_vector)To fit the document dataset to the CountVectorizer object created we will use the fit() methd, and get the list of words that have been categorized as features using the get_feature_names() method.

This method returns our feature names for this dataset, which is the set of words that make up our vocabulary for ‘documents’.

count_vector.

fit(documents)names = count_vector.

get_feature_names()namesNext, we want to create a matrix with the rows being each of the 4 documents, and the columns being each word.

The corresponding (row, column) value will be the frequency of occurrence of that word(in the column) in a particular document(in the row).

We can do this using the transform() method and passing in the document data set as the argument.

The transform() method returns a matrix of numpy integers, you can convert this to an array using toarray().

doc_array = count_vector.

transform(documents).

toarray()doc_arrayTo make it easier to understand, our next step is to convert this array into a dataframe and name the columns appropriately.

frequency_matrix = pd.

DataFrame(data=doc_array, columns=names)frequency_matrixWe have now successfully implemented a Bag of Words problem for a document dataset that we created.

One potential issue that can arise from using this method out of the box is the fact that if our dataset of text is extremely large, there will be certain values that are more common that others simply due to the structure of the language itself.

So for example words like ‘is’, ‘the’, ‘an’, pronouns, grammatical constructs etc could skew our matrix and affect our analyisTo mitigate this, we will use the stop_words parameter of the CountVectorizer class and set its value to english.

Splitting Dataset in Training and Testing SetsWe want to split our data so it has the following shape:X_train is our training data for the 'sms_message' column.

y_train is our training data for the 'label' columnX_test is our testing data for the 'sms_message' column.

y_test is our testing data for the 'label' column Print out the number of rows we have in each our training and testing data.

# split into training and testing setsfrom sklearn.

model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state=1)print('Number of rows in the total set: {}'.

format(df.

shape))print('Number of rows in the training set: {}'.

format(X_train.

shape))print('Number of rows in the test set: {}'.

format(X_test.

shape))Apply BoW to process our datasetNow that we have split the data, the next objective is toconvert our data into the desired matrix format.

To do this we will be using CountVectorizer() as we did before.

There are two steps to consider here:Firstly, we have to fit our training data (X_train) into CountVectorizer() and return the matrix.

Secondly, we have to transform our testing data (X_test) to return the matrix.

Note that X_train is our training data for the 'sms_message' column in our dataset and we will be using this to train our model.

X_test is our testing data for the 'sms_message' column and this is the data we will be using(after transformation to a matrix) to make predictions on.

We will then compare those predictions with y_test in a later step.

The code for this segment is in 2 parts.

Firstly, we are learning a vocabulary dictionary for the training data and then transforming the data into a document-term matrix; secondly, for the testing data we are only transforming the data into a document-term matrix.

# Instantiate the CountVectorizer methodcount_vector = CountVectorizer()# Fit the training data and then return the matrixtraining_data = count_vector.

fit_transform(X_train)# Transform testing data and return the matrix.

Note we are not fitting the testing data into the CountVectorizer()testing_data = count_vector.

transform(X_test)Naive Bayes Implementation with Sci-Kit LearnWe will be using the multinomial Naive Bayes implementation.

This particular classifier is suitable for classification with discrete features (such as in our case, word counts for text classification).

It takes in integer word counts as its input.

On the other hand Gaussian Naive Bayes is better suited for continuous data as it assumes that the input data has a Gaussian(normal) distribution.

We will import the MultinomialNB classifier and fit the training data into the classifier using fit().

from sklearn.

naive_bayes import MultinomialNBnaive_bayes = MultinomialNB()naive_bayes.

fit(training_data, y_train)Now that our algorithm has been trained using the training data set we can now make some predictions on the test datastored in ‘testing_data’ using predict().

predictions = naive_bayes.

predict(testing_data)Now that predictions have been made on our test set, we need to check the accuracy of our predictions.

Model evaluationThere are various mechanisms for doing so, but first let’s do quick recap of them.

Accuracy: Measures how often the classifier makes the correct prediction.

It’s the ratio of the number of correct predictions to the total number of predictions (the number of test data points).

Precision: Tells us what proportion of messages we classified as spam, actually were spam.

It is a ratio of true positives(words classified as spam, and which are actually spam) to all positives(all words classified as spam, irrespective of whether that was the correct classification).

Precision = [True Positives/(True Positives + False Positives)]Recall(sensitivity): Tells us what proportion of messages that actually were spam were classified by us as spam.

It is a ratio of true positives(words classified as spam, and which are actually spam) to all the words that were actually spam.

Recall = [True Positives/(True Positives + False Negatives)]For classification problems that are skewed in their classification distributions like in our case, for example if we had a 100 text messages and only 2 were spam and the rest 98 weren’t, accuracy by itself is not a very good metric.

We could classify 90 messages as not spam(including the 2 that were spam but we classify them as not spam, hence they would be false negatives) and 10 as spam(all 10 false positives) and still get a reasonably good accuracy score.

For such cases, precision and recall come in very handy.

These two metrics can be combined to get the F1 score, which is weighted average of the precision and recall scores.

This score can range from 0 to 1, with 1 being the best possible F1 score.

We will be using all 4 metrics to make sure our model does well.

For all 4 metrics whose values can range from 0 to 1, having a score as close to 1 as possible is a good indicator of how well our model is doing.

from sklearn.

metrics import accuracy_score, precision_score, recall_score, f1_scoreprint('Accuracy score: ', format(accuracy_score(y_test, predictions)))print('Precision score: ', format(precision_score(y_test, predictions)))print('Recall score: ', format(recall_score(y_test, predictions)))print('F1 score: ', format(f1_score(y_test, predictions)))ConclusionOne of the major advantages that Naive Bayes has over other classification algorithms is its ability to handle an extremely large number of features.

In our case, each word is treated as a feature and there are thousands of different words.

Also, it performs well even with the presence of irrelevant features and is relatively unaffected by them.

The other major advantage it has is its relative simplicity.

Naive Bayes works well right out of the box and tuning it’s parameters is rarely ever necessary.

It rarely overfits the data.

Another important advantage is that its model training and prediction times are very fast for the amount of data it can handle.

.