A Neural Implementation of NBSVM in Keras

Arun Maiya, Jan 30

NBSVM is an approach to text classification proposed by Wang and Manning¹ that takes a linear model such as SVM (or logistic regression) and infuses it with Bayesian probabilities by replacing word count features with Naive Bayes log-count ratios.

Despite their simplicity, NBSVM models have been shown to be both fast and powerful across a wide range of text classification datasets.

In this article, we cover the following:

An NBSVM model is implemented as a neural network using the deep learning framework Keras.

Using the well-studied IMDb movie review dataset, we demonstrate that this Keras implementation achieves a test accuracy of 92.5% with only a few seconds of training. This is competitive with deeper and more sophisticated neural network architectures that take much longer to train. It is 2.1% away from the current state-of-the-art.

Source code and results are available in the form of a Jupyter notebook on GitHub here.

Let’s begin by importing some necessary modules.

import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers.core import Activation
from keras.layers import Input, Embedding, Flatten, dot
from keras.optimizers import Adam
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import load_files

Loading the IMDb Dataset

The IMDb training set consists of 25,000 movie reviews labeled as either positive or negative.

The test set consists of another 25,000 labeled movie reviews.

We will use the first set of 25,000 reviews to train a model to classify movie reviews as positive or negative and evaluate the model on the second set of 25,000 reviews.

The dataset is first loaded as a document-term matrix (DTM) where each row represents a review and each column represents a word from the vocabulary of the entire corpus.

Each “word” here is a string of either one, two, or three consecutive words in a review.

That is, features consist of unigrams, bigrams, and trigrams.

Entries in the matrix are binarized word counts (i.e., 1 means the word appears at least once in the review and 0 means otherwise).
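To make the feature definition concrete, here is a minimal pure-Python sketch of binarized n-gram features on two toy reviews. It is a simplified stand-in for illustration only: the helper `binary_ngram_features` and the toy documents are hypothetical, and the tokenization (lowercase, split on whitespace) is an assumption rather than what the article's vectorizer does.

```python
def binary_ngram_features(text, max_n=3):
    """Return the set of 1..max_n-grams that occur in `text`."""
    tokens = text.lower().split()
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.add(" ".join(tokens[i:i + n]))
    return feats

# two toy "reviews" (hypothetical data)
docs = ["a great movie", "not a great movie"]
doc_feats = [binary_ngram_features(d) for d in docs]

# the vocabulary spans every n-gram seen in the corpus
vocab = sorted(set().union(*doc_feats))

# one row per review, one column per n-gram; entries are 0 or 1
dtm = [[1 if term in feats else 0 for term in vocab] for feats in doc_feats]

for row in dtm:
    print(row)
```

Because the entries only record presence, a phrase that appears ten times in a review produces the same row as one that appears once.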

The IMDb dataset is available for download here.

The PATH_TO_IMDB variable should be set to the full path of the extracted aclImdb folder.

We compute and load this document-term matrix for both the training and test sets.


PATH_TO_IMDB = '/data/aclImdb'  # set this to the full path of your extracted aclImdb folder

def load_imdb_data(datadir):
    # read in training and test corpora
    categories = ['pos', 'neg']
    train_b = load_files(datadir+'/train', shuffle=True, categories=categories)
    test_b = load_files(datadir+'/test', shuffle=True, categories=categories)
    # load_files returns raw bytes; convert each document to str
    train_b.data = [str(x) for x in train_b.data]
    test_b.data = [str(x) for x in test_b.data]
    # binarized unigram, bigram, and trigram features
    veczr = CountVectorizer(ngram_range=(1,3), binary=True, token_pattern=r'\w+', max_features=800000)
    dtm_train = veczr.fit_transform(train_b.data)
    dtm_test = veczr.transform(test_b.data)
    y_train = train_b.target
    y_test = test_b.target
    print("DTM shape (training): (%s, %s)" % (dtm_train.shape))
    print("DTM shape (test): (%s, %s)" % (dtm_test.shape))
    # add 1 to reserve ID 0 for padding
    num_words = len(veczr.vocabulary_) + 1
    print('vocab size:%s' % (num_words))
    return (dtm_train, dtm_test), (y_train, y_test), num_words

(dtm_train, dtm_test), (y_train, y_test), num_words = load_imdb_data(PATH_TO_IMDB)

Converting a Document-Term Matrix to Word ID Sequences

In a binarized document-term matrix, each document is represented as a long one-hot-encoded vector with most entries being zero.

While our neural model could be implemented to accept rows from this matrix as input, we choose to represent each document as a sequence of word IDs with some fixed length, maxlen, by using an embedding layer.

An embedding layer in a neural network acts as a lookup-mechanism that accepts a word ID as input and returns a vector (or scalar) representation of that word.

These representations can either be learned or preset.

In our case, the embedding layer will return preset Naive Bayes log-count ratios for the words represented by word IDs in a document.

A model accepting documents represented as sequences of word IDs trains much faster than one accepting rows from a document-term matrix.

While these two architectures technically have the same number of parameters, the look-up mechanism of an embedding layer reduces the number of features (i.e., words) and parameters under consideration at any iteration.

That is, documents represented as fixed-size sequences of word IDs are much more compact and efficient than large one-hot-encoded vectors from a document-term matrix with binarized counts.
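As a small illustration of this representation, the sketch below converts one binarized row into a fixed-length word ID sequence. The helper `row_to_word_ids` and the toy row are hypothetical; the sketch assumes column indices are shifted by one so that 0 can serve as the padding ID, with padding on the left and truncation keeping the last `maxlen` IDs.

```python
def row_to_word_ids(binary_row, maxlen):
    """Convert one binarized DTM row into a fixed-length word ID sequence."""
    # shift column indices by 1 so that 0 is free to act as the padding ID
    ids = [col + 1 for col, present in enumerate(binary_row) if present]
    if len(ids) < maxlen:
        # left-pad with zeros up to maxlen
        return [0] * (maxlen - len(ids)) + ids
    # otherwise keep only the last maxlen IDs
    return ids[-maxlen:]

# toy row over an 8-word vocabulary (hypothetical data)
row = [0, 1, 0, 0, 1, 1, 0, 0]
print(row_to_word_ids(row, maxlen=5))  # [0, 0, 2, 5, 6]
print(row_to_word_ids(row, maxlen=2))  # [5, 6]
```

The row has eight entries, but only the three non-zero ones survive in the sequence, which is why this form is more compact for a large, sparse vocabulary.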

Here, we convert the document-term matrix to a list of word ID sequences.

def dtm2wid(dtm, maxlen):
    x = []
    nwds = []
    for idx, row in enumerate(dtm):
        seq = []
        # shift column indices by 1 so that 0 can be used for padding
        indices = (row.indices + 1).astype(np.int64)
        data = (row.data).astype(np.int64)
        count_dict = dict(zip(indices, data))
        for k,v in count_dict.items():
            seq.extend([k]*v)
        num_words = len(seq)
        nwds.append(num_words)
        # pad up to maxlen with 0
        if num_words < maxlen:
            seq = np.pad(seq, (maxlen - num_words, 0), mode='constant')
        # truncate down to maxlen
        else:
            seq = seq[-maxlen:]
        x.append(seq)
    nwds = np.array(nwds)
    print('sequence stats: avg:%s, max:%s, min:%s' % (nwds.mean(), nwds.max(), nwds.min()))
    return np.array(x)

maxlen = 2000
x_train = dtm2wid(dtm_train, maxlen)
x_test = dtm2wid(dtm_test, maxlen)

Computing the Naive Bayes Log-Count Ratios

The final data preparation step involves computing the Naive Bayes log-count ratios.

This is more easily done using the original document-term matrix.

These ratios capture how much more likely a word is to appear in a document of one class (i.e., positive) than in a document of the other (i.e., negative).
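To see what such a ratio looks like on a tiny example, here is a hedged pure-Python sketch. The function `log_count_ratios` and the toy matrix are hypothetical; the sketch assumes add-one smoothing of both the per-class word counts and the per-class document counts, and uses dense lists rather than a sparse matrix.

```python
import math

def log_count_ratios(dtm, labels):
    """Smoothed log-count ratio per column, positive (1) vs. negative (0)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    ratios = []
    for j in range(len(dtm[0])):
        pos_count = sum(row[j] for row, y in zip(dtm, labels) if y == 1)
        neg_count = sum(row[j] for row, y in zip(dtm, labels) if y == 0)
        p = (pos_count + 1) / (n_pos + 1)   # smoothed rate in positive docs
        q = (neg_count + 1) / (n_neg + 1)   # smoothed rate in negative docs
        ratios.append(math.log(p / q))
    return ratios

# columns: "great", "boring", "movie" (hypothetical data)
dtm = [[1, 0, 1],   # positive
       [1, 0, 1],   # positive
       [0, 1, 1],   # negative
       [0, 1, 1]]   # negative
labels = [1, 1, 0, 0]

r = log_count_ratios(dtm, labels)
print(r)
```

A word seen only in positive reviews gets a positive ratio, a word seen only in negative reviews gets a negative one, and a word that is equally common in both classes gets a ratio of zero.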

def pr(dtm, y, y_i):
    p = dtm[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

nbratios = np.log(pr(dtm_train, y_train, 1)/pr(dtm_train, y_train, 0))
nbratios = np.squeeze(np.asarray(nbratios))

NBSVM in Keras

We are now ready to define our NBSVM model.

Our model utilizes two embedding layers.

The first, as mentioned above, stores the Naive Bayes log-count ratios.

The second stores learned weights (or coefficients) for each feature (i.e., word) in this linear model.

Our prediction, then, is simply the dot product of these two vectors, passed through a sigmoid.
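The arithmetic behind a single prediction can be sketched in plain Python as follows. This is an illustration of the idea rather than the Keras model itself: `nbsvm_predict`, the ratio values, and the weight values are all hypothetical, and the sketch assumes a sigmoid is applied to the dot product with no bias term.

```python
import math

def nbsvm_predict(word_ids, r, w):
    """Sigmoid of the dot product between NB ratios and learned weights."""
    score = sum(r[i] * w[i] for i in word_ids)
    return 1.0 / (1.0 + math.exp(-score))

# index 0 is the padding ID; its ratio is 0, so padding never affects the score
r = [0.0, 1.1, -1.1, 0.0]   # hypothetical log-count ratios
w = [0.3, 0.5, 0.5, 0.2]    # hypothetical learned weights

p_padded = nbsvm_predict([0, 0, 1, 3], r, w)
p_plain = nbsvm_predict([1, 3], r, w)
print(p_padded, p_plain)
```

Both calls give the same probability, because the padded positions multiply a weight by a ratio of zero.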

def get_model(num_words, maxlen, nbratios=None):
    # setup the embedding matrix for NB log-count ratios
    embedding_matrix = np.zeros((num_words, 1))
    for i in range(1, num_words): # skip 0, the padding value
        if nbratios is not None:
            # if log-count ratios are supplied, then it's NBSVM
            embedding_matrix[i] = nbratios[i-1]
        else:
            # if log-count ratios are not supplied,
            # this reduces to a logistic regression
            embedding_matrix[i] = 1

    # setup the model
    inp = Input(shape=(maxlen,))
    r = Embedding(num_words, 1, input_length=maxlen, weights=[embedding_matrix], trainable=False)(inp)
    x = Embedding(num_words, 1, input_length=maxlen, embeddings_initializer='glorot_normal')(inp)
    x = dot([r,x], axes=1)
    x = Flatten()(x)
    x = Activation('sigmoid')(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001), metrics=['accuracy'])
    return model

This simple model achieves 92.5% accuracy on the IMDb test set with only a few seconds of training on a Titan V GPU.

In fact, the model trains within seconds even on a CPU.

Interestingly, this accuracy is higher than the result reported in the original paper¹ (which was only 91.22% using bigram features).

model = get_model(num_words, maxlen, nbratios=nbratios)
model.fit(x_train, y_train, batch_size=32, epochs=3, validation_data=(x_test, y_test))

These results are competitive with more sophisticated (and deeper) neural network architectures.

Moreover, this model outperforms a number of well-known approaches including Facebook’s fastText architecture.

Note that, when setting nbratios to None, our function get_model sets the embedding matrix, r, to all ones, which reduces the model to a logistic regression.³ Such a logistic regression model yields a lower (but surprisingly respectable) accuracy of 91.6% (versus 92.5% for NBSVM).
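The reduction to logistic regression can be checked on a toy document. In the sketch below, all names and values are hypothetical: with every ratio set to one (and zero for the padding ID), summing the weights of the word IDs in a document gives the same score as a dot product between the weights and the document's binary vector. Like the model above, the sketch assumes no bias term.

```python
def score_from_ids(word_ids, r, w):
    """Score as computed from a padded word ID sequence."""
    return sum(r[i] * w[i] for i in word_ids)

def score_from_binary_vector(x, w):
    """Score of a plain linear model on a binary feature vector."""
    return sum(x_j * w_j for x_j, w_j in zip(x, w))

w = [0.0, 0.4, -0.7, 0.2, 0.9]        # hypothetical weights; index 0 is padding
r_ones = [0.0, 1.0, 1.0, 1.0, 1.0]    # ratios replaced by 1 (0 for padding)
word_ids = [0, 0, 2, 4]               # document contains words 2 and 4
x = [0, 0, 1, 0, 1]                   # the same document as a binary vector

a = score_from_ids(word_ids, r_ones, w)
b = score_from_binary_vector(x, w)
print(a, b)
```

Applying a sigmoid to either score yields the same probability, which is the sense in which the model with all-ones ratios is a logistic regression over binarized features.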

Try it out yourself using our Jupyter notebook on GitHub available here.

This article was inspired by a tweet² from Jeremy Howard in September 2017.

References

¹ Sida Wang and Christopher D. Manning: Baselines and Bigrams: Simple, Good Sentiment and Topic Classification; ACL 2012.

² https://twitter.com/jeremyphoward/status/905841365241565184?lang=en

³ Since, by definition, any document containing a word ID contains that word, if our embedding layer simply returns one (instead of a log-count ratio) for every word ID except zero (where 0 is the dummy ID we used to pad sequences), then our NBSVM model reduces to a “vanilla” logistic regression.

