Word2vec from Scratch with NumPy

Word2vec from Scratch with NumPyHow to implement a Word2vec model with Python and NumPyIvan ChenBlockedUnblockFollowFollowingFeb 17IntroductionRecently, I have been working with several projects related to NLP at work.

Some of them had something to do with training the company’s in-house word embedding.

At work, the tasks were mostly done with the help of a Python library: gensim.

However, I decided to implement a Word2vec model from scratch just with the help of Python and NumPy because reinventing the wheel is usually an awesome way to learn something deeply.

Word EmbeddingWord embedding is nothing fancy but methods to represent words in a numerical way.

More specifically, methods to map vocabularies to vectors.

The most straightforward method could be using one-hot encoding to map each word to a one-hot vector.

Although one-hot encoding is quite simple, there are several downsides.

The most notable one is that it is not easy to measure relationships between words in a mathematical way.

Word2vec is a neural network structure to generate word embedding by training the model on a supervised classification problem.

Such a method was first introduced in the paper Efficient Estimation of Word Representations in Vector Space by Mikolov et al.

,2013 and was proven to be quite successful in achieving word embedding that could used to measure syntactic and semantic similarities between words.

Word2vec (Skip-gram)In Mikolov et al.

,2013, two model architectures were presented, Continuous Bag-of-Words model and Skip-gram model.

I will be diving into the latter in this article.

In order to explain the skip-gram model, I randomly quote a piece of text from a book that I am currently reading, The Little Book of Common Sense Investing by John Bogle:After the deduction of the costs of investing, beating the stock market is a loser’s game.

As I have mentioned above, it is a supervised classification problem that the word2vec model tries to optimize.

More specifically, given a “context word”, we want to train a model such that the model can predict a “target word”, one of the words appeared within a predefined window size from the context word.

Take the above sentence for example, given a context word “investing” and a window size of 5, we would like the model to generate one of the underlying words.

(one of the words in [deduction, of, the costs, beating, stock, market, is] in the case.

)Model OverviewThe following shown the original diagram presented in the paper by Mikolov et al.

,2013.

I made another graph with a little bit more detailsThe word embedding layer is essentially a matrix with a shape of (# of unique words in the corpus, word embedding size).

Each row of the matrix represent a word in the corpus.

Word embedding size is a hyper-parameter to be decided and can be thought as how many features that we would like to use to represent each word.

The latter part of the model is simply a logistic regression in a neural network form.

In the training process, the word embedding layer and the dense layer are being trained such that the model is able to predict target words given a context word at the end of the training process.

After training such a model with a huge amount of data, the word embedding layer will end up becoming a representation of words which could demonstrate many kinds of cool relationships between words in a mathematical way.

(Those who are interested in more details can refer to the original paper.

)Implementation from Scratch with PythonPreparation for Training DataTo generate training data, we tokenize text first.

There are many techniques out there when it comes to tokenize text data, such as getting rid of words appearing in very high or very low frequency.

I just split the text with a simple regex since the focus of the article is not tokenization.

Next, we assign an integer to each word as its id.

In addition, using word_to_id and id_to_word to record the mapping relationships.

Eventually, we generate training data for the model.

For each context word tokens[i], generate: (tokens[i], tokens[i-window_size]), .

, (tokens[i], tokens[i-1]), (tokens[i], tokens[i+1]), .

, (tokens[i], tokens[i+window_size]).

Take context word investing with a window size of 5 for example, we will generate(investing, deduction), (investing, of), (investing, the), (investing, costs), (investing, of), (investing, beating), (investing, the), (investing, stock), (investing, market), (investing, is) .

Note: In the code, the training (x, y) pairs are represented in word ids.

The follow is the code for generating training data:Overview of Training ProcessAfter generating training data, let’s move on to the model.

Similar to the majority of neural network models, the steps to train the word2vec model are initializing weights (parameters that we want to train), propagating forward, calculating the cost, propagating backward and updating the weights.

The whole process will be repeated for several iterations based on how many epochs we want to train.

Initialization of parameters to be trainedThere are two layers in the model needed to be initialized and trained, the word embedding layer and the dense layer.

The shape of the word embedding will be (vocab_size, emb_size) .

Why is that?.If we’d like to use a vector with emb_size elements to represent a vocabulary and the total number of vocabularies our corpus is vocab_size, then we can represent all the vocabularies with a vocab_size x emb_size matrix with each row representing a word.

The shape of the dense layer will be (vocab_size, emb_size) .

How come?.The operation that would be performed in this layer is a matrix multiplication.

The input of this layer will be (emb_size, # of training instances)and we’d like the output to be (vocab_size, # of training instances)(For each word, we would like to know what the probability that the word appears with the given input word).

Note: I do not include biases in the dense layer.

The following is the code for initialization:Forward passThe Upper part shows the forward propagation.

The are three steps in the forward propagation, obtaining input word’s vector representation from word embedding, passing the vector to the dense layer and then applying softmax function to the output of the dense layer.

In some literatures, the input is presented as a one-hot vector (Let’s say an one-hot vector with i-th element being 1).

By multiplying the word embedding matrix and the one-hot vector, we can get the vector representing the input word.

However, the result of performing matrix multiplication is essentially the same as selecting the ith row of the word embedding matrix.

We can save lots of computational time by simply selecting the row associating with the input word.

The rest of the process is just a multi-class linear regression model.

The following graph could be used to recall the main operation of the dense layer.

Afterwards, we apply softmax function to the output of the dense layer which gives us the probability of each word appearing near the given input word.

The following equation could be used to remind what softmax function is.

The following is code for forward propagation:Computation of Cost (L)Here, we would use cross entropy to calculate cost:The following is code for cost computation:Backward pass (Back propagation)During the back propagation process, we would like to calculate gradients of the trainable weights with respect to the loss function and update the weight with its associated gradient.

Back propagation is the methods used to calculate those gradients.

It is nothing fancy but chain rule in Calculus:It is the weights in the dense layer and the word embedding layer that we would like to train.

Therefore we need to calculate gradients for those weights:The next step is to update the weights with the following formula:The following is code for backward propagation:Note: You might have been wondering why there is a factor of 1/m in dL_dW while not in dL_dword_vec .

In each pass, we process m training examples together.

For weights in the dense layer, we would like to update them with the average of the m gradient descents.

For weights in the word vector, each vector has its own weights which lead to its own gradient descent so we do not need to aggregate the m gradient descents while we updating.

Model TrainingTo train the model, repeat the process of forward propagation, backward propagation and weight updating.

During the training, the cost after each epoch should have decreasing trend.

The following is the code for training the model:EvaluationAfter train the model with data generated from the example sentence above with a window size of 3 for 5000 epochs (with a simple learning rate decay), we can see the model can output most neighboring words given each word as an input word.

You can find the end-to-end process in this notebook.

Feel free to download and play around.

It is my first Medium post.

Thank you guys for reading.

Feel free to provide me with feedback or to ask me questions.

It’s fine to stop reading here but if you are interested in some optimization that I found needed when I tried to train with a huger dataset, please continue to read.

Optimization (Space)When I tried to train the model above with a larger dataset, I found the memory consumption kept increasing during the training process and the python kernel finally shut down.

Later on, I figured out the issue had to do with the way I fed the labels Y into the model.

In the original code, each label is a one hot vector which used a bunch of zeros and a single one to represent the labeled output word.

When the vocabulary size grows bigger, we waste so much memory to the zeros that do not provide us useful information.

The memory consumption problem goes away after I start to feed the label with its associated word ind only.

We have decreased the space from O(vocabulary size * m) to O(m).

The following is my code implementation (There are only 2 places needed to be changed):In order optimize training time, the regular softmax above can be replaces with hierarchical softmax.

However, this article has been a little too long here so we will save the topic for next time.

Thank you for reading more.

Again, please feel free to provide me with feedback or to ask me questions.

Updates[2019–02–17] In case that you were interested how to derive the gradient for the output of dense layer with respect to the loss, here is a link to a written note of my derivative: https://github.