Must-Read Tutorial to Learn Sequence Modeling (deeplearning.ai Course #5)

Solving this gives us a 300-dimensional vector that is approximately equal to the embedding of queen.

We can use a similarity function to determine the similarity between two word embeddings as well.

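The similarity function is given by:

$$\text{sim}(u, v) = \frac{u^T v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$

This is the cosine similarity. It is close to 1 when the two vectors point in the same direction.

We can also use the Euclidean distance formula:

$$d(u, v) = \lVert u - v \rVert_2$$

Here a smaller value means the two words are more alike. There are a few other types of similarity measures, which you’ll find at the core of recommendation systems.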
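To make the two measures concrete, here is a minimal NumPy sketch. The three toy vectors are made up for illustration and are not real word embeddings; the function names are ours.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: u.v / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    """Length of the difference vector u - v."""
    return float(np.linalg.norm(u - v))

# Toy 3-dimensional vectors (hypothetical values)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice as long
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

print(cosine_similarity(a, b))    # direction matches, so close to 1
print(cosine_similarity(a, c))    # direction is opposite, so close to -1
print(euclidean_distance(a, b))   # non-zero, because the lengths differ
```

Note that cosine similarity ignores vector length while Euclidean distance does not, which is why `a` and `b` are "identical" under one measure and not under the other.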

Embedding matrix

We actually end up learning an embedding matrix when we implement a word embeddings algorithm.

If we’re given a vocabulary of 10,000 words and each word has 300 features, the embedding matrix, represented as E, has one 300-dimensional column per word. To find the embedding of the word ‘orange’, which is at the 6257th position, we multiply the embedding matrix by the one-hot vector of orange:

$$E \cdot O_{6257} = e_{6257}$$

The shape of E is (300, 10k), and of O6257 is (10k, 1). Hence, the embedding vector e6257 will be of the shape (300, 1).
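The shapes above can be checked with a short sketch. The matrix here is random, standing in for a learned E, and we assume the article counts positions from 1 while NumPy indexes from 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, vocab_size = 300, 10_000
E = rng.standard_normal((n_features, vocab_size))  # random stand-in for a learned matrix

idx = 6257 - 1                     # assumption: "6257th position" is 1-based
o = np.zeros((vocab_size, 1))      # one-hot column vector for 'orange'
o[idx] = 1.0

e_matmul = E @ o                   # the multiplication described in the text
e_lookup = E[:, [idx]]             # picking the column directly gives the same vector

print(e_matmul.shape)
```

Since the one-hot vector is almost entirely zeros, selecting the column directly is how this is usually done in practice; the matrix product is mainly useful for understanding.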

Part 2 – Learning Word Embeddings: Word2Vec & GloVe

Learning Word Embeddings

Consider that we are building a language model using a neural network.

The input to the model is “I want a glass of orange” and we want the model to predict the next word.

We first look up the embedding of each word in the sequence using a pretrained word embedding matrix, and then pass those embeddings to a neural network that ends in a softmax classifier to predict the next word.

The architecture is therefore: one-hot vectors, then the embedding matrix E, then a hidden layer, then a softmax over the vocabulary.

In this example we have 6 input words and each word is represented by a 300-dimensional vector, so the input to the network is 6 * 300 = 1800 dimensional.

The parameters for this model are:

- Embedding matrix (E)
- W[1], b[1] (hidden layer)
- W[2], b[2] (softmax layer)

We can reduce the number of input words to decrease the input dimension. For example, we can have the model use only the previous 4 words to make a prediction. In this case the input will be 4 * 300 = 1200 dimensional.

The input is also referred to as the context, and there are various ways to select it. A few possible ways are:

- Take the last 4 words
- Take 4 words from the left and 4 words from the right
- Take the last 1 word
- Take one nearby word

This is how we can solve a language modeling problem, where we input the context and predict some target word.
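As a rough sketch of how the context turns into the network input, the snippet below concatenates the embeddings of the last few words. The tiny vocabulary, the random matrix and the helper name `context_input` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["i", "want", "a", "glass", "of", "orange", "juice"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
E = rng.standard_normal((300, len(vocab)))  # random stand-in for pretrained embeddings

def context_input(words, window):
    """Concatenate the embeddings of the last `window` words into one vector."""
    last = words[-window:]
    return np.concatenate([E[:, word_to_idx[w]] for w in last])

sentence = "i want a glass of orange".split()
x_all = context_input(sentence, 6)   # all 6 words
x_4 = context_input(sentence, 4)     # only the previous 4 words

print(x_all.shape, x_4.shape)
```

Shrinking the window from 6 to 4 words shrinks the first layer of the network accordingly, which is the trade-off described above.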

In the next section, we will look at how Word2Vec can be applied for learning word embeddings.

Word2Vec

Word2Vec is a simpler and more efficient way to learn word embeddings.

Consider we have a sentence in our training set: I want a glass of orange juice to go along with my cereal.

We use a skip gram model to pick a few context and target words.

In this way we create a supervised learning problem where we have an input and its corresponding output.

For the context, instead of always using the last 4 words or the last 1 word, we randomly pick a word to be the context word and then randomly pick another word within some window (say 5 or 10 words to the left and right) and set that as the target word.

Some of the possible context-target pairs could be:

| Context | Target |
|---|---|
| orange | juice |
| orange | glass |
| orange | my |

These are only a few pairs; we can have many more pairs as well.
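A minimal sketch of this sampling procedure is shown below. The function name, the default window of 5 and the fixed seed are our own choices for illustration.

```python
import random

def sample_skipgram_pairs(tokens, n_pairs, window=5, seed=0):
    """Pick a random context word, then a random target within `window` words of it."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        c = rng.randrange(len(tokens))              # position of the context word
        lo = max(0, c - window)
        hi = min(len(tokens) - 1, c + window)
        candidates = [i for i in range(lo, hi + 1) if i != c]
        t = rng.choice(candidates)                  # position of the target word
        pairs.append((tokens[c], tokens[t]))
    return pairs

tokens = "i want a glass of orange juice to go along with my cereal".split()
pairs = sample_skipgram_pairs(tokens, n_pairs=5)
for context, target in pairs:
    print(context, "->", target)
```

Each printed pair is one training example for the supervised problem: the context word is the input and the target word is the label.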

Below are the details of the model:

Vocab size = 10,000

Now, we want to learn a mapping from some context (c) to some target (t). This is how we do the mapping:

Oc -> E -> ec -> softmax -> y(hat)

Here, ec = E · Oc, and the softmax calculates the probability of getting the target word (t) as output given the context word (c).

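Finally, we calculate the loss as:

$$L(\hat{y}, y) = -\sum_{i=1}^{10{,}000} y_i \log \hat{y}_i$$

where y is the one-hot vector of the target word. Using a softmax function creates a couple of problems for the algorithm, and one of them is computational cost. Every time we calculate the probability:

$$p(t \mid c) = \frac{\exp(\theta_t^T e_c)}{\sum_{j=1}^{10{,}000} \exp(\theta_j^T e_c)}$$

we have to carry out the sum over all 10,000 words in the vocabulary. Here θt is the parameter vector of the softmax layer associated with the output word t. If we use a larger vocabulary of, say, 100,000 words or even more, the computation gets really slow.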
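The cost is easy to see in code. In this sketch the softmax parameters (called `theta` here, one row per vocabulary word) and the context embedding are random placeholders; the point is only that every probability needs a score for every word in the vocabulary.

```python
import numpy as np

def softmax_distribution(theta, e_c):
    """p(t | c) for every t: exp(theta_t . e_c) / sum_j exp(theta_j . e_c)."""
    scores = theta @ e_c             # one dot product per vocabulary word
    scores = scores - scores.max()   # shift for numerical stability; result is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
vocab_size, n_features = 10_000, 300
theta = rng.standard_normal((vocab_size, n_features)) * 0.01  # placeholder parameters
e_c = rng.standard_normal(n_features)                         # placeholder context embedding

p = softmax_distribution(theta, e_c)
print(p.shape, p.sum())
```

Even to read off a single entry of `p`, the denominator forces 10,000 dot products, and this work grows linearly with the vocabulary size.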

One solution to this problem is to use a hierarchical softmax classifier.

So, instead of classifying a word into 10,000 categories in one go, we first classify it into either the first 5,000 categories or the last 5,000 categories, then into one half of that group, and so on.

In this way we do not have to compute the sum over all 10,000 words every time.

The hierarchical softmax classifier therefore takes the form of a binary tree with the words at its leaves.

One question that might arise in your mind is how to choose the context c. One way could be to sample the context word uniformly at random from the corpus.

The problem with this kind of random sampling is that common words like “is” and “the” will appear very frequently, whereas rarer words like “orange” and “apple” might hardly appear at all.

So, we try to choose a sampling method which gives more weight to less frequent words and less weight to more frequent words.

In the next section we will see a technique that helps us to reduce the computation cost and learn much better word embeddings.

Negative Sampling

In the skip-gram model, as we have seen earlier, we map context words to target words, which allows us to learn word embeddings.

One downside of that model was the high computational cost due to the softmax.

Consider the same example that we took earlier: I want a glass of orange juice to go along with my cereal.

Negative sampling creates a new supervised learning problem: given a pair of words, say “orange” and “juice”, we predict whether it is a context-target pair. For the above example, the new supervised learning problem will look like:

| Context (c) | Word (t) | Target (y) |
|---|---|---|
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |

Since orange-juice is a context-target pair, we set the target value to 1, whereas orange-king is not a pair in the above example, and hence its target is 0.

These 0 values mark the negative samples.

We now apply logistic regression to calculate the probability that the pair is a context-target pair.

The probability is given by:

$$P(y = 1 \mid c, t) = \sigma(\theta_t^T e_c)$$

For each positive pair, we generate k negative pairs for training the model.

k can range between 5-20 for smaller datasets, while for larger datasets we choose a smaller k (2-5).

So, if we build a neural network whose input is orange (the one-hot vector of orange), we will have 10,000 possible binary classification problems, each corresponding to a different word from the vocabulary.

So, this network will tell us, for every word in the vocabulary, whether it is a likely target for the context word orange.

Here, instead of having one giant 10,000-way softmax, which is computationally very slow, we have 10,000 binary classification problems, and on each iteration we train only k + 1 of them (one positive and k negatives), which is much cheaper than the softmax.

The context word is chosen from the sequence and, once it is chosen, we randomly pick another word near it in the sequence to be the positive sample and then pick a few random words from the vocabulary as negative samples.

In this way, we can learn word embeddings using simple binary classification problems.
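Here is a small sketch of how one positive pair and k negative pairs could be generated, together with the logistic probability for a pair. The toy vocabulary is made up, and negatives are drawn uniformly and exclude the context and positive words, which is a simplification for illustration.

```python
import random
import numpy as np

def make_examples(context, positive, vocab, k, seed=0):
    """Return one positive (label 1) and k negative (label 0) training triples."""
    rng = random.Random(seed)
    examples = [(context, positive, 1)]
    # Simplification: sample negatives uniformly, skipping the two words already used
    candidates = [w for w in vocab if w not in (context, positive)]
    for word in rng.sample(candidates, k):
        examples.append((context, word, 0))
    return examples

def pair_probability(theta_t, e_c):
    """P(y = 1 | c, t) = sigmoid(theta_t . e_c)."""
    return float(1.0 / (1.0 + np.exp(-np.dot(theta_t, e_c))))

vocab = ["orange", "juice", "king", "book", "the", "of", "glass", "cereal"]
examples = make_examples("orange", "juice", vocab, k=4)
for example in examples:
    print(example)
```

Training then updates only the k + 1 logistic units touched by these examples, instead of all the outputs of a full softmax.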

Next we will see an even simpler algorithm for learning word embeddings.

GloVe word vectors

We will work on the same example: I want a glass of orange juice to go along with my cereal.

Previously, we were sampling pairs of words (context and target) by picking two words that appear in close proximity to each other in our text corpus.

GloVe, or Global Vectors for word representation, makes this more explicit.

Let’s say:

Xij = number of times i appears in the context of j

Here, i plays the role of the target (t) and j plays the role of the context (c).

GloVe minimizes the following:

$$\sum_{i=1}^{10{,}000} \sum_{j=1}^{10{,}000} f(X_{ij}) \left( \theta_i^T e_j + b_i + b'_j - \log X_{ij} \right)^2$$

Here, f(Xij) is the weighting term.

It gives less weight to very frequent words (such as stop words like this, is, of, a, ...) and more weight to less frequent words.

Also, f(Xij) = 0 when Xij = 0, so pairs that never occur together drop out of the sum.

It has been found that minimizing the above equation leads to good word embeddings.
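The objective can be written down in a few lines of NumPy. This is only a sketch of the loss value, with no training loop; the particular weighting function (with `x_max = 100` and `alpha = 0.75`) is the one commonly quoted from the GloVe paper and is an assumption here, since the article does not specify f.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting term f(X_ij): 0 when the count is 0, capped at 1 for frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, theta, e, b_theta, b_e):
    """Sum over i, j of f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    # log(0) is undefined; those entries are multiplied by f = 0, so use a safe stand-in
    log_X = np.log(np.where(X > 0, X, 1.0))
    pred = theta @ e.T + b_theta[:, None] + b_e[None, :]
    return float(np.sum(weight(X) * (pred - log_X) ** 2))

# Tiny made-up co-occurrence matrix for a 2-word vocabulary
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
theta = np.zeros((2, 3))     # target-side vectors (all zeros for a hand-checkable value)
e = np.zeros((2, 3))         # context-side vectors
b_theta = np.zeros(2)
b_e = np.zeros(2)

loss = glove_loss(X, theta, e, b_theta, b_e)
print(loss)
```

With all parameters at zero, only the two non-zero counts contribute, each adding f(2) * (log 2)^2, which makes the value easy to verify by hand.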

Now, we have seen many algorithms for learning word embeddings.

Next we will look at some applications of word embeddings.

Part 3 – Applications using Word Embeddings

Sentiment Classification

You must already be well aware of what sentiment classification is, so I’ll make this quick.

Check out the table below, which contains some text and its corresponding sentiment:

| X (text) | y (sentiment) |
|---|---|
| The dessert is excellent. | **** |
| Service was quite slow. | ** |
| Good for a quick meal, but nothing special. | *** |
| Completely lacking in good taste | * |

The applications of sentiment classification are varied, diverse and HUGE.

But in most cases you’ll encounter, the training data doesn’t come labelled.

This is where word embeddings come to the rescue.

Let’s see how we can use word embeddings to build a sentiment classification model.

We have the input as: “The dessert is excellent”.

Here, E is the pretrained embedding matrix, learned from a corpus of, say, 100 billion words.

We multiply the one-hot encoded vector of each word by the embedding matrix to get the word representations.

Next, we average (or sum) all these embeddings and apply a softmax classifier to decide what the rating of that review should be.

Because this only takes the mean over all the words and ignores their order, a negative review that contains more positive words, such as “Completely lacking in good taste”, might be given a higher rating than it deserves.

Not a great idea.
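The weakness is easy to demonstrate. In the sketch below, the embeddings and the softmax weights are random and untrained (so the predicted ratings mean nothing); what matters is that shuffling the words of a review cannot change the output.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dessert", "is", "excellent",
         "completely", "lacking", "in", "good", "taste"]
word_to_idx = {w: i for i, w in enumerate(vocab)}
E = rng.standard_normal((300, len(vocab)))   # stand-in for pretrained embeddings
W = rng.standard_normal((5, 300)) * 0.01     # untrained softmax weights, 5 star ratings
b = np.zeros(5)

def predict_rating_probs(sentence):
    """Average the word embeddings, then apply a softmax over the 5 ratings."""
    idxs = [word_to_idx[w] for w in sentence.lower().split()]
    avg = E[:, idxs].mean(axis=1)
    scores = W @ avg + b
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

p1 = predict_rating_probs("Completely lacking in good taste")
p2 = predict_rating_probs("good taste lacking in completely")  # same words, shuffled
print(np.allclose(p1, p2))
```

The model sees "good" and "taste" but has no way of knowing that "lacking" negates them, because the average throws the word order away.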

So instead of just summing up the embeddings to get the output, we can use an RNN, which reads the words in order, for sentiment classification.

This is a many-to-one problem where we have a sequence of inputs and a single output.

You are now well equipped to solve this problem.

Module 3: Sequence Models & Attention Mechanism

Welcome to the final module of the series! Below are the two objectives we will primarily achieve in this module:

- Understanding the attention mechanism
- Understanding where the model should focus its attention given an input sequence

Basic Models

I’m going to keep this section industry relevant, so we’ll cover models which are useful for applications like machine translation, speech recognition, etc.

Consider this example – we are tasked with building a sequence-to-sequence model where we want to input a French sentence and translate it into English.

In this problem, x<1>, x<2>, ... are the inputs (the French words) and y<1>, y<2>, ... are the outputs (the English words).

To build a model for this, we have an encoder part which takes an input sequence.

The encoder is built as an RNN, or LSTM, or GRU.

After the encoder part, we build a decoder network which takes the encoding output as input and is trained to generate the translation of the sentence.

This network is popularly used for Image Captioning as well.

As input, we have the image’s features (generated using a convolutional neural network).

Picking the most likely sentence

The decoder model of a machine translation system is quite similar to a language model.

But there is one key difference between the two.

In a language model, we start with a vector of all zeros, whereas in machine translation, the decoder starts from the output of an encoder network. The machine translation model is therefore a conditional language model, where we are calculating the probability of the outputs given an input:

$$P(y^{<1>}, \ldots, y^{<T_y>} \mid x^{<1>}, \ldots, x^{<T_x>})$$

Now, for a given input sentence, we can have multiple candidate translations, and we want the best translation out of all of them.

The good news? There is an algorithm that helps us choose the most likely translation.

Beam Search

This is one of the most commonly used algorithms for generating the most likely translation.

The algorithm can be understood using the 3 steps below.

Step 1: It picks the first translated word and calculates its probability:

$$P(y^{<1>} \mid x)$$

Instead of just picking one word, we can set a beam width (B) to, say, B = 3.

It will pick the top 3 words that can possibly be the first translated word.

These three words are then stored in the computer’s memory.

Step 2: Now, for each word selected in step 1, the algorithm calculates the probability of what the second word could be:

$$P(y^{<1>}, y^{<2>} \mid x) = P(y^{<1>} \mid x) \, P(y^{<2>} \mid x, y^{<1>})$$

If the beam width is 3 and there are 10,000 words in the vocabulary, the total number of possible combinations will be 3 * 10,000 = 30,000.

We evaluate all these 30,000 combinations and pick the top 3 combinations.

Step 3: We repeat this process until we get to the end of the sentence.

By adding one word at a time, beam search decides the most likely translation for any given sentence.
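The three steps can be sketched in plain Python. The "model" here is a hypothetical lookup table of made-up next-word probabilities, not a trained RNN, and the `<eos>` end-of-sentence token is our own convention. The numbers are chosen so that a beam width of 1 (greedy search) misses the best sentence while a beam width of 2 finds it.

```python
import math

# Hypothetical next-word probabilities, keyed by the words generated so far
TABLE = {
    (): {"in": 0.6, "jane": 0.4},
    ("in",): {"september": 0.5, "africa": 0.5},
    ("jane",): {"visits": 0.9, "is": 0.1},
}

def toy_next_probs(tokens):
    """Return P(next word | words so far); unknown prefixes end the sentence."""
    return TABLE.get(tokens, {"<eos>": 1.0})

def beam_search(next_probs, beam_width, max_len, eos="<eos>"):
    beams = [((), 0.0)]           # (words so far, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Extend every kept hypothesis by every possible next word
        candidates = []
        for tokens, score in beams:
            for word, p in next_probs(tokens).items():
                candidates.append((tokens + (word,), score + math.log(p)))
        # Keep only the top `beam_width` extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)        # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_next_probs, beam_width=1, max_len=5)
best_tokens, best_score = beam_search(toy_next_probs, beam_width=2, max_len=5)
print(greedy_tokens, math.exp(greedy_score))
print(best_tokens, math.exp(best_score))
```

Greedy search commits to "in" because it is the most likely first word, whereas the wider beam keeps "jane" alive long enough to discover that "jane visits" has the higher overall probability.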

Let’s look at some of the refinements we can do to beam search in order to make it more effective.

Refinements to Beam Search

Beam search maximizes this probability:

$$\arg\max_y \prod_{t=1}^{T_y} P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

This probability is calculated by multiplying the probabilities of the individual words.

Since these probabilities are small numbers (between 0 and 1), multiplying many of them together gives a final value so small that it causes numerical underflow.

So, instead we can use the following formula:

$$\arg\max_y \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

So, instead of maximizing the product, we maximize the log of the product, which is a sum of logs.

Even with this objective function, every extra word adds another negative term, so longer translations are pushed towards more negative values and shorter ones are unfairly favoured. Hence we can normalize the function as:

$$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

Here α is a hyperparameter: α = 1 normalizes fully by the sentence length and α = 0 does not normalize at all. So, for all the sentences selected using beam search, we calculate this normalized log likelihood and then pick the sentence which gives the highest value.
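Both refinements can be seen with a few made-up numbers. In this sketch the per-word probabilities are hypothetical, and the exponent `alpha` is an assumed knob for how strongly we normalize by length (1.0 means dividing by the number of words, 0.0 means no normalization).

```python
import math

# Multiplying many small probabilities underflows to exactly 0.0 in floating point,
# while the sum of their logs stays perfectly usable.
tiny = [0.01] * 200
product = math.prod(tiny)
log_sum = sum(math.log(p) for p in tiny)
print(product, log_sum)

def normalized_log_likelihood(word_probs, alpha=1.0):
    """Sum of log-probabilities divided by (number of words) ** alpha."""
    total = sum(math.log(p) for p in word_probs)
    return total / (len(word_probs) ** alpha)

shorter = [0.5, 0.5]              # 2 words, less confident per word
longer = [0.6, 0.6, 0.6, 0.6]     # 4 words, more confident per word

# Without normalization the shorter sentence wins just because it has fewer terms
print(normalized_log_likelihood(shorter, alpha=0.0),
      normalized_log_likelihood(longer, alpha=0.0))
# With normalization the per-word quality decides, and the longer sentence wins
print(normalized_log_likelihood(shorter), normalized_log_likelihood(longer))
```

This is the bias the normalization removes: the raw log likelihood penalizes every additional word, regardless of how good that word is.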

There is one more detail that I would like to share: how to decide the beam width B. If the beam width is larger, we will get better results but the algorithm will become slower.

On the other hand, choosing a smaller B will make the algorithm run faster, but the results may be less accurate.

There is no hard rule to choose beam width and it can vary according to the applications.

We can try different values and then choose the one that gives the best results.

Error analysis in beam search

Beam search is an approximation algorithm which outputs the most likely translations it can find for the given beam width.

But it will not necessarily generate the correct translation every time.

If we are not getting the correct translations, we have to analyse whether beam search or our RNN model is causing the problem.

If we find that the beam search is causing the problem, we can increase the beam width and hopefully we will get better results.

How do we decide whether we should focus on improving the beam search or the model? Suppose the actual translation is:

Jane visits Africa in September (y*)

And the translation that we got from the algorithm is:

Jane visited Africa last September (y(hat))

The RNN will compute P(y* | x) and P(y(hat) | x).

Case 1: P(y* | x) > P(y(hat) | x)

This means beam search chose y(hat) even though y* attains a higher probability.

So, beam search is at fault and we might consider increasing the beam width.

Case 2: P(y* | x) <= P(y(hat) | x)

This means that y* is a better translation than y(hat), but the RNN predicted the opposite.

Here, the RNN is at fault and we have to improve the model.

So, for each translation, we decide whether RNN is at fault or the beam search.

Finally we figure out what fraction of errors are caused due to beam search vs RNN model and update either beam search or RNN model based on which one is more at fault.
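The bookkeeping for this analysis is simple. In the sketch below, the probability pairs are invented for illustration; in practice they would come from running the trained RNN on each mistranslated sentence.

```python
from collections import Counter

def blame(p_star, p_hat):
    """Case 1: P(y*|x) > P(y_hat|x) -> beam search; Case 2: otherwise -> RNN."""
    return "beam search" if p_star > p_hat else "RNN"

# Hypothetical (P(y*|x), P(y_hat|x)) pairs for four mistranslated sentences
errors = [
    (2e-10, 1e-10),
    (1e-10, 3e-10),
    (5e-11, 5e-11),
    (4e-12, 9e-12),
]

counts = Counter(blame(p_star, p_hat) for p_star, p_hat in errors)
fractions = {cause: n / len(errors) for cause, n in counts.items()}
print(fractions)
```

With these made-up numbers most errors are attributed to the RNN, so improving the model would be the better use of time than widening the beam.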

In this way we can improve the translations.

Attention Model

Up to this point, we have seen the encoder-decoder architecture for machine translation, where one RNN reads the input and the other one outputs a sentence.

But when we get very long sentences as input, it becomes very hard for the model to memorize the entire sentence.

Attention models instead work on one part of the long sentence at a time: while generating each output word, they focus only on the relevant part of the input, then move on to the next part, and so on.

We use an alpha parameter to decide how much attention should be given to a particular input word while we generate the output.

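⍺<1,2> = how much attention should be given to the second input word while generating the first word

So, for generating the first output y<1>, we take the attention weights ⍺<1,1>, ⍺<1,2>, ... for each input word.

This is how we compute the attention:

$$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t''=1}^{T_x} \exp(e^{<t,t''>})}$$

Here e<t,t'> is a score produced by a small neural network, and the softmax makes the weights for each output word sum to 1. If we have Tx input words and Ty output words, then the total number of attention weights will be Tx * Ty.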
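As a rough illustration, the sketch below turns a table of scores into attention weights with a softmax over the input positions, and uses them to form a weighted sum of the encoder activations. The scores and activations are random placeholders; in a real model the scores are learned and depend on the decoder state, so they are computed one output step at a time.

```python
import numpy as np

def attention_weights(scores):
    """Softmax over input positions; scores has shape (Ty, Tx), each row sums to 1."""
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Tx, Ty, n_a = 5, 4, 8
scores = rng.standard_normal((Ty, Tx))   # placeholder for the learned scores
a = rng.standard_normal((Tx, n_a))       # placeholder encoder activations, one row per input word

alpha = attention_weights(scores)        # alpha[t, t2]: attention of output t on input t2
context = alpha @ a                      # one weighted sum of activations per output word

print(alpha.shape, context.shape)
```

The weight matrix has Ty rows and Tx columns, which is where the Tx * Ty count comes from.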

You might already have gathered this – Attention models are one of the most powerful ideas in deep learning.

End Notes

Sequence models are awesome, aren’t they? They have a ton of practical applications – we just need to know the right technique to use in specific situations.

And my hope is that you will have learned those techniques in this guide.

Word embeddings are a great way to represent words, and we saw how these word embeddings can be built and used.

We have gone through different applications of word embeddings and finally we covered attention models as well which are one of the most powerful ideas for building sequence models.

If you have any query or feedback related to the article, feel free to share them in the comments section below.