Word Embedding (Part II)
Intuition and (some) maths to understand the end-to-end GloVe model
Matyas Amrouche · Apr 25

The power of GloVe
The original issue of NLP (Natural Language Processing) is the encoding of a word/sentence into a format a computer can process.
Representation of words in a vector space allows NLP models to learn word meaning.
In our previous post, we saw the Skip-gram model that captures the meaning of words given their local context.
Let’s remember that by context we mean a fixed window of n words surrounding a target word.
In this post, we are going to study the GloVe model (Global Vectors), which embeds words by looking at both their local context and their global corpus statistics.
The main idea behind the GloVe model is to focus on the co-occurrence probabilities (equation 0 below) of words within a corpus of texts to embed them in meaningful vectors.
In other terms, we are going to look at how often a word j appears in the context of a word i across our whole corpus of texts.
To do so, let X be our word-word co-occurrence matrix and X_ij be the number of times word j appears in the context of word i.
Equation 0: The co-occurrence probability of a word j occurring given a word i is the ratio of the number of times word j occurs in the context of word i to the number of times any word appears in the context of word i.
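To make equation 0 concrete, here is a minimal sketch of how one might build the co-occurrence matrix X from a tokenized corpus and compute P(j | i). The corpus, window size and function names are illustrative, not from the original article:

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=2):
    """Count X[i][j]: how often word j appears within `window`
    words of word i, over a list of tokenized sentences."""
    X = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            start = max(0, pos - window)
            end = min(len(sentence), pos + window + 1)
            for ctx_pos in range(start, end):
                if ctx_pos != pos:
                    X[word][sentence[ctx_pos]] += 1.0
    return X

def cooccurrence_probability(X, i, j):
    """Equation 0: P(j | i) = X_ij / X_i, where X_i = sum_k X_ik."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i

# Tiny toy corpus, just to exercise the functions.
corpus = [["ice", "is", "solid", "water"],
          ["steam", "is", "gas", "water"]]
X = cooccurrence_counts(corpus, window=2)
print(cooccurrence_probability(X, "ice", "solid"))
```

In practice the GloVe implementation also decays a co-occurrence count with the distance between the two words, but a flat window keeps the idea clear.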
GloVe will look at the ratio between those co-occurrence probabilities to extract the inner meaning of words.
More specifically, we are going to focus on the last line of the table in Figure 1.
Figure 1: The first two rows of the table show the probabilities of the words solid, gas, water and fashion occurring in the context of the words ice and steam.
The last row shows the probability ratios, which are the key signal the GloVe model learns from under the hood.
For words related to “ice” but not “steam” like “solid”, the ratio will be high.
Conversely, for words related to “steam” but not “ice”, the ratio will be low. For words related to both, or to neither, like “water” and “fashion”, the ratio will be close to 1.
At first glance, it quickly appears that the co-occurrence probability ratios carry more information than the raw probabilities and better capture the relationship between “ice” and “steam”.
Indeed, looking only at raw probabilities, the word “water” best represents the meaning of both “ice” and “steam”, and we would not be able to distinguish the inner meanings of those two words.
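This intuition can be checked numerically. The probabilities below are the ones reported in Table 1 of the original GloVe paper (Pennington et al., 2014); the small helper function is purely illustrative:

```python
# Co-occurrence probabilities reported in Table 1 of the GloVe paper.
P = {
    ("ice", "solid"):   1.9e-4, ("steam", "solid"):   2.2e-5,
    ("ice", "gas"):     6.6e-5, ("steam", "gas"):     7.8e-4,
    ("ice", "water"):   3.0e-3, ("steam", "water"):   2.2e-3,
    ("ice", "fashion"): 1.7e-5, ("steam", "fashion"): 1.8e-5,
}

def ratio(k):
    """P(k | ice) / P(k | steam): large if k relates to ice only,
    small if k relates to steam only, close to 1 otherwise."""
    return P[("ice", k)] / P[("steam", k)]

for k in ("solid", "gas", "water", "fashion"):
    print(k, ratio(k))
```

“solid” gives a ratio well above 1, “gas” well below 1, while “water” and “fashion” sit near 1, exactly the pattern described above.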
Now that we have understood that co-occurrence probability ratios capture relevant information about words’ relationships, the GloVe model aims to build a function F that will predict those ratios given two word vectors w_i and w_j and a context word vector w_k as inputs.
Equation 1: The GloVe model will learn meaningful word vector representations w_i, w_j and w_k to feed F and correctly predict the probability ratios.
The reader willing to get a high-level overview of GloVe might want to skip the following equations (from equation 2 to 6), which dive a bit deeper into how the GloVe model constructs this F function to learn word vector representations.
Let’s see how this F function is built, step by step, to grasp the logic behind the final formula, which looks pretty complex at first sight (equation 6).
To compare the vectors w_i and w_j, which are linear structures, the most intuitive way is to subtract them, so let’s do it.
Equation 2: Comparing two vectors by taking their difference.
We now have two vectors as inputs of F and a scalar on the right-hand side of the equation; mathematically speaking, this adds complexity to the linear structure we want to build if we keep it that way.
It is easier to associate scalar values with scalar values; this way we won’t have to juggle vector dimensions.
Therefore, the GloVe model uses the dot product of the two input vectors.
Equation 3: Scalar values to scalar values thanks to the dot product.
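Written out in the notation of the original GloVe paper (where the tilde marks the context word vector and P_ik = P(k | i)), the two steps above read:

```latex
% Equation 2: compare w_i and w_j by taking their difference
F(w_i - w_j,\; \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}

% Equation 3: the dot product maps the vector arguments to a scalar
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}
```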
All along, we have separated word vectors from context word vectors.
However, this separation is only a matter of point of view.
Indeed, if “water” is a context word to “steam”, then “steam” can be a context word to “water”.
This symmetry of the X matrix (our co-occurrence matrix) has to be taken into account when building F: we must be able to switch w_i and w_k.
First, we need F to be a homomorphism between (ℝ, +) and (ℝ>0, ×), i.e. F(a+b) = F(a)F(b).
Equation 4: Using the homomorphism property of F to associate word vectors dot product (which can be interpreted as similarities between words) to the probability they occur in a same context.
The exponential function is a solution to equation 4, since exp(a-b) = exp(a)/exp(b), so let’s use it.
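Concretely, applying the homomorphism property and then taking F = exp gives (same notation as before, with X_i the total count of words in the context of word i):

```latex
% Equation 4: the homomorphism property splits the difference
F\big((w_i - w_j)^\top \tilde{w}_k\big)
  = \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)}
  = \frac{P_{ik}}{P_{jk}}

% Equation 5: with F = exp, solving for a single term
w_i^\top \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i
```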
Equation 5: Almost symmetric, were it not for the b_i term (which absorbs log X_i, as it does not depend on k).
To restore the symmetry, a bias b_k is added for the vector w_k.
Equation 6: We can express our word vectors given corpus statistics and symmetry is respected (we can switch w_i and w_k).
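This yields the final relation between word vectors and corpus statistics, symmetric under the exchange of i and k:

```latex
% Equation 6: absorb log X_i into a bias b_i and add a bias for w_k
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
```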
Thanks to our F function, we are now able to define a cost/objective function using our word vector representations (equation 7).
During training, GloVe will learn the proper word vectors w_i and w_j to minimise this weighted least squares problem.
Indeed, a weight function f(X_ij) must be used to cap the importance of very common co-occurrences (like “this is”) and to prevent rare co-occurrences (like “snowy Sahara”) from having the same importance as usual ones.
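The weighting function proposed in the GloVe paper is a capped power law; x_max = 100 and alpha = 0.75 are the defaults reported by the authors:

```python
def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): grows with the co-occurrence count but
    is capped at 1, so very frequent pairs don't dominate the loss and
    rare pairs contribute less than usual ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(weight(1))      # rare pair: small weight
print(weight(100))    # at the cap: weight 1
print(weight(10000))  # very common pair: still weight 1
```

Note also that f(0) = 0, so pairs that never co-occur simply drop out of the sum, which keeps training tractable on a sparse X.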
Equation 7: The final GloVe equation.

In summary, the GloVe model uses a meaningful source of knowledge for the word analogy task we ask it to perform: the co-occurrence probability ratios.
Then, it builds an objective function J that associates word vectors to text statistics.
Finally, GloVe minimises this J function by learning meaningful word vector representations.
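The whole pipeline can be sketched in a few lines of NumPy. This is a toy, full-batch gradient-descent version for illustration only (real GloVe uses AdaGrad over sampled non-zero entries of X); the random counts and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                               # toy vocabulary size, embedding dim
X = rng.integers(1, 50, size=(V, V)).astype(float)  # toy co-occurrence counts

W  = 0.1 * rng.standard_normal((V, d))    # word vectors w_i
Wc = 0.1 * rng.standard_normal((V, d))    # context word vectors
b  = np.zeros(V)                          # biases b_i
bc = np.zeros(V)                          # context biases

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function capping the influence of frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def J():
    """Equation 7: sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    err = W @ Wc.T + b[:, None] + bc[None, :] - np.log(X)
    return float(np.sum(f(X) * err ** 2))

initial_loss = J()
lr = 0.02
for _ in range(300):                      # plain full-batch gradient descent
    E = f(X) * (W @ Wc.T + b[:, None] + bc[None, :] - np.log(X))
    W, Wc = W - lr * 2 * E @ Wc, Wc - lr * 2 * E.T @ W
    b, bc = b - lr * 2 * E.sum(axis=1), bc - lr * 2 * E.sum(axis=0)
final_loss = J()
print(initial_loss, final_loss)           # the loss decreases during training
```

After training, the paper recommends using W + Wc as the final word embeddings, since the two sets of vectors are equivalent by symmetry.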
Et voilà !

References and other useful resources:
- The original GloVe paper
- Stanford NLP resources
- A well explained article comparing Word2vec vs GloVe