Unifying Word Embeddings and Matrix Factorization — Part 1The problems of viewing Word2vec as a neural network, and reviewing Levy & Goldberg’s attempted solution.
Kian Kenyon-DeanBlockedUnblockFollowFollowingFeb 27TL;DRIn this 3-part blog series we present a unifying perspective on pre-trained word embeddings under a general framework of matrix factorization.
The most popular word embedding model, Word2vec, has traditionally been presented as a (shallow) neural network.
By the end of this blog post series, you will have learned how to compute the same Word2vec embeddings with a single matrix factorization.
In the following parts, we’ll show you how to simply compute that matrix factorization using a standard deep learning library such as TensorFlow.
This series is based on my recent work pursued at Mila (Quebec AI Institute) with Edward Newell (PhD), currently under review.
It proceeds as follows:Part 1: In this post, we will offer a reflection motivating word embeddings, followed by an explanation of the most relevant literature and findings relating Word2vec and matrix factorization (MF).
We will then move in the direction of the MF reformulation of Word2vec.
Part 2: Next time, we will describe the full correct MF-generalization for Word2vec, providing the exact form of the algorithm, noting that our generalization encompasses many other embedding algorithms, including GloVe, FastText, and Swivel.
Part 3: In the last part of this series, we will present practical experimental results verifying the correctness of our explicit MF formulation of Word2vec, additionally providing example code on how simply implementable MF is by using existing deep learning libraries.
IntroductionThe year 2018 was accompanied with large advances in Natural Language Processing in the form of contextualized word embeddings.
Models such as ELMO and BERT¹ have received considerable attention (no pun intended) from researchers and industry professionals alike².
However, pre-trained word embeddings are still very much ubiquitous in research and industry.
Contextualized embedding models introduce considerable memory & hardware requirements that may be a bottleneck in low-resource settings, such as mobile NLP models (indeed, fine-tuning BERT-base requires 12GB of GPU RAM³, while storing 10,000 uncompressed 300-dimensional word embeddings requires only 24MB).
It is thus reasonable to, at a minimum, offer a clarifying perspective and practical improvements on the standard algorithms that produce pre-trained word embeddings.
SGNS and Matrix FactorizationHere we present the most relevant details of the SGNS and GloVe algorithms with respect to matrix factorization, but note that these algorithms possess many particular design decisions that are not discussed here.
For a very thorough treatment of such design decisions, I highly recommend this 2016 TAACL paper.
A Brief Reflection: Motivating Word EmbeddingsYou shall know a word by the company it keeps.
(Firth, 1957)While it may be cliché to quote Firth, it is nonetheless insightful to reflect on this famous sentence.
The justification of building pre-trained word embeddings is that such vector representations should possess generic linguistic information that can be useful in downstream tasks.
If word embeddings can anticipate the general patterns of the language, then we would imagine that they can be generally useful for all kinds of NLP problems.
Thus, the most generally applicable task for NLP is to create vector embeddings of words that can predict words that can appear around them.
With Word2vec, or, as it is almost always used in implementation, Skip-gram with Negative Sampling (SGNS)⁴, the objective of context prediction is directly and locally imposed within (what is presented as) a shallow neural network.
Meanwhile, matrix factorization methods such as SVD⁵ and the weighted least-squares optimization of GloVe⁶ work directly on global pre-computed corpus statistics.
These latter methods implicitly construct word embeddings that predict context; or, we could equivalently state (as will be discussed later) that SGNS implicitly constructs word embeddings that predict the global corpus statistics.
Overview of Skip-Gram with Negative Sampling (SGNS)SGNS is widely discussed and presented as a neural network model.
The main objective for SGNS is to predict all of the context words given the current word.
This is a very difficult objective!.In the diagram below, we would be optimizing for the model to use the word “a” to predict each of: “am”, “I”, “neural”, “network”!.This becomes even more difficult when we realize that the vocabulary size is often on the order of hundreds of thousands, if not millions, meaning that, to predict each of the words, we would have to produce a million-dimensional prediction vector and apply the softmax operation upon it at every update!As such, the creators of Skip-gram with negative sampling (SGNS) introduced the “negative sampling” part to deal with this problem.
This is used to randomly sample words that do not appear in the context of the current word; from this, the model attempts to minimize their predicted probability of occurring, acting as a sort of repulsive force that prevents it from diverging to make extremely large vectors that would maximize the objective otherwise.
Therefore, instead of requiring a million-dimensional prediction vector, all that is needed is to draw k negative samples, a hyper-parameter typically between 5 and 15, at each step of learning instead.
This works remarkably well, and brings the computational complexity down from linear (in the vocabulary) to constant-time!An example of the typical SGNS diagram for sentence “am I a neural network”, with my annotations.
When described as a neural network, SGNS is often accompanied with a diagram that looks like the one above.
However, this diagram is somewhat misleading as it suggests that the model actually does predict each of the context words.
If SGNS performed the softmax operation — in which case we would be multiplying the word embedding w with the entire context embedding matrix C— then this diagram would be more correct.
However, the point of negative sampling (the NS in SGNS!) is to avoid performing this expensive softmax operation over the entire vocabulary.
Indeed, as described in their paper⁴, the SGNS objective is not concerned with maximizing “the log probability of the softmax”.
Moreover, SGNS does not use any non-linear activations in the so-called “hidden” layer, which is really just representing an indexed embedding matrix — this all strongly suggests that we are not in a useful conceptual framework when presenting Word2vec as a neural network.
Reframing SGNS as Matrix FactorizationWe will now propose a different conceptual framework for understanding Word2vec.
But let’s first formalize some notation before diving in to the details:Sigma (σ) indicates the common logistic sigmoid function;A word embedding is denoted with w, while a context embedding is denoted with c.
Note that Word2vec, like GloVe, needs separate word and context embeddings even though they correspond to the same vocabulary.
This is analogous to why we need to separate feature vectors (words) from learned weight vectors (contexts) in logistic regression; the only difference is that here we are learning both the features and the weights.
The summation sign below indicates that we are negatively sampling k context word indexes n from the unigram noise distribution U.
We can reframe SGNS by examining its objective function, defined as follows; consider a single word-context pair (j, j+i) encountered while reading through the corpus; e.
, for word “a” (j=2) and context word “neural” (i=1):SGNS objective for a single word-context pair (j, j+i)The intuitions of this loss function are as follows: we want to (1) maximize the dot product between w and c, which is naturally attenuated by the sigmoid function to cause decayed gradients as the dot product gets larger, representing the probability that c appears in w’s context; simultaneously, we would like to (2) minimize the dot product (also attenuated) between a w and the k randomly sampled words that does not appear in the context (that is, the negative sample vector c_n).
[Aside: Such behaviours of loss functions —with maximization and minimization, attraction and repulsion — often emerge in machine learning.
See my recent workshop paper at AAAI 2019 for theoretical discussion and empirical exploration of this phenomenon in the case of deep learning.
]As we can see, the objective function does not actually include the context window in it — the context window size is a hyperparameter.
Therefore, I would like to propose the following diagram to represent SGNS:My personal depiction of SGNS.
Note that the σ is an abstraction indicating both the sigmoid of the dot product of a w and c, along with the general contribution of the loss function.
In my view, this more clearly depicts how SGNS functions than the traditional diagram.
This shows that SGNS does not actually make any predictions at all, unlike the standard diagram.
Fundamentally, this model and the loss function above depict exactly what happens to the embeddings at every step of sampling in the stochastic gradient descent.
In layman’s terms, we see exactly how the model learns every time it reads one word and looks at its surrounding context.
However, this clearly doesn’t look like a neural network, as it is popularly described, so what is it?.Levy & Goldberg offer an answer.
Levy & Goldberg: Local Prediction is Global ApproximationWord2vec (SGNS) is frequently discussed and presented as a neural network; well-informed writers will indicate that it is a “shallow” network.
The most well-informed will be familiar with Levy and Goldberg’s 2014 paper⁷ “Neural word embedding as implicit matrix factorization”.
The authors prove that, despite seeming like a local neural network on the surface, SGNS vectors are implicitly trained to factorize a matrix containing global corpus statistics, similarly to GloVe and singular value decomposition.
This solution is derived by accumulating all of the single-step updates performed by SGNS, which we will dive into in significant detail in Part 2.
The matrix of global corpus statistics is defined, using the notation described above, as the shifted point-wise mutual information matrix, M:The target dot product objective for word vector j and context vector i.
where k is a hyperparameter indicating the number of negative samples used (the default implementation uses k=15).
PMI is a commonly used statistic in NLP that measures how far the observations obtained on a word-context pair is from what would be expected if the words were independent of each other; e.
, the words “costa” and “rica” will have a high PMI because if you see one of them your likelihood of seeing the other is substantially increased.
We will provide a more detailed discussion of PMI and its importance in Part 2.
The equation for PMI is below — note that, if a word-context pair never occurs in the corpus (e.
, their probability P(i,j) = 0), then PMI will equal log(0), or, negative infinity.
Point-wise mutual information between word-context pair i and j; the probability that the occur together (bigram statistic) divided by their unigram probabilities (unigram statistic) — see Part 2 for more on this!So, Levy & Goldberg tell us that Word2vec is implicitly doing matrix factorization of a huge matrix filled these PMI statistics computed from the corpus.
What this intuitively means is that SGNS wants to make word and context embeddings such that their dot product represents how likely it is that they occur together — that is, we shall define the word embedding by the context it keeps across the entire corpus.
This discovery is extremely rich in theoretical rigour, with many important consequences.
The most immediate consequence is that it now seems like we can implement SGNS with a matrix factorization algorithm, so long as we do the following:Precompute the correct shifted-PMI matrix from the corpus; and,Determine the correct corresponding loss function for the factorization.
Naturally, after discovering that SGNS is performing implicit matrix factorization, Levy and Goldberg attempted to replicate SGNS with an explicit matrix factorization formulation.
They proceed as follows:First, they were confronted with the traditional problem of PMI, that PMI(0) equals negative infinity (a rather difficult number to factorize!); thus, they replace it with positive-PMI: PPMI(x) = max(0, PMI(x)).
Second, they use SVD to factorize this matrix, disregarding several fundamental aspects of the SGNS loss function.
Not surprisingly, this approach could not replicate SGNS.
The results in their paper show fundamental differences in performance on word similarity tasks between their algorithm and the original, showing that they did not actually correctly define the explicit matrix factorization.
Indeed, the “only” problems with this approach were (1) they did not use the shifted PMI matrix, and, (2) they did not use the correct loss function!In the Part 2 of this blog-post series we will discuss and provide the correct formulation of the explicit SGNS matrix factorization.
We will also dive into substantial detail on exactly what is the point-wise mutual information matrix, why this statistic is so important (and surprisingly ubiquitous) in word embedding algorithms, and how examining its mathematical properties can lead to many useful practical optimizations when implementing matrix factorization in a real-world setting.
] Mikolov, Tomas, et al.
“Distributed representations of words and phrases and their compositionality.
Note that this paper, which introduces SGNS, has over 10,000 citations, making it the single most influential paper on word embedding algorithms.
] Levy, Omer, Yoav Goldberg, and Ido Dagan.
“Improving distributional similarity with lessons learned from word embeddings.
” TAACL 3.
] Pennington, Jeffrey, Richard Socher, and Christopher Manning.
“Glove: Global vectors for word representation.
] Levy, Omer, and Yoav Goldberg.
“Neural word embedding as implicit matrix factorization.
Firth, John R.
“A synopsis of linguistic theory, 1930–1955.
” Studies in linguistic analysis.
.. More details