Combining supervised learning and unsupervised learning to improve word vectors

Introduction to Generative Pre-Training

Edward Ma
Jan 20

To achieve state-of-the-art results in NLP tasks, researchers have tried many ways to let machines understand language and solve downstream tasks such as textual entailment and semantic classification.
OpenAI released a new model named Generative Pre-Training (GPT).
After reading this article, you will understand:

- Finetuned Transformer LM Design
- Architecture
- Experiments
- Implementation
- Take Away

Finetuned Transformer LM Design

This approach includes 2 steps. First, a model is trained via unsupervised learning on a vast amount of data. Second, a target data set (domain data) is used to fine-tune the model from the previous step via supervised learning.

Unsupervised Learning

There is no denying that a practically unlimited amount of unlabeled text is available for NLP. Radford et al. believe that leveraging such a large corpus helps to train a general-purpose model, just like word2vec (word embeddings) and skip-thought (sentence embeddings). We do not need to worry about the volume of training data because a large corpus is easy to obtain.

However, there is still a limitation. Although we can use as large a corpus as we like, most of the time it is disconnected from our domain data. In my previous work, I noticed that most of the words in my domain data did not exist in many off-the-shelf word embedding models.

Instead of using an RNN architecture, Radford et al. apply the transformer architecture to train the first model, because they believe that the transformer is able to capture longer-range signals (language characteristics). The limitation is high computation time. Since Radford et al. use 12 self-attention blocks and high-dimensional inner states, it takes several weeks to train the initial model even when using GPUs.

Supervised Learning

After that, the target data set (most of the time a small data set compared to the one used in the previous step) is used to fine-tune the model via supervised learning.
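
The two steps can also be written as two objectives. The following is a sketch in the notation of Radford et al. (2018), not something derived in this article: U is the unlabeled corpus of tokens u_i, k is the size of the context window, Θ stands for the model parameters, C is the labeled data set whose examples are token sequences x^1, ..., x^m with label y, and λ is a weight on the auxiliary language-modeling term.

```latex
\begin{align}
L_1(\mathcal{U}) &= \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta) \\
L_2(\mathcal{C}) &= \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m) \\
L_3(\mathcal{C}) &= L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
\end{align}
```

L_1 is maximized in the unsupervised step. In the supervised step the paper maximizes L_3, which keeps the language-modeling objective as an auxiliary term next to the task objective L_2.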

Architecture

Unsupervised Learning Model (Transformer)

A sequence-to-sequence (RNN-based) model has a limitation: it has to compress the input into a fixed-length context vector, which hurts its ability to memorize very long sentences. The attention mechanism was born to overcome this issue. The architecture used here is called the “transformer”, which is built on multi-headed self-attention. The attention family has many variants, and Radford et al. decided to use self-attention.

Multi-head refers to using multiple self-attention heads with different parameters to compute a representation. Think of it as asking multiple experts to help find a better result. The multi-head mechanism executes the same computation with different parameters in parallel. The results computed by the different attention heads are then concatenated and transformed to the desired dimensions.
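
To make the "multiple experts" idea concrete, here is a minimal NumPy sketch of multi-head self-attention. It is only an illustration under my own assumptions: the function name, the weight names (`w_q`, `w_k`, `w_v`, `w_o`) and the toy sizes are hypothetical, the weights are random rather than trained, and the scaling by the square root of the head size follows the usual scaled dot-product formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, d_model); w_q, w_k, w_v: (n_heads, d_model, d_head);
    w_o: (n_heads * d_head, d_model)."""
    heads = []
    # Each head runs the same computation with its own parameters.
    for q_proj, k_proj, v_proj in zip(w_q, w_k, w_v):
        q, k, v = x @ q_proj, x @ k_proj, x @ v_proj
        scores = q @ k.T / np.sqrt(q.shape[-1])
        heads.append(softmax(scores) @ v)
    # Concatenate the heads and transform back to the desired dimension.
    concat = np.concatenate(heads, axis=-1)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4      # toy sizes for illustration only
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(n_heads * d_head, d_model))

out = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)
```

The loop is written for readability. Because the heads do not depend on each other, a real implementation computes them in one batched matrix multiplication.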

Attention borrows an advantage from CNNs: self-attention does not depend on the result of the previous time step, so it can be run in parallel to achieve a lower computation time. Meanwhile, each word is compared directly with every other word in the input instead of only its surroundings. This overcomes the RNN disadvantage of being unable to memorize very long sentences.

Self-attention is one member of the attention mechanism family. The inputs of attention are Q (query), K (key) and V (value). Unlike the other members of the family, all inputs (Q, K and V) come from the same sequence.
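
A minimal NumPy sketch of single-head self-attention shows what "the same sequence" means in practice. To keep it short I assume no learned projections, so Q, K and V are literally the same matrix, and I use scaled dot-product scoring. The model in the paper additionally masks future positions, which this sketch leaves out.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product attention where Q, K and V are all the input x."""
    q = k = v = x                                   # one source for all three inputs
    # Every position is scored against every other position in one matrix product.
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

x = np.random.default_rng(0).normal(size=(4, 8))    # 4 tokens, 8 features (toy sizes)
output, weights = self_attention(x)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1 and says how much one token attends to every token in the input, including distant ones.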

For details of the transformer, see the paper "Attention is all you need" listed at the end of this article.

Going back to the architecture, the input features are the text and the position of the text, which are used to compute word vectors. Position of text refers to the position of each word in the input. The flow is:

1. Text and position are transformed into vectors.
2. The vectors are passed to multi-head self-attention.
3. The results of step 1 and step 2 are combined and normalized.
4. The result is passed to a fully-connected feed-forward network.
5. The results of step 3 and step 4 are combined and normalized.

Finally, this block is stacked (12 blocks in total) to compute the output vectors.
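
The flow above can be sketched in NumPy as follows. This is a toy illustration, not the released implementation: all names and sizes are hypothetical, the weights are random, the layer normalization has no learned gain or bias, ReLU stands in for the GELU activation used in the paper, the mask on future positions is omitted, and only 2 blocks are stacked instead of 12.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model, n_heads, d_ff = 50, 10, 16, 4, 64   # toy sizes
d_head = d_model // n_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector (learned gain and bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(x, p):
    seq_len = x.shape[0]
    def split(w):
        # Project, then split features into heads: (n_heads, seq_len, d_head).
        return (x @ w).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(p["w_q"]), split(p["w_k"]), split(p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                   # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ p["w_o"]

def feed_forward(x, p):
    return np.maximum(0.0, x @ p["w_1"]) @ p["w_2"]       # ReLU as a simplification

def block(x, p):
    x = layer_norm(x + attention(x, p))                   # steps 2 and 3
    return layer_norm(x + feed_forward(x, p))             # steps 4 and 5

def init_block():
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    return {"w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
            "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
            "w_1": w(d_model, d_ff), "w_2": w(d_ff, d_model)}

token_emb = rng.normal(scale=0.02, size=(vocab_size, d_model))
pos_emb = rng.normal(scale=0.02, size=(max_len, d_model))

token_ids = np.array([3, 14, 15, 9, 2])                   # hypothetical token ids
x = token_emb[token_ids] + pos_emb[: len(token_ids)]      # step 1: text + position
for params in [init_block() for _ in range(2)]:           # stacked blocks
    x = block(x, params)
print(x.shape)
```

"Combining" in steps 3 and 5 is a residual addition: the input of each sub-layer is added to its output before normalization, so every block returns one vector per token with the same size as its input.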

Transformer architecture (Radford et al., 2018)

Model specifications are:

- 12 transformer blocks in total
- 768-dimensional states in self-attention
- 3072-dimensional inner states in position-wise feed-forward networks
- Adam optimization with a maximum learning rate of 2.5e-4
- 100 epochs on mini-batches of 64
- Dropout rate of 0.1

Supervised Learning

After training a model in the previous step, the supervised fine-tuning process helps to obtain vectors for the target task. Assuming the input is a sequence of tokens with a label, we can get the tokens' vectors from the pre-trained model.
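
As a rough sketch of this step, the snippet below puts a linear layer and a softmax on top of the last token's vector, which is how I read the fine-tuning head described in Radford et al. (2018). The random `hidden_states` are only a stand-in for the pre-trained model's output, and the names `w_y` and `label` as well as all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_classes = 6, 16, 3     # toy sizes for illustration only

# Stand-in for the pre-trained transformer's output: one vector per input token.
hidden_states = rng.normal(size=(seq_len, d_model))
label = 2                                  # hypothetical gold label of this sequence

# Task-specific parameters that are added for fine-tuning.
w_y = rng.normal(scale=0.02, size=(d_model, n_classes))

logits = hidden_states[-1] @ w_y           # use the vector of the last token
logits = logits - logits.max()             # numerical stability
probs = np.exp(logits) / np.exp(logits).sum()
loss = -np.log(probs[label])               # cross-entropy for the labeled example
print(probs, loss)
```

During fine-tuning this loss would be back-propagated into both `w_y` and the pre-trained transformer weights. The sketch only computes the forward pass.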

Input Transformations for fine-tuning on different tasks (Radford et al., 2018)

Experiments

Experimental Result on Natural Language Inference Tasks (Radford et al., 2018)

Experimental Result on Question Answering and Commonsense Reasoning Tasks (Radford et al., 2018)

Take Away

- Demonstrated the capability of fine-tuning for specific domain data.
- The design of BERT is similar to this model, while BERT further addresses the limitations of this model.
- The authors note that no further enhancement of this architecture design is expected (mentioned on GitHub).

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me on my Medium blog, LinkedIn or GitHub.

Reference

Radford A., Narasimhan K., Salimans T., Sutskever I. Improving Language Understanding by Generative Pre-Training. 2018.
Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A., Kaiser L., Polosukhin I. Attention is all you need. 2017.
Finetuned Transformer LM in tensorflow (Original)
Finetuned Transformer LM in pytorch