How BERT leverages the attention mechanism and transformer to learn contextual relations between words

Introduction to BERT

Edward Ma, Jan 5

After ELMo (Embeddings from Language Model) and OpenAI GPT (Generative Pre-trained Transformer), Google released a new state-of-the-art NLP paper.

They call this approach BERT (Bidirectional Encoder Representations from Transformers).

Both OpenAI GPT and BERT use the transformer architecture to learn text representations.

One difference is that BERT uses a bidirectional transformer (both left-to-right and right-to-left directions) rather than a unidirectional transformer (left-to-right direction only).

On the other hand, both ELMo and BERT use a bidirectional language model to learn text representations.

However, ELMo uses a shallow concatenation layer while BERT uses a deep neural network.

After reading this post, you will understand:

- BERT Design and Architecture
- Model Training
- Experiments
- Implementation
- Take Away

BERT Design and Architecture

Input Representation

BERT uses three embeddings to compute the input representation.

They are token embeddings, segment embeddings and position embeddings.

“CLS” is the reserved token that marks the start of the sequence, while “SEP” separates segments (or sentences).

Those inputs are:

- Token embeddings: general word embeddings. In short, each token (or word) is represented by a vector. You can check out this story for details.
- Segment embeddings: in other words, sentence embeddings. If the input includes two sentences, each word is assigned the embedding of the sentence it belongs to. If the input includes only one sentence, one and only one sentence embedding is used. Segment embeddings are learned together with the rest of the model during pre-training. For sentence embeddings, you can check out this story for details.
- Position embeddings: these refer to the position of each token in the input sequence. Even if there are two sentences, the position keeps accumulating across them.
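To make the three embeddings concrete, here is a minimal sketch in plain Python of how they could be combined by element-wise addition. The tiny vocabulary, the 4-dimensional vectors and the random lookup tables are assumptions for illustration only; in BERT these tables are learned and much larger.

```python
import random

random.seed(0)
HIDDEN = 4  # toy vector size (assumption); BERT-Base uses 768

def make_table(rows, dim=HIDDEN):
    """Random lookup table standing in for a learned embedding matrix."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]

# Hypothetical toy vocabulary.
vocab = {"[CLS]": 0, "[SEP]": 1, "i": 2, "am": 3, "learning": 4,
         "nlp": 5, "nlg": 6, "is": 7, "part": 8, "of": 9}
token_table = make_table(len(vocab))
segment_table = make_table(2)    # one row for sentence A, one for sentence B
position_table = make_table(16)  # one row per position in the sequence

def encode(sentence_a, sentence_b=None):
    """Return tokens, segment ids, position ids and the summed input vectors."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    segments = [0] * len(tokens)
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
        segments += [1] * (len(sentence_b) + 1)
    # Positions keep counting across both sentences.
    positions = list(range(len(tokens)))
    vectors = []
    for tok, seg, pos in zip(tokens, segments, positions):
        rows = zip(token_table[vocab[tok]], segment_table[seg], position_table[pos])
        vectors.append([t + s + p for t, s, p in rows])
    return tokens, segments, positions, vectors

tokens, segments, positions, vectors = encode(
    ["i", "am", "learning", "nlp"], ["nlg", "is", "part", "of", "nlp"])
print(list(zip(tokens, segments, positions)))
```

Each input vector is simply the sum of one row from each table, so the same word gets a different input representation depending on its sentence and its position.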

BERT Input Representation (Devlin et al., 2018)

Training Tasks

After covering the input representation, I will introduce how BERT is trained.

It uses two tasks to achieve this.

The first training task is a masked language model, while the second is next sentence prediction.

Masked Language Model

The first pre-training task leverages a masked language model (Masked LM).

Rather than a traditional directional model, BERT uses a bidirectional model as its pre-training objective.

If a bidirectional model were trained with the traditional approach, each word would be able to see “itself” indirectly.

Therefore, BERT uses the Masked Language Model (MLM) approach.

It masks some tokens at random and uses the other tokens to predict the masked ones, learning the representations in the process.

Unlike other approaches, BERT predicts only the masked tokens rather than the entire input.

In the experiments, 15% of the tokens are picked at random to be replaced.

However, there are some downsides.

The first disadvantage is that the [MASK] token (which replaces the actual token) is never seen in the fine-tuning stage or in actual prediction.

Therefore, in Devlin et al., the token selected for masking is not always masked. Instead:

- A: 80% of the time, it is replaced by the [MASK] token
- B: 10% of the time, it is replaced by another actual token
- C: 10% of the time, it is kept as the original

For example, the original sentence is “I am learning NLP”.

Assume “NLP” is the token selected for masking.

Then 80% of the time, it will show as “I am learning [MASK]” (Scenario A).

10% of the time, it will show as “I am learning OpenCV” (Scenario B).

The remaining 10% of the time, it will show as the original, “I am learning NLP” (Scenario C).

Random replacement (Scenario B) does occur and may harm the meaning of the sentence.

But it affects only 1.5% of all tokens (15% of the tokens in the entire data set are selected, and 10% of that 15% are randomly replaced), so the authors believe it will not harm the model.
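Scenarios A, B and C can be sketched in a few lines of Python. This is a simplified illustration, not the original implementation: it selects each token independently with probability 0.15, and the toy vocabulary and the function name mask_tokens are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)  # not selected: left untouched, not predicted
            labels.append(None)
            continue
        labels.append(token)         # selected: the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # Scenario A: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # Scenario B: random token
        else:
            corrupted.append(token)              # Scenario C: keep the original
    return corrupted, labels

rng = random.Random(0)
vocab = ["I", "am", "learning", "NLP", "OpenCV"]  # hypothetical toy vocabulary
tokens = ["I", "am", "learning", "NLP"] * 2500    # 10,000 toy tokens
corrupted, labels = mask_tokens(tokens, vocab, rng)

selected = sum(label is not None for label in labels)
print("share of tokens selected:", selected / len(tokens))
print("expected share randomly replaced:", round(0.15 * 0.10, 3))
```

Only the selected positions carry a label, which is why the model is trained to predict a small part of each input rather than all of it.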

Another downside is that only 15% of the tokens are masked (predicted) per batch, so training takes longer.

Next Sentence Prediction

The second pre-training task is to predict the next sentence.

This task overcomes a limitation of the first task, which cannot learn the relationship between sentences.

The objective is very simple: classify whether the second sentence is the next sentence or not.

For example:

Input 1: I am learning NLP.

Input 2: NLG is part of NLP.

The expected output is either isNextSentence or notNextSentence.

When generating training data for this task, 50% of the examples are “notNextSentence” pairs whose second sentence is selected at random.
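The 50/50 construction of training pairs might look roughly like the sketch below. It is a simplification under stated assumptions: the negative sentence is drawn from the same small list rather than from another document, and the last two sentences and the function name make_nsp_pairs are made up for illustration.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (first, second, label) examples from an ordered list of sentences."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], "isNextSentence"))
        else:
            # Negative example: any sentence except the current one and the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "notNextSentence"))
    return pairs

sentences = [
    "I am learning NLP.",
    "NLG is part of NLP.",
    "BERT is a language model.",     # hypothetical extra sentence
    "It is pre-trained on text.",    # hypothetical extra sentence
]
pairs = make_nsp_pairs(sentences, random.Random(0))
for first, second, label in pairs:
    print(label, "|", first, "|", second)
```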

Model Training

BERT is trained in two phases: a first training pass on a generic data set, followed by fine-tuning on a domain-specific data set.

Pre-training phase

In the pre-training phase, sentences are retrieved from BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2500M words).

Masked LM: sequences of 512 tokens (2 concatenated sentences) are used, with 256 sequences per batch.

The model is trained for approximately 40 epochs.
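As a rough sanity check, the figures above imply a number of training steps. The sketch below treats one word as one token, which is an assumption (sub-word tokenization produces more tokens than words), so the result is only an order-of-magnitude estimate.

```python
sequences_per_batch = 256
tokens_per_sequence = 512
corpus_words = 800_000_000 + 2_500_000_000  # BooksCorpus + English Wikipedia
epochs = 40

tokens_per_batch = sequences_per_batch * tokens_per_sequence
# Assumption: one word counts as one token.
steps = epochs * corpus_words / tokens_per_batch

print("tokens per batch:", tokens_per_batch)
print("approximate training steps:", round(steps))
```

Under these assumptions the estimate comes out at roughly one million steps.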

The configuration is:

- Adam with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999
- L2 weight decay of 0.01
- 0.1 dropout for all layers
- gelu as the activation function

As described before, two sentences are selected for the “next sentence prediction” pre-training task.

50% of the time the second sentence is picked at random and marked as “notNextSentence”, while 50% of the time it is the actual next sentence.

This step was done by the Google research team, and we can leverage the pre-trained model and fine-tune it further on our own data.
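For readers unfamiliar with the gelu activation named in the configuration, here is a minimal sketch of its exact form, x multiplied by the standard normal CDF at x. Implementations often use a tanh-based approximation instead; this version is for illustration, and the relu function is included only for comparison.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """ReLU, shown only for comparison."""
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu(x), 4), relu(x))
```

Unlike ReLU, GELU is smooth around zero and lets small negative values pass through slightly attenuated rather than cutting them off.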

Fine-tuning phase

Only some hyperparameters are changed, such as batch size, learning rate and number of training epochs; most model hyperparameters are kept the same as in the pre-training phase.

During the experiments, the following ranges of values worked well across tasks:

- Batch Size: 16, 32
- Learning Rate: 5e-5, 3e-5, 2e-5
- Number of epochs: 3, 4

The fine-tuning procedure differs depending on the downstream task.
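The ranges listed above form a small search grid. A sketch of enumerating it is shown below; the dictionary keys mirror the command-line flag names used later in this post, and how each configuration is actually run is left out.

```python
from itertools import product

batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [3, 4]

# One dictionary per combination of the three ranges.
grid = [
    {"train_batch_size": b, "learning_rate": lr, "num_train_epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

print(len(grid))  # 2 * 3 * 2 = 12 combinations
print(grid[0])
```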

Classification

Single Sentence Classification Task (Devlin et al., 2018)

The final hidden state of the [CLS] token is fed into the classification layer.

Label (C) probabilities are computed with a softmax.

After that, the model is fine-tuned to maximize the log-probability of the correct label.
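The classification step can be sketched as a single linear layer followed by a softmax. All numbers below (the 3-dimensional [CLS] vector and the weights for two labels) are made-up values for illustration; a real model would use the 768-dimensional hidden state and learned weights.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights):
    """weights has one row per label; each logit is a dot product with the [CLS] vector."""
    logits = [sum(w * c for w, c in zip(row, cls_vector)) for row in weights]
    return softmax(logits)

cls_vector = [0.2, -0.1, 0.4]   # hypothetical final hidden state of [CLS]
weights = [[0.5, 0.1, -0.3],    # hypothetical classification layer, label 0
           [-0.2, 0.3, 0.8]]    # label 1

probs = classify(cls_vector, weights)
correct_label = 1
# Minimizing this loss is the same as maximizing the log-probability of the correct label.
loss = -math.log(probs[correct_label])
print(probs, loss)
```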

Named Entity Recognition

NER Task (Devlin et al., 2018)

The final hidden representation of each token is fed into the classification layer.

The prediction for a token is not conditioned on the predictions for surrounding words.

In other words, the classification focuses only on the token itself, and no Conditional Random Field (CRF) is used.
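Token-level classification without a CRF amounts to picking the best label for each token on its own. The sketch below uses made-up scores and a hypothetical three-label tag set to show the idea.

```python
def tag_tokens(token_scores, labels):
    """Pick the highest-scoring label for each token independently (no CRF)."""
    tags = []
    for scores in token_scores:
        best = max(range(len(labels)), key=lambda i: scores[i])
        tags.append(labels[best])
    return tags

labels = ["O", "B-PER", "I-PER"]  # hypothetical tag set
tokens = ["Jim", "Henson", "was", "a", "puppeteer"]
token_scores = [                  # hypothetical classifier scores, one row per token
    [0.1, 2.0, 0.3],
    [0.2, 0.4, 1.9],
    [2.5, 0.1, 0.1],
    [2.2, 0.0, 0.1],
    [1.8, 0.2, 0.3],
]

tags = tag_tokens(token_scores, labels)
print(list(zip(tokens, tags)))
```

Because each decision ignores the neighbouring tags, nothing in this step prevents an invalid sequence such as I-PER directly after O; the contextual hidden states are what carry the information about surrounding words.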

Experiments

So far, BERT delivers the best results compared with other state-of-the-art NLP models.

Experiment Result on GLUE dataset (Devlin et al., 2018)

Experiment Result on SQuAD (Devlin et al., 2018)

Implementation

Fine-tuning Model (Reproduce Experiment)

Before fine-tuning on a domain-specific dataset, I prefer to reproduce the experiment results first.

You can visit the official page or follow these instructions:

- Execute this script to download the dataset
- Download the pre-trained model (I selected the “BERT-Base, Uncased” model)
- Assign environment variables

    export BERT_BASE_DIR=/downloaded_model_path/bert
    export GLUE_DIR=/downloaded_data_path/glue
    export BERT_OUTPUT_DIR=/trained/model/bert/

- Execute the following command to kick off the fine-tuning

    python run_classifier.py \
      --task_name=MRPC \
      --do_train=true \
      --do_eval=true \
      --data_dir=$GLUE_DIR/MRPC \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --bert_config_file=$BERT_BASE_DIR/bert_config.json \
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
      --max_seq_length=128 \
      --train_batch_size=32 \
      --learning_rate=2e-5 \
      --num_train_epochs=3.0 \
      --output_dir=$BERT_OUTPUT_DIR

I used a 20-core CPU machine to reproduce it, and the fine-tuning took around one hour.
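The global_step value reported in the eval results can be related to the command-line settings with a little arithmetic. The sketch below assumes the MRPC training split has 3,668 examples and that the step count is derived as the truncated value of examples / batch size * epochs; neither is stated in this post, so treat both as assumptions.

```python
train_examples = 3668    # assumed size of the MRPC training split
train_batch_size = 32    # --train_batch_size
num_train_epochs = 3.0   # --num_train_epochs

# Assumed formula: truncate examples / batch_size * epochs to an integer.
num_train_steps = int(train_examples / train_batch_size * num_train_epochs)
print(num_train_steps)  # 343
```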

    INFO:tensorflow:***** Eval results *****
    INFO:tensorflow: eval_accuracy = 0.84313726
    INFO:tensorflow: eval_loss = 0.5097478
    INFO:tensorflow: global_step = 343
    INFO:tensorflow: loss = 0.5097478

Extract Fixed Vectors

Besides fine-tuning the pre-trained model on a specific dataset, we can also extract fixed vectors for downstream tasks, which is easier.

It is similar to what ELMo did.

You can visit the official page or follow these instructions:

- Generate a sample file in the current directory

    echo 'Who was Jim Henson? ||| Jim Henson was a puppeteer' > input.txt

- Execute the following command to extract token vectors

    python extract_features.py \
      --input_file=input.txt \
      --output_file=$BERT_OUTPUT_DIR/output.jsonl \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --bert_config_file=$BERT_BASE_DIR/bert_config.json \
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
      --layers=-1,-2,-3,-4 \
      --max_seq_length=128 \
      --batch_size=8

The output file contains the following objects:

- features
  - token: Token value (e.g. Who)
  - layers
    - index: Layer number (from -1 to -4) per token
    - values: Vector values. The default model dimension is 768

Parameter

It will be useful to understand more about how we can change the parameters.

Here are some useful parameter explanations:

- data_dir: Data directory
- task_name: Specifies which task to run. Task-specific processors are ready to use. Possible task_name values are “cola”, “mnli”, “mrpc” and “xnli”. You can implement your own data processor by extending the DataProcessor class.
- do_train: Include the training step.
- do_eval: Include the evaluation step.
- do_test: Include the test step.

At least one of do_train, do_eval or do_test has to be enabled.
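Extending DataProcessor, as mentioned under task_name, might look like the sketch below. To keep it runnable without TensorFlow, InputExample and DataProcessor are minimal stand-ins written for this example rather than imports from run_classifier.py, and ReviewProcessor, its file names and its label set are hypothetical.

```python
import csv
import os
import tempfile

class InputExample:
    """Stand-in for the example container used by run_classifier.py."""
    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label

class DataProcessor:
    """Stand-in for the base class that task processors extend."""
    def get_train_examples(self, data_dir):
        raise NotImplementedError()

    def get_dev_examples(self, data_dir):
        raise NotImplementedError()

    def get_labels(self):
        raise NotImplementedError()

class ReviewProcessor(DataProcessor):
    """Hypothetical processor for tab-separated 'label<TAB>text' files."""
    def get_train_examples(self, data_dir):
        return self._load(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        return self._load(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_labels(self):
        return ["negative", "positive"]

    def _load(self, path, split):
        examples = []
        with open(path, newline="", encoding="utf-8") as handle:
            for i, (label, text) in enumerate(csv.reader(handle, delimiter="\t")):
                examples.append(
                    InputExample(guid=f"{split}-{i}", text_a=text, label=label))
        return examples

# Usage with a throwaway data directory.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "train.tsv"), "w", encoding="utf-8") as handle:
        handle.write("positive\tI am learning NLP\nnegative\tThis is confusing\n")
    train = ReviewProcessor().get_train_examples(data_dir)

print([(e.guid, e.label, e.text_a) for e in train])
```

In the real script, a new processor would presumably also need to be registered under its own task_name before it can be selected from the command line.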

About Me

I am a Data Scientist in the Bay Area.

I focus on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics.

You can reach me on Medium Blog, LinkedIn or Github.

Reference

- Devlin J., Chang M. W., Lee K., Toutanova K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- BERT in Tensorflow (Original)
- BERT in PyTorch
- BERT in chainer
- word2vec, glove and fastText Story (Word Embeddings)
- Skip-Thoughts Story (Sentence Embeddings)
- ELMo Story
- NER Story
