Universal Language Model Fine-tuning for Text Classification
Edward Ma · Mar 27
Howard and Ruder propose a method for robust transfer learning on any NLP task: general-domain language model (LM) pretraining, target-task LM fine-tuning and target-task classifier fine-tuning.
The same 3-layer LSTM architecture, with identical hyperparameters apart from tuned dropout, yields a robust model that outperforms prior work on six downstream NLP tasks.
They name the method Universal Language Model Fine-tuning (ULMFiT).
This story discusses Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018) and covers the following:
Architecture
Experiment
Implementation
Architecture
As mentioned above, ULMFiT has three stages.
The first stage uses general-domain data to pretrain a LM. The second stage fine-tunes the LM on the target dataset, and the third stage fine-tunes a classifier on the target dataset.
The language model (LM) is the Averaged Stochastic Gradient Descent Weight-Dropped LSTM (AWD-LSTM). It does not use a Transformer; instead it is a regular LSTM regularized with several tuned dropout hyperparameters.
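The "weight-dropped" part of AWD-LSTM applies DropConnect to the recurrent hidden-to-hidden weight matrices, i.e. dropout on weights rather than on activations. A minimal framework-free sketch of that idea (the function name and toy matrix are my own illustration, not from the paper):

```python
import random

def weight_drop(weight_matrix, p, rng):
    """DropConnect on a weight matrix: zero each entry with probability p
    and rescale survivors by 1 / (1 - p), as applied to the recurrent
    weights in AWD-LSTM (illustrative sketch only)."""
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weight_matrix]

rng = random.Random(0)
U = [[0.5, -0.2], [0.1, 0.3]]       # toy hidden-to-hidden weights
U_dropped = weight_drop(U, p=0.5, rng=rng)
# each entry is either 0.0 or the original weight scaled by 2
```

In AWD-LSTM the mask is sampled once per forward pass over a sequence, so the same dropped weights are reused at every time step.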
General-domain LM pretraining
This stage is domain-agnostic, so any data can be used to train the model. In other words, the data volume can be very large; for example, the entire content of Wikipedia or Reddit. The goal is to capture general language features that can handle many kinds of downstream problems.
Many previous experiments have demonstrated significant improvements from transfer learning in NLP.
Target task LM fine-tuning
General-purpose vectors may not perform well on a specific problem directly because they are too generic, so fine-tuning is a necessary step. First, the model is fine-tuned on the language modeling objective using the target data. In theory this converges much faster than general-domain LM training, because the model only needs to learn the characteristics of the target task's single data source.
Two tricks are used to boost the performance of ULMFiT at this stage: discriminative fine-tuning and slanted triangular learning rates.
Discriminative fine-tuning uses a different learning rate for each layer. From their experiments, Howard and Ruder suggest choosing the learning rate of the last layer only, then setting the learning rate of each lower layer to the rate of the layer above divided by 2.6.
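As a quick sketch, the per-layer learning rates implied by this rule can be computed as follows (the helper name is my own; the 2.6 divisor is from the paper):

```python
def discriminative_lrs(base_lr, n_layers, factor=2.6):
    """Per-layer learning rates for discriminative fine-tuning:
    the top layer uses base_lr and every lower layer divides the rate
    of the layer above by `factor` (2.6 in Howard and Ruder, 2018).
    Returned list is ordered from the lowest layer to the top layer."""
    return [base_lr / factor ** (n_layers - 1 - l) for l in range(n_layers)]

lrs = discriminative_lrs(0.01, n_layers=4)
# lrs[-1] == 0.01, lrs[-2] == 0.01 / 2.6, and so on downwards
```

In PyTorch, for example, these rates would be handed to the optimizer as one parameter group per layer.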
Plain Stochastic Gradient Descent (SGD) updates the parameters as

θ_t = θ_{t−1} − η · ∇_θ J(θ)

where η is the learning rate and ∇_θ J(θ) is the gradient of the objective function. SGD with discriminative fine-tuning instead gives each layer its own learning rate:

θ_t^l = θ_{t−1}^l − η^l · ∇_{θ^l} J(θ)

where η^l is the learning rate of the l-th layer (Howard and Ruder, 2018).
Slanted triangular learning rates (STLR) is another dynamic learning rate approach: the rate increases linearly at the beginning of training and then decays linearly, forming a triangle. T is the total number of training iterations, cut_frac is the fraction of iterations during which the learning rate increases, cut is the iteration at which the schedule switches from increasing to decreasing, and p is the fraction of the increasing or decreasing phase completed at the current iteration t:

cut = ⌊T · cut_frac⌋
p = t / cut if t < cut, else 1 − (t − cut) / (cut · (1/cut_frac − 1))
η_t = η_max · (1 + p · (ratio − 1)) / ratio

where ratio specifies how much smaller the lowest learning rate is than the maximum η_max (Howard and Ruder, 2018).
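Putting these definitions together, a sketch of the STLR schedule (cut_frac = 0.1 and ratio = 32 are the paper's defaults; the function name and the eta_max value are my own):

```python
import math

def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate (Howard and Ruder, 2018):
    linear warm-up for the first cut_frac of iterations, then a longer
    linear decay; ratio bounds how far the rate falls below eta_max."""
    cut = math.floor(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio

schedule = [stlr(t, T=100) for t in range(100)]
# the rate peaks at eta_max at iteration cut = 10, then decays linearly
```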
Slanted triangular learning rate schedule used for ULMFiT (Howard and Ruder, 2018)
Target task classifier fine-tuning
The parameters of the classifier layers are learned from scratch.
Since a strong signal can appear anywhere in the sequence, not only at the last word, Howard and Ruder propose concatenating the hidden state at the last time step, h_T, with the max-pooled and mean-pooled representations of the hidden states of all time steps (Howard and Ruder, 2018). Besides this concat pooling, some tricks are applied at this stage to boost performance.
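Concat pooling itself is only a few lines; a framework-free sketch over a toy sequence of hidden-state vectors (the names are my own):

```python
def concat_pool(hidden_states):
    """Concat pooling (Howard and Ruder, 2018): concatenate the last
    hidden state h_T with element-wise max-pooling and mean-pooling
    over the hidden states of all time steps."""
    h_T = hidden_states[-1]
    max_pool = [max(col) for col in zip(*hidden_states)]
    mean_pool = [sum(col) / len(col) for col in zip(*hidden_states)]
    return h_T + max_pool + mean_pool   # length = 3 * hidden size

H = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]   # T = 3 steps, hidden size 2
hc = concat_pool(H)                         # 6-dimensional classifier input
```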
These tricks are gradual unfreezing, BPTT for Text Classification (BPT3C) and a bidirectional language model.
With gradual unfreezing, all layers except the last are frozen in the first epoch, and only the last layer is fine-tuned. In the next epoch, the next frozen layer from the top is unfrozen and all unfrozen layers are fine-tuned. One more layer is unfrozen in each subsequent epoch until all layers are being trained.
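The schedule can be written down in a few lines (the function name and the layer indexing, 0 = lowest, are my own):

```python
def unfrozen_layers(epoch, n_layers):
    """Gradual unfreezing: at epoch 0 only the last layer trains;
    each later epoch unfreezes one more layer, from the top down."""
    first_trainable = max(0, n_layers - 1 - epoch)
    return list(range(first_trainable, n_layers))

# with 4 layers: epoch 0 -> [3], epoch 1 -> [2, 3],
# epoch 2 -> [1, 2, 3], epoch 3 onwards -> [0, 1, 2, 3]
```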
BPTT for Text Classification (BPT3C) enables gradient propagation when the input sequence is large. A long input sequence (say, a document) is divided into fixed-length batches. The initial hidden state of each batch is the final hidden state of the previous batch. As mentioned before, the max-pooled and mean-pooled hidden states are tracked and back-propagated.
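The splitting step can be sketched as follows (the helper name and bptt length are my own; the hidden-state hand-off between chunks is noted in the comment):

```python
def bpt3c_batches(tokens, bptt):
    """Split a long document into fixed-length chunks for BPT3C.
    Chunks are processed in order; the model's final hidden state
    after one chunk initialises the hidden state for the next."""
    return [tokens[i:i + bptt] for i in range(0, len(tokens), bptt)]

doc = list(range(10))                  # a toy 10-token document
chunks = bpt3c_batches(doc, bptt=4)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```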
The bidirectional language model leverages both a forward and a backward LM to learn features from the inputs; a classifier is fine-tuned for each independently and their predictions are averaged.
Experiment
The results on the IMDb, TREC-6 and AG datasets show that ULMFiT gains the greatest benefit when the target dataset is very small. Howard and Ruder propose many tricks to boost performance; the figure below shows the results of different combinations of them.
Full: fine-tune the full model
discr: apply discriminative fine-tuning
Last: fine-tune the last layer only
Chain-thaw: train one layer at a time (Felbo et al., 2017)
Freez: gradual unfreezing
stlr: slanted triangular learning rates
cos: aggressive cosine annealing schedule
Comparison of results among the different tricks
(Howard and Ruder, 2018)
IMDb comparison results (Howard and Ruder, 2018)
AG, DBpedia, Yelp-bi and Yelp-full comparison results (Howard and Ruder, 2018)
Implementation
fast.ai provides sample code on GitHub.
Take Away
This paper proposes several novel fine-tuning techniques and demonstrates the capability of transfer learning in NLP.
About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in data science and artificial intelligence, especially NLP and platforms. Feel free to connect with me on LinkedIn or follow me on Medium or Github. I offer short advice on machine learning problems or data science platforms for a small fee.
Reference
J. Howard and S. Ruder. Universal Language Model Fine-tuning for Text Classification. 2018.