Let’s assume that pre-training takes 400 days on one 1080Ti and work from there:

- Starting from pre-trained vectors / n-grams — maybe x10 faster;
- Not using a large softmax layer (even if it is tied to your embedding layer), but using a cosine loss or something inspired by it. By the way, these guys also start from FastText — x2 faster;
- A lighter model — x4 faster;
- Using an embedding bag layer that works well for the Russian language.

All in all, with all of these “optimizations” it seems feasible to pre-train / tune a transformer in a week or so. And it is real; the only problem is that the actual pre-trained model did not really seem to beat a model simply initialized with FastText.
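The softmax replacement above can be sketched roughly like this (a minimal illustration with made-up sizes, not our actual training code): instead of projecting onto a vocabulary-sized softmax, the model predicts a dense vector that is pulled towards the frozen embedding of the true token with a cosine loss.

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not the ones we actually used.
vocab_size, emb_dim, batch = 10_000, 300, 32

# Frozen target embedding matrix (e.g. pre-trained FastText vectors).
target_emb = nn.Embedding(vocab_size, emb_dim)
target_emb.weight.requires_grad_(False)

hidden = torch.randn(batch, emb_dim, requires_grad=True)  # model outputs
targets = torch.randint(0, vocab_size, (batch,))          # true token ids

# y = 1 means "maximise cosine similarity between the two inputs".
loss = nn.CosineEmbeddingLoss()(hidden, target_emb(targets),
                                torch.ones(batch))
loss.backward()
```

The gradient flows only into the model outputs, so the cost of the output layer no longer scales with vocabulary size.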
Pre-training experiments

* We used a 2-GPU setup for each model, but in the end we found out that the newer version of the embedding bag was roughly 25% slower, due to the large embedding bag size;
** Classification task from the BERT paper.

Other “failed” approaches we tested:

- All models trained from scratch converged much slower and plateaued quickly;
- All BPE-based models initialized with FastText converged much slower and plateaued quickly, around 65% sequential task accuracy;
- FastText + embedding freeze — minus 5 pp sequential task accuracy;
- L2 embedding loss;
- Cosine embedding loss.

Actually trying out the pre-trained model

This was by far the most disappointing part of this whole exercise.
As mentioned in the intro, no variant of the transformer (from scratch, pre-trained, or FastText-initialized) helped on our “easy” classification task on a complex domain (the FastText-initialized one was the best of them).
On the challenging SberSQUAD task, we had the following results:

- A FastText-initialized model trained with a high lr of 1e-3 reached about 37%-40% EM. Probably more could be achieved with LR decay. Remarkably, the model diverged frequently and seemed to “jump” on each restart;
- When we tried the pre-trained model with the same high lr of 1e-3, it trained much faster than the FastText one, but overfitted heavily;
- If we started with a lower lr, somewhere around 5e-4, then the pre-trained model also trained much faster than the FastText one, but overfitted around 30% EM.

I suppose that if we invested x10 the resources into actually tuning the hyper-parameters, we would achieve a higher result.
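The LR decay with restarts alluded to above can be sketched with PyTorch's built-in cosine annealing with warm restarts (a sketch with placeholder model and cycle length, not our training script): the lr decays along a cosine and "jumps" back to the base value at each restart.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = torch.nn.Linear(8, 8)  # stand-in for the transformer
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = CosineAnnealingWarmRestarts(opt, T_0=10)  # restart every 10 epochs

lrs = []
for epoch in range(20):
    # ... one training epoch would go here ...
    lrs.append(opt.param_groups[0]["lr"])
    sched.step()  # decay within the cycle, jump back at each restart
```

After 10 steps the lr is back at 1e-3, which is exactly the "jump" behaviour we saw in the loss curves.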
But you see — generative pre-training IS NOT A SILVER BULLET, especially for non-generative tasks. On any SANE task, conventional RNNs / CNNs / TCNs blow transformers out of the water.
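For readers unfamiliar with TCNs: the core building block is just a causal, dilated 1D convolution — outputs at time t see only inputs at times ≤ t. A minimal sketch (our own illustration, not code from any of the cited repos):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Left-padded dilated conv: output at t depends only on inputs <= t."""

    def __init__(self, ch_in, ch_out, kernel=3, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))  # pad on the left only
        return self.conv(x)

x = torch.randn(4, 16, 50)
y = CausalConv1d(16, 32, kernel=3, dilation=2)(x)  # same time length out
```

Stacking such blocks with exponentially growing dilations gives the large receptive field that lets TCNs compete with RNNs and transformers on sequence tasks.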
(Figures: top performance of the FastText-initialized transformer; some comparisons; low learning rate, pre-trained vs. FastText.)

Embedding bag code

Just use our code, stick it here and add water.
No, we are not stupid, and we use version control.
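If you do not want to dig through our code, the core idea is just PyTorch's `nn.EmbeddingBag` (a minimal sketch, not our exact code): each word is a bag of sub-word ids (BPE pieces, morphemes, char n-grams), and the layer averages their vectors into one word vector in a single fused call.

```python
import torch
import torch.nn as nn

# Illustrative sizes; a real sub-word vocabulary would be larger.
num_subwords, emb_dim = 20_000, 300
bag = nn.EmbeddingBag(num_subwords, emb_dim, mode="mean")

# Two words, flattened: word 1 = pieces [3, 41, 7], word 2 = [15, 2].
ids = torch.tensor([3, 41, 7, 15, 2])
offsets = torch.tensor([0, 3])  # start index of each word in `ids`

word_vectors = bag(ids, offsets)  # -> shape (2, 300), one row per word
```

This is the same trick FastText uses for char n-grams, which is why initializing the bag table from FastText vectors is so natural for a morphologically rich language like Russian.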
Improvements, or how to make our idea mainstream

People from OpenAI, Google and FAIR, if you are reading this, you can do the following:

- Solve the attention problem within the embedding bag layer;
- Add more compute to train a larger transformer with larger embedding bags;
- Test such generative pre-training on other benchmarks for morphologically rich languages, if you have them;
- Invest time and effort into proper sub-word splitting techniques and pass bags corresponding to different kinds of sub-words separately.

References

Popular tools:

- Temporal convolutional network;
- Popular BPE implementation;
- Auto-Encoding Dictionary Definitions into Consistent Word Embeddings;
- PyTorch BERT by Huggingface;
- Improved English to Russian Translation by Neural Suffix Prediction;
- BERT pre-training.

Corpora / datasets / benchmarks in Russian:

- Russian SQUAD and sentiment analysis datasets;
- Mining for a large web corpus in Russian;
- My posts on parsing Wikipedia and parsing Common Crawl;
- Prepared and deduplicated Common Crawl texts;
- Downsides of using Common Crawl to train sentence encoders;
- DeepPavlov Russian SQUAD;
- FastText pre-trained on the largest corpus in Russian.

Simple sentence embedding baselines: http://nlp.net/pdf?id=SyK00v5xx

Word embeddings explained: http://www.org/2016/02/14/word-embeddings-2/

Original word embedding papers:

- Word2Vec — Distributed Representations of Words and Phrases and their Compositionality;
- FastText — Enriching Word Vectors with Subword Information.

Illustrated state-of-the-art NLP models and approaches:

- Attention;
- Illustrated transformer;
- Annotated transformer;
- Annotated encoder-decoder;
- Plain self-attention in PyTorch;
- A couple of notes on TCNs and self-attention;
- Training NMT models several times faster w/o a large softmax layer?

Other links:

- Russian word parts;
- CFT 2018 competition.

Originally published at spark-in.me on March 1, 2019.