This is because they come from a distribution different than that of the general population.
They have specific characteristics — maybe they performed poorly online, which caused Taboola’s recommender system to recommend them less, and in turn — they became rare in the dataset.
So why does this distribution difference matter? If we learn the OOV embedding using this special distribution, it won’t generalize to the general population.
Think about it this way — every item was a new item at some point.
At that point, it was injected with the OOV token.
So the OOV embedding should perform well for all possible items.
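This fallback can be sketched as a simple embedding lookup with a dedicated OOV row (a minimal illustration — the vocabulary, `OOV_ID`, and the 8-dimensional table are assumptions, not Taboola’s actual code):

```python
import numpy as np

# Illustrative vocabulary; in practice it's built from the training data.
VOCAB = {"item_a": 0, "item_b": 1, "item_c": 2}
OOV_ID = len(VOCAB)  # one extra row reserved for the shared OOV embedding

# Embedding table with a dedicated row for the OOV token (8 dims, arbitrary).
embeddings = np.random.randn(len(VOCAB) + 1, 8)

def embed(item):
    # Any item unseen at training time falls back to the single OOV row.
    return embeddings[VOCAB.get(item, OOV_ID)]
```

Every new item hits the same OOV row, which is why that one embedding has to work for all possible future items.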
Randomness is the data scientist’s best friend

In order to learn the OOV embedding using the general population, we can inject the OOV token into a random set of examples from the dataset before we start the training process.
But how many examples will suffice? The more we sample, the better the OOV embedding will be.
But at the same time, the model will be exposed to fewer non-OOV values, so their performance will degrade.
How can we use lots of examples to train the OOV embedding while at the same time using those same examples to train the non-OOV embeddings? Instead of randomly injecting the OOV token before training starts, we chose the following approach: in each epoch, the model first trains using all of the available values (the OOV token isn’t injected).
At the end of the epoch we sample a random set of examples, inject the OOV token, and train the model once again.
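The epoch-end injection step could be sketched like this (a minimal illustration; `OOV_ID`, `sample_frac`, and the commented driver loop are assumptions, not Taboola’s production code):

```python
import random

OOV_ID = -1  # illustrative id for the shared OOV token

def epoch_end_oov_batch(dataset, sample_frac=0.1, rng=None):
    """Build the extra OOV training pass: sample a random subset of
    (item_id, label) examples and replace their item id with the OOV token.
    `sample_frac` is a hypothetical knob, not a value from the post."""
    rng = rng or random.Random(0)
    k = int(sample_frac * len(dataset))
    sampled = rng.sample(dataset, k)
    return [(OOV_ID, label) for (_item_id, label) in sampled]

# Per-epoch schedule (sketch):
# for epoch in range(num_epochs):
#     model.train(dataset)                       # phase 1: real item ids only
#     model.train(epoch_end_oov_batch(dataset))  # phase 2: OOV-injected sample
```

Because the OOV batch is sampled fresh from the whole dataset each epoch, the OOV embedding is learned from the general population rather than only from naturally rare items.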
This way, we enjoy the best of both worlds!

To evaluate the new approach, we injected the OOV token into all of the examples and evaluated our offline metric (MSE).
It improved by 15% compared to randomly injecting the OOV token before training starts.
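That evaluation — replacing every item with the OOV token and measuring MSE — might look roughly like this (a sketch with a hypothetical `model_predict` interface; the post doesn’t show the actual evaluation code):

```python
OOV_ID = -1  # illustrative id for the shared OOV token

def oov_mse(model_predict, dataset):
    """MSE of the model when every example is forced through the OOV
    embedding, isolating how well the OOV row generalizes."""
    errors = [(model_predict(OOV_ID) - label) ** 2 for _item, label in dataset]
    return sum(errors) / len(errors)
```

A lower value here means the single OOV embedding does a better job standing in for items the model has never seen.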
Final thoughts

Our model had been used in production for a long time before we thought of the new approach.
It would have been easy to miss this potential performance gain, since the model performed well overall.
It just stresses the fact that you always have to look for the unexpected!

Originally published by me at engineering.