How did we evolve to become noise-robust agents?

Invariant Representations

Perhaps the answer lies in how we represent data.
One possible path to noise-robustness is noise-invariance: an agent or model should be able to discard noise that is not useful for the task, such that there is no difference between its internal representations of a clean and a noisy signal.
Specifically, a clean example, x, and its superficially perturbed counterparts, v(x), shouldn’t merely map to the same class — they should map to the same representation.
Understanding that vectors have both a magnitude and a direction, one possible proposition is to jointly supervise on the L2 and cosine distances between the internal hidden-layer representations of x and v(x).
We’ll call this the Invariant Representation Learning (IRL) loss.
The full IRL loss penalizes the classification error for the clean sample and the noised sample as well as the distance between their representations.
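As a concrete sketch (the function names, the per-layer L2-plus-cosine form, and the penalty weight lam are illustrative assumptions, not the paper's exact implementation), the per-example loss might look like:

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two representation vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity between two representation vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def irl_loss(ce_clean, ce_noisy, h_clean, h_noisy, lam=1.0):
    # ce_clean, ce_noisy: classification (cross-entropy) losses for x and v(x)
    # h_clean, h_noisy:   lists of per-layer representations for x and v(x)
    # lam:                hypothetical weight on the invariance penalty
    penalty = sum(l2_distance(hc, hn) + cosine_distance(hc, hn)
                  for hc, hn in zip(h_clean, h_noisy))
    return ce_clean + ce_noisy + lam * penalty
```

When the clean and noisy representations coincide, the penalty vanishes and the loss reduces to the two classification terms; any gap in magnitude or direction adds to it.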
The Invariant Representation Learning algorithm is thus:

1. At each training iteration, for each training example, sample a noisy counterpart. Noise sampling can occur online; for example, we may choose to apply stochastic lighting distortions to an image or add some background noise to an audio file.
2. Apply a penalty term to coerce matched representations at each layer (above some chosen layer).
3. Train jointly with the original task.
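The training step described above can be sketched as follows; the forward/augment/update interface is an illustrative assumption, not the paper's code:

```python
import math

def rep_distance(a, b):
    # L2 distance plus cosine distance between two representation vectors
    l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return l2 + 1.0 - dot / (na * nb)

def irl_train_step(forward, augment, update, batch, lam=1.0):
    # forward(x, y) -> (task_loss, list of per-layer representations)
    # augment(x)    -> noisy counterpart v(x), sampled online
    # update(loss)  -> one optimizer step on the joint loss
    for x, y in batch:
        v_x = augment(x)                         # sample a noisy counterpart
        ce_c, h_c = forward(x, y)                # clean pass
        ce_n, h_n = forward(v_x, y)              # noisy pass
        penalty = sum(rep_distance(a, b) for a, b in zip(h_c, h_n))
        update(ce_c + ce_n + lam * penalty)      # train jointly with the task
```

Because the noisy counterpart is drawn fresh inside the loop, each epoch sees a different perturbation of the same example.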
Applications to Speech Recognition

We take a sequence-to-sequence model with attention and train it on the LibriSpeech corpus (a speech recognition task on audio books).
Supervising on the IRL loss from the encoder output layer onwards, we find that our method clearly outperforms existing baselines (as well as vanilla data augmentation) on the clean LibriSpeech dataset.
More surprisingly, our model is able to generalize well to out-of-domain noise types (noises that we’ve never seen during training).
In search of a qualitative explanation, we aggregate and plot the average distance between all clean examples, x, and their sampled noisy counterparts, v(x).
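As a sketch, the aggregation might look like the following, where layer_reps is an assumed hook that returns each layer's representation for a given input:

```python
import math

def mean_layer_distances(pairs, layer_reps):
    # pairs:      iterable of (x, v_x) clean/noisy example pairs
    # layer_reps: assumed hook mapping an input to its per-layer vectors
    totals, count = None, 0
    for x, v_x in pairs:
        dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(hc, hn)))
                 for hc, hn in zip(layer_reps(x), layer_reps(v_x))]
        totals = dists if totals is None else [t + d for t, d in zip(totals, dists)]
        count += 1
    # one average clean-vs-noisy distance per layer
    return [t / count for t in totals]
```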
The result is intriguing.
We find that there is a large drop in the representation distances for each layer once we apply data augmentation (with noisy examples) on top of normal training.
Other methods such as logit pairing and adversarially training against a zero-one noise discriminator seem to lower representation distances even more.
Our algorithm yields the lowest representation distances between x and v(x), and keeps them low across layers, inhibiting divergence at later layers.
Conclusions

In a world full of noise, it becomes necessary that our algorithms are noise-robust.
IRL presents a simple-to-implement, domain-agnostic method for training robust networks.
Notably, IRL improves accuracy against both clean and out-of-domain noisy data without any impact on inference throughput.
For more information, please consult the original paper, presented at IEEE Spoken Language Technologies 2018.