MoE’s power stems from the fact that each expert specializes in a different segment of the input space with a unique mapping x → y.
If we use the mapping x → x, each expert will specialize in a different segment of the input space with unique patterns in the input itself.
We’ll use VAEs as the experts.
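To make the idea a bit more concrete, here is a minimal sketch of how the manager's output could weight the experts' losses for a single input. It assumes the manager produces one logit per expert and that each expert's VAE loss has already been computed; the names `softmax`, `mixture_loss`, `manager_logits` and `expert_losses` are illustrative and the numbers are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_loss(manager_logits, expert_losses):
    """Gate-weighted sum of the per-expert losses for a single input x.

    Assumption: the manager's softmax output is used directly as the
    weight of each expert's loss.
    """
    gates = softmax(manager_logits)
    return sum(g * l for g, l in zip(gates, expert_losses))

# Hypothetical values: three experts, and the second one fits x best.
logits = [0.0, 2.0, 0.0]
losses = [5.0, 1.0, 4.0]
loss = mixture_loss(logits, losses)
```

With this weighting, the total loss drops when the manager puts more weight on the expert with the lowest loss for that input, which is what pushes the experts to specialize.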
Part of the VAE’s loss is the reconstruction loss, where the VAE tries to reconstruct the original input image x.

Figure: MoE architecture where the experts are implemented as VAEs

A cool byproduct of this architecture is that the manager can classify the digit found in an image using its output vector!

One thing we need to be careful about when training this model is that the manager could easily degenerate into outputting a constant vector, regardless of the input at hand.
This results in one VAE specialized in all digits, and nine VAEs specialized in nothing.
One way to mitigate it, which is described in the MoE paper, is to add a balancing term to the loss.
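As a rough sketch, one form such a balancing term can take is the squared coefficient of variation of the per-expert "importance", i.e. the manager's outputs summed over the batch. That specific form is an assumption here, and the names `importance`, `balancing_loss`, `weight` and `eps` are illustrative.

```python
def importance(gate_batch):
    """Per-expert importance: the manager's outputs summed over the batch."""
    n_experts = len(gate_batch[0])
    return [sum(row[i] for row in gate_batch) for i in range(n_experts)]

def balancing_loss(gate_batch, weight=1.0, eps=1e-10):
    """Squared coefficient of variation of the importance vector.

    It is zero when all experts receive the same total weight over the
    batch and grows as the manager favors a few experts.
    """
    imp = importance(gate_batch)
    mean = sum(imp) / len(imp)
    var = sum((v - mean) ** 2 for v in imp) / len(imp)
    return weight * var / (mean ** 2 + eps)

# Hypothetical manager outputs for a batch of two inputs and two experts.
balanced = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # constant output: one expert gets everything
specialized = [[1.0, 0.0], [0.0, 1.0]]  # confident per input, balanced over the batch
```

Here `balancing_loss(collapsed)` is large while both `balanced` and `specialized` score zero, so the penalty targets the degenerate constant output without discouraging confident routing of individual inputs.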
This term encourages the outputs of the manager over a batch of inputs to be balanced.

Enough talking, it’s training time!

Figure: Images generated by the experts. Each column belongs to a different expert.
In the last figure we see what each expert has learned.
After each epoch we used the experts to generate images from the distributions they specialized in.
The i’th column contains the images generated by the i’th expert.
We can see that some of the experts easily managed to specialize in a single digit.
Some got a bit confused by similar digits, such as the expert that specialized in both 3 and 5.
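The generation step described above can be sketched roughly as follows: draw a latent vector from a standard normal prior and pass it through each expert's decoder, so that column i of the grid holds samples from expert i. The decoders here are toy stand-ins rather than trained VAE decoders, and `sample_latent`, `generate_grid` and `make_toy_decoder` are hypothetical names.

```python
import random

def sample_latent(latent_dim, rng):
    """Draw z from a standard normal prior (assumed to be the VAE prior)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def generate_grid(decoders, latent_dim, n_rows, seed=0):
    """Build a grid where grid[r][i] is a sample decoded by expert i."""
    rng = random.Random(seed)
    grid = []
    for _ in range(n_rows):
        row = [decode(sample_latent(latent_dim, rng)) for decode in decoders]
        grid.append(row)
    return grid

def make_toy_decoder(offset):
    """Stand-in for a trained decoder: shifts z by a per-expert offset."""
    return lambda z: [offset + v for v in z]

decoders = [make_toy_decoder(float(i)) for i in range(3)]
grid = generate_grid(decoders, latent_dim=2, n_rows=4)
```

With real experts, each `decode` call would return an image, and plotting `grid` would give one column per expert.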
Figure: An expert specializing in 2

What else?

Using a simple model without a lot of tuning and tweaking, we got reasonable results.
Ideally, we would want each expert to specialize in exactly one digit, thus achieving perfect unsupervised classification via the output of the manager.
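To check how close the manager gets to this, one option is to assign each image to the expert with the largest manager output, map each expert to the digit it sees most often, and measure accuracy. This is only a sketch: the labels are used for evaluation, not for training, and `assign_expert`, `expert_to_digit` and `accuracy` are illustrative names with made-up data.

```python
from collections import Counter, defaultdict

def assign_expert(gates):
    """Index of the expert with the largest manager output."""
    return max(range(len(gates)), key=lambda i: gates[i])

def expert_to_digit(gate_batch, labels):
    """Map each expert to the most common label among its assigned inputs."""
    votes = defaultdict(Counter)
    for gates, label in zip(gate_batch, labels):
        votes[assign_expert(gates)][label] += 1
    return {expert: counts.most_common(1)[0][0] for expert, counts in votes.items()}

def accuracy(gate_batch, labels):
    """Fraction of inputs whose expert's majority digit matches the label."""
    mapping = expert_to_digit(gate_batch, labels)
    hits = sum(mapping[assign_expert(g)] == y for g, y in zip(gate_batch, labels))
    return hits / len(labels)

# Hypothetical manager outputs for four images and two experts.
gate_batch = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [3, 3, 5, 5]
acc = accuracy(gate_batch, labels)
```

In this toy batch the last image, a 5, is routed to the expert that mostly sees 3s, which mirrors the kind of 3/5 confusion seen in the figure.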
Another interesting experiment would be to turn each expert into a MoE of its own! That would allow us to learn hierarchical parameters by which the VAEs should specialize.
For instance, some of the digits have multiple ways to be drawn: 7 can be drawn with or without a strikethrough line.
This source of variation could be modeled by the MoE in the second level of the hierarchy.
But I’ll leave something for a future post…

Originally published by me at anotherdatum.