In this article I will explore some recent advances in few-shot learning through a deep dive into three cutting-edge papers:

Matching Networks: A differentiable nearest-neighbours classifier
Prototypical Networks: Learning prototypical representations
Model-agnostic Meta-Learning: Learning to fine-tune

I will start with a brief explanation of n-shot, k-way classification tasks, which are the de facto benchmark for few-shot learning.

I've reproduced the main results of these papers in a single GitHub repository.

The query sample is in the top-center.

Matching Networks

Vinyals et al.

While there is much previous research on few-shot approaches for deep learning, Matching Networks was the first to both train and test on n-shot, k-way tasks.

However, Matching Networks combine both embedding and classification to form an end-to-end differentiable nearest-neighbours classifier. Matching Networks first embed a high-dimensional sample into a low-dimensional space and then perform a generalised form of nearest-neighbours classification described by the equation below.

Equation (1) from Matching Networks: y^ = sum_i a(x^, x_i) y_i

This means that the prediction of the model, y^, is the weighted sum of the labels, y_i, of the support set, where the weights are given by a pairwise similarity function, a(x^, x_i), between the query example, x^, and a support set sample, x_i.

The embedding function they use for their few-shot image classification problems is a CNN, which is differentiable, making the attention and Matching Networks fully differentiable. This means it's straightforward to fit the whole model end-to-end with typical methods such as stochastic gradient descent.

Attention function used in the Matching Networks paper: a(x^, x_i) = exp(c(f(x^), g(x_i))) / sum_j exp(c(f(x^), g(x_j)))

In the above equation c represents the cosine similarity and the functions f and g are the embedding functions for the query and support set samples respectively.

Another interpretation of this equation is that the support set is a form of memory and upon seeing a new sample
the network generates a prediction by retrieving the labels of samples with similar content from this memory.

Interestingly, the possibility for the support set and query set embedding functions, f and g, to be different is left open in order to grant more flexibility to the model. The authors do exactly this and introduce the concept of full context embeddings, or FCE for short.

They consider the myopic nature of the embedding functions a weakness, in the sense that each element of the support set x_i gets embedded by g(x_i) in a fashion that is independent of the rest of the support set and the query sample. For example, if we are performing fine-grained classification between dog breeds, we should change the way the samples are embedded to increase the distinguishability of these samples.

In practice the authors use an LSTM to calculate the FCE of the support set and then use another LSTM with attention to modify the embedding of the query sample. This results in an appreciable performance boost at the cost of introducing more computation and a slightly unappealing arbitrary ordering of the support set.

All in all this is a very novel paper that develops the idea of a fully differentiable nearest-neighbours algorithm.

Prototypical Networks

Class prototypes c_i and query sample x.

In Prototypical Networks Snell et al. ...

Model-agnostic Meta-Learning

Gradient L_i are the losses for tasks, i, in a meta-batch and the starred theta_i are the optimal weights for each task.

MAML does not learn on batches of samples like most deep learning algorithms but on batches of tasks, also known as meta-batches.

Alpha is a learning rate hyperparameter.

After the parameter update we sample some more, unseen samples from the same task and calculate the loss on the task of the updated weights (also known as the fast model) of the meta-learner.
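
To make the update-then-evaluate step described above more concrete, here is a minimal sketch in plain Python. It is an illustration under my own assumptions rather than the MAML authors' implementation: the model is a hypothetical one-parameter regressor y = w * x, the loss is mean squared error, the inner loop takes a single gradient step, and all data values are made up. The names `inner_update`, `mse_loss` and `mse_grad` are purely illustrative. The outer meta-update, which differentiates the query loss with respect to the original weights, is deliberately left out.

```python
def mse_loss(w, data):
    """Mean squared error of the model y = w * x on a list of (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(w, data):
    """Analytic gradient of mse_loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def inner_update(w, support, alpha):
    """One gradient step on the task's support samples: w' = w - alpha * grad."""
    return w - alpha * mse_grad(w, support)


# Hypothetical task drawn from a meta-batch: the target function is y = 3x.
meta_w = 0.0   # weights of the meta-learner (assumed starting point)
alpha = 0.1    # inner-loop learning rate
support = [(1.0, 3.0), (2.0, 6.0)]   # samples used for the parameter update
query = [(3.0, 9.0), (4.0, 12.0)]    # further unseen samples from the same task

fast_w = inner_update(meta_w, support, alpha)   # weights of the "fast model"
loss_before = mse_loss(meta_w, query)
loss_after = mse_loss(fast_w, query)            # loss of the updated weights
print(fast_w, loss_before, loss_after)
```

In a full implementation the query loss of the fast model would be averaged over the tasks in the meta-batch and used to update the meta-learner's weights, which typically requires an automatic differentiation library.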
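
Returning to the Matching Networks prediction rule from earlier in the article, the weighted sum of support labels can also be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's code: I assume the samples have already been passed through the embedding functions f and g, that the embeddings are non-zero vectors, and that labels are integer class indices treated as one-hot vectors. The embeddings below are made-up numbers and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity c(u, v) between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def attention(query_emb, support_embs):
    """Softmax over cosine similarities between the query and each support sample."""
    sims = [cosine_similarity(query_emb, s) for s in support_embs]
    max_sim = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - max_sim) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def predict(query_emb, support_embs, support_labels, num_classes):
    """y^ = sum_i a(x^, x_i) * y_i, with each label y_i treated as one-hot."""
    weights = attention(query_emb, support_embs)
    probs = [0.0] * num_classes
    for weight, label in zip(weights, support_labels):
        probs[label] += weight
    return probs


# Hypothetical, already-embedded 1-shot, 3-way task.
support_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
support_labels = [0, 1, 2]
query_emb = [0.9, 0.1]

probs = predict(query_emb, support_embs, support_labels, num_classes=3)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

Because the attention weights come from a softmax they sum to one, so the output can be read as a probability distribution over the classes present in the support set.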