How to Improve Your Network Performance by Using Curriculum Learning

The idea of curriculum learning was already proposed by Elman in 1993 and shown to improve network performance in several tasks by Bengio in 2009.

Still, you hear surprisingly little about it these days, which is why I decided to write this post.

Curriculum learning describes a type of learning in which you first start out with only easy examples of a task and then gradually increase the task difficulty.

We humans have been learning according to this principle for decades, yet we don’t transfer it to neural networks and instead let them train on the whole data set with all its difficulties from the beginning.

To demonstrate the effects of curriculum learning, I will use a rather small convolutional neural network which tries to classify images into 10 categories.

The data set is called CIFAR-10 and can be found here: https://www.cs.toronto.edu/~kriz/cifar.html

The network consists of two convolutional layers followed by a fully connected layer with 512 neurons and a read-out layer with 10 neurons.

The network is too small to achieve a performance above 60%, but this will make it easier to see how curriculum learning can improve the performance and how you can squeeze some more accuracy out of even a small network by using this technique.
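For readers who want to follow along in code, here is a minimal sketch of such a network in TensorFlow/Keras. The filter counts, kernel sizes and training settings are my own assumptions; the post only fixes the overall structure (two convolutional layers, a 512-neuron fully connected layer, a 10-neuron read-out).

```python
# Minimal sketch of the small CNN described above, assuming TensorFlow/Keras.
# Filter counts, kernel sizes and pooling are assumptions, not taken from the post.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_small_cnn(num_classes=10):
    model = models.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),                # CIFAR-10 images are 32x32 RGB
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),             # fully connected layer with 512 neurons
        layers.Dense(num_classes, activation="softmax"),  # read-out layer with 10 neurons
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```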

Network performance over time on the CIFAR-10 data set. Training with all classes at once for 5 epochs. Red lines mark the beginning of a new epoch. In the bottom right corner you can see some examples out of the data set.

When we now look at the network’s performance on the individual classes, we can see that some of the classes seem to be harder to learn than others.

While the network achieves quite a good performance on such objects as airplanes, cars, ships or trucks, it struggles with cats, dogs, deer and birds.
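A per-class evaluation like the one behind this observation takes only a few lines of NumPy. This sketch assumes a trained Keras model called `model` like the one above and uses the standard CIFAR-10 class ordering.

```python
# Sketch: per-class accuracy on the CIFAR-10 test set, assuming a trained Keras `model`.
import numpy as np
import tensorflow as tf

(_, _), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_test = x_test.astype("float32") / 255.0
y_test = y_test.flatten()

class_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

pred = np.argmax(model.predict(x_test), axis=1)
for c, name in enumerate(class_names):
    mask = y_test == c                                   # all test images of this class
    print(f"{name:10s}: {np.mean(pred[mask] == y_test[mask]):.2%}")
```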

Next, we want to see how the network performance changes if one doesn’t train with all classes from the beginning but rather introduces them step by step.

In this experiment, I start out with six of the ten classes and then introduce one new class with each new epoch.

This means that after five epochs all ten classes are in the data set.

After that, the network keeps training for another 20 epochs until its performance plateaus.

More on the progressive learning procedure and network growing can be found in my previous post.
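In code, the schedule itself boils down to filtering the training data by the set of currently visible classes. Here is a minimal sketch under the Keras setup assumed above, where `class_order` is the order in which the classes are introduced; the batch size and other training details are my assumptions, not the post’s.

```python
# Sketch of the class-introduction schedule: start with six classes and add one new
# class at the start of each subsequent epoch, then keep training on all ten.
import numpy as np

def train_with_curriculum(model, x_train, y_train, class_order, total_epochs=25):
    """class_order: sequence of the 10 class labels in the order they are introduced."""
    y_flat = np.asarray(y_train).flatten()
    for epoch in range(total_epochs):
        # 6 classes in the first epoch, one more per epoch until all 10 are visible
        visible = class_order[: min(6 + epoch, len(class_order))]
        mask = np.isin(y_flat, visible)                  # keep only currently visible classes
        model.fit(x_train[mask], y_flat[mask], epochs=1, batch_size=64, verbose=0)
```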

When repeating this experiment many times, each time with a random order of the classes, one can observe some runs which perform especially well.

If you now look at these highest-performing network trainings and plot the class orders used in the runs against the respective class difficulties, you can observe a significant negative correlation of -0.27 between the two, F(1, 16.43), p < 0.001.

I define a class’s difficulty as the network’s performance on that class at the end of a normal training run (see the figure above).

One dot represents the difficulty of a class shown at the respective point in time during the best performing runs. Class difficulty = the accuracy on this class at the end of a normal network training -> the higher the accuracy, the lower the difficulty. Positions 0–5 are equal since the network starts out training on six classes.

The figure above shows how the runs with the best performing class orders tend to show the easy classes first and never show an easy class last.

All classes introduced in the last epoch have been classes on which the normal network struggled and which therefore seem harder to learn.
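Measuring this relationship amounts to correlating, for each class in each of the best runs, the position at which it was introduced with its difficulty. A sketch using SciPy’s Pearson correlation (the exact statistical test behind the reported F value may differ) could look like this; `positions` and `difficulties` are assumed to be collected from the best-performing runs.

```python
# Sketch: correlation between introduction position and class difficulty.
from scipy import stats

def order_difficulty_correlation(positions, difficulties):
    """positions[i]: epoch at which class i was introduced in a given run;
    difficulties[i]: accuracy on class i at the end of a normal training run."""
    r, p = stats.pearsonr(positions, difficulties)
    return r, p
```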

Taking these results one step further, one can compare the network’s performance when training with increasing difficulty to training with decreasing difficulty.

Test accuracy distribution over a hundred runs for training with increasing and decreasing class difficulty.

The results show a strongly significant difference between the two conditions, with the networks trained with increasing difficulty holding a lead of around 4% in accuracy.
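The two curricula can be derived directly from the per-class accuracies measured earlier. In this sketch (assuming the `train_with_curriculum` helper from above), sorting the classes by accuracy gives the easy-to-hard order and reversing it gives the hard-to-easy order.

```python
# Sketch: build the increasing- and decreasing-difficulty class orders from the
# per-class accuracies of a normally trained network (higher accuracy = easier).
import numpy as np

def make_curricula(accuracy_per_class):
    easy_to_hard = np.argsort(accuracy_per_class)[::-1]  # highest accuracy (easiest) first
    hard_to_easy = easy_to_hard[::-1]                    # hardest class first
    return easy_to_hard, hard_to_easy

# increasing difficulty: train_with_curriculum(model, x_train, y_train, easy_to_hard)
# decreasing difficulty: train_with_curriculum(model, x_train, y_train, hard_to_easy)
```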

Probably the most interesting part is comparing these performances with that of a normal network which was trained on all classes from the beginning.

Test accuracy distribution of 100 network trainings for continuous learning (increasing and decreasing difficulty) compared to a normal network training for an equal amount of epochs.

Even though the normal training has a slight advantage over the continuous learning, since the latter has fewer epochs to train on some of the classes because they are only introduced later on, the network trained with gradually increasing difficulty reaches significantly higher performance, F(1, 953.43), p < 0.001.
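For completeness, a comparison of the two accuracy distributions can be sketched with SciPy’s one-way ANOVA, which yields an F statistic and p-value analogous to the one reported above; the two arrays of final test accuracies are assumed to have been collected from the 100 runs of each condition.

```python
# Sketch: compare final test accuracies of curriculum runs vs. normal runs.
from scipy import stats

def compare_runs(curriculum_accuracies, normal_accuracies):
    f_stat, p_value = stats.f_oneway(curriculum_accuracies, normal_accuracies)
    return f_stat, p_value
```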

It seems as if learning the broad concept on a few easy examples and only later on refining the concept with more complex examples gives the continuously learning network a distinct advantage over a network which needs to grasp the whole concept at once.

When thinking about real life, this appears quite intuitive: one would not mix advanced calculus into a first grader’s math homework, but with neural networks this seems to be common practice.

Of course, it is an additional effort to determine individual class difficulties, but for reaching or exceeding benchmarks and deploying an optimal model this additional effort can easily be worth it.

Code: https://github.com/vkakerbeck/Progressively-Growing-Networks

References:

Bengio, Yoshua et al. (2009). “Curriculum learning”. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48. DOI: 10.1145/1553374.1553380.

Elman, Jeffrey L. (1993). “Learning and development in neural networks: the importance of starting small”. In: Cognition 48, pp. 71–99.

