An introduction to high-dimensional hyper-parameter tuning

As unintuitive as it might seem, this idea (Random Search) is almost always better than Grid Search.

A little bit of intuition

Note that some of the hyper-parameters are more important than others.

The learning rate and the momentum factor, for example, are more worth tuning than all others.

Beyond those, however, it is hard to know which hyper-parameters play major roles in the optimization process.

In fact, I would argue that the importance of each parameter might change for different model architectures and datasets.

Suppose we are optimizing over two hyper-parameters — the learning rate and the regularization strength.

Also, suppose that only the learning rate matters for this problem.

In the case of Grid Search, we are going to run nine different experiments, but only try three candidates for the learning rate.

Image Credit: Random Search for Hyper-Parameter Optimization, James Bergstra, Yoshua Bengio.

Now, take a look at what happens if we sample the candidates uniformly at random.

In this scenario, we are actually exploring nine different values for each parameter.

If you are not yet convinced, suppose we are optimizing over three hyper-parameters.

For example, the learning rate, the regularization strength, and momentum.

Optimizing over 3 hyper-parameters using Grid Search.

With Grid Search, we would perform 125 training runs, yet explore only five different values of each parameter.

On the other hand, with Random Search, we would be exploring 125 different values of each.
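To make the counting concrete, here is a minimal sketch of both schemes in Python; the specific ranges for the three hyper-parameters are assumptions chosen for illustration, not values from the article.

```python
import itertools

import numpy as np

# Grid Search: 5 candidates per hyper-parameter -> 5**3 = 125 runs,
# yet only 5 distinct values of each parameter are ever tried.
lr_grid = np.logspace(-4, -1, num=5)           # assumed range
reg_grid = np.logspace(-5, -1, num=5)          # assumed range
momentum_grid = np.linspace(0.5, 0.99, num=5)  # assumed range
grid_candidates = list(itertools.product(lr_grid, reg_grid, momentum_grid))

# Random Search: the same 125 runs, but every run draws fresh values,
# so we end up exploring 125 distinct values of each hyper-parameter.
random_candidates = [
    (10 ** np.random.uniform(-4, -1),   # learning rate (log-scale, see below)
     10 ** np.random.uniform(-5, -1),   # regularization strength
     np.random.uniform(0.5, 0.99))      # momentum
    for _ in range(125)
]
```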

How to do it

If we want to try values for the learning rate, say within the range of 0.1 to 0.0001, we do it as sketched below.

Note that we are sampling values from a uniform distribution on a log scale.
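A minimal sketch of that sampling, assuming NumPy and 25 candidates (the count used in the plots further down):

```python
import numpy as np

# Draw the exponent uniformly from [-4, -1], then raise 10 to that power.
# The candidates end up spread uniformly in log-space across [0.0001, 0.1].
exponents = np.random.uniform(low=-4, high=-1, size=25)
learning_rates = 10 ** exponents
```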

You can think of the values -1 and -4 (for the learning rate) as the exponents: the candidates lie in the interval [10^-4, 10^-1].

If we do not use a log scale, the samples will not be evenly spread across the orders of magnitude within the given range.

In other words, you should not attempt to sample values the way the sketch below does: in that situation, most of the values would not be sampled from a ‘valid’ region.
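Concretely, this is the anti-pattern of sampling directly on a linear scale:

```python
import numpy as np

# Anti-pattern: uniform sampling on a linear scale over [0.0001, 0.1].
# Most candidates land in the upper part of the range.
bad_learning_rates = np.random.uniform(low=0.0001, high=0.1, size=25)
```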

Actually, considering the learning rate samples in this example, 72% of the values would fall in the interval [0.02, 0.1].

Moreover, 88% of the sampled values would come from the interval [0.01, 0.1].

That is, only 12% of the learning rate candidates, 3 values, would be sampled from the interval [0.0004, 0.01].

Do not do that.

In the graphic below, we sample 25 random values from the range [0.0004, 0.1].

The plot in the top left shows the original values.

In the top right, notice that 72% of the sampled values are in the interval [0.02, 0.1], and 88% lie within the range [0.01, 0.1].

The bottom plot shows the distribution of values: only 12% of them are in the interval [0.0004, 0.01].

To solve this problem, sample the values from a uniform distribution on a log scale.

A similar behavior would happen with the regularization parameter.
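The same recipe applies to it; a small sketch, with the range being an assumption of mine:

```python
import numpy as np

# Regularization strength sampled on a log scale (assumed range [1e-5, 1e-1]).
reg_strengths = 10 ** np.random.uniform(low=-5, high=-1, size=25)
```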

Also, note that, as with Grid Search, you need to consider the two cases we mentioned above.

If the best candidate falls very near the edge, your range might be off and should be shifted and re-sampled.

Also, after choosing the first good candidates, try re-sampling to a finer range of values.
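As a sketch of that coarse-to-fine step, suppose the best coarse candidates clustered around 3e-3 (a made-up outcome for illustration):

```python
import numpy as np

# Coarse-to-fine refinement: narrow the exponent range around the region
# where the best coarse candidates landed, then re-sample log-uniformly.
finer_learning_rates = 10 ** np.random.uniform(low=-3.0, high=-2.0, size=25)
```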

In conclusion, these are the key takeaways.

If you have more than two or three hyper-parameters to tune, prefer Random Search.

It is faster/easier to implement and converges faster than Grid Search.

Use an appropriate scale to pick your values.

Sample from a uniform distribution in log-space.

This will allow you to sample values that are evenly distributed across the parameter ranges.

Regardless of Random or Grid Search, pay attention to the candidates you choose.

Make sure the parameter ranges are properly set, and refine the best candidates if possible.

Thanks for reading! For more cool stuff on Deep Learning, check out some of my previous articles:

How to train your own FaceID ConvNet using TensorFlow Eager execution (medium.freecodecamp.org)

Machine Learning 101: An Intuitive Introduction to Gradient Descent (towardsdatascience.com)
