Deciding optimal kernel size for CNN

Convolutional Neural Networks (CNNs) are neural networks that automatically extract useful features (without manual hand-tuning) from data points like images to solve a given task like image classification or object detection. And now that you understand their use on your datasets, you may start wondering: apart from tuning the various hyper-parameters of your network, how do I know the right kernel size for the network? Let's dig further!

Let us set some ground rules to stay on the same page throughout the discussion:

- We will look primarily at 2D convolutions on images. These concepts also apply to 1D and 3D convolutions, but may not carry over directly.
- A 2D convolution filter like 3×3 always has a third dimension in size, equal to the number of channels of the input image. For example, we apply a 3×3×1 convolution filter to a gray-scale image (which has one black-and-white channel), whereas we apply a 3×3×3 convolution filter to a colored image (with three channels: red, green and blue).
- We assume zero padding for the rest of the discussion.

In a convolution, a convolution filter slides over all the pixels of the image, taking their dot product at each position. We do this hoping that the linear combination of the pixels, weighted by the convolutional filter, extracts some kind of feature from the image. Two observations motivate this design:

- Most of the useful features in an image are local, so it makes sense to take a few local pixels at a time and apply convolutions to them.
- Most of these useful features may be found in more than one place in an image, so it makes sense to slide a single kernel all over the image, in the hope of extracting the same feature in different parts of the image with the same kernel.

An added benefit of using a small kernel instead of a fully connected network is weight sharing and a reduction in computational cost. Briefly: since we use the same kernel for different sets of pixels in an image, the same weights are shared across these pixel sets as we convolve over them, and because a kernel has far fewer weights than a fully connected layer, there are fewer weights to back-propagate through.
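To make the weight-sharing point concrete, here is a minimal sketch in PyTorch; the 32×32 RGB input and the 16 output feature maps are illustrative assumptions, not numbers from this article. It compares the parameter count of a single 3×3 convolution with that of a fully connected layer producing an output of the same size:

```python
import torch.nn as nn

# Illustrative sizes (assumed for this sketch): a 32x32 RGB image
# mapped to 16 feature maps of the same spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(in_features=3 * 32 * 32, out_features=16 * 32 * 32)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# The same 3x3x3 kernel is reused at every spatial position (weight
# sharing), so the convolution needs 16*3*3*3 + 16 = 448 parameters,
# while the fully connected layer needs 3072*16384 + 16384 = 50,348,032.
print(n_params(conv))  # 448
print(n_params(fc))    # 50348032
```

The five-orders-of-magnitude gap is exactly the weight-sharing and back-propagation saving described above.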
Now that we have the convolution filter size as one of the hyper-parameters to choose, a choice needs to be made between smaller and larger filter sizes. Let us quickly compare both to choose the optimal one:

[Figure: Comparing smaller and larger convolutional kernel sizes theoretically.]

Following this up with a concrete convolution example for a small (3×3) and a large (5×5) filter size:

[Figure: Comparing smaller and larger convolutional kernel sizes using a 3×3 and a 5×5 example.]

Based on the comparison above, we can conclude that smaller kernel sizes are, and should be, the popular choice over larger sizes.

You might also notice a preference for odd kernel sizes over even ones like 2×2 or 4×4, for two reasons (illustrated in the sketch after this list):

- For an odd-sized filter, all the previous-layer pixels sit symmetrically around the output pixel. An even-sized kernel breaks this symmetry, and we would have to account for the resulting distortions across the layers; even-sized filters are therefore mostly skipped for the sake of implementation simplicity.
- If you think of convolution as an interpolation from the given pixels to a center pixel, an even-sized filter has no center pixel to interpolate to.
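A minimal sketch of the centering argument in plain Python, assuming stride 1: keeping the output the same size as the input requires (k - 1)/2 pixels of padding on each side, which splits evenly only when k is odd.

```python
# For a k x k kernel with stride 1, a same-size output needs a total of
# k - 1 pixels of padding, i.e. (k - 1) / 2 on each side. The split is
# symmetric, and the kernel has a true center pixel, only for odd k.
for k in [1, 2, 3, 4, 5]:
    pad = (k - 1) / 2
    if pad == int(pad):
        print(f"k={k}: pad {int(pad)} per side -> symmetric, has a center pixel")
    else:
        print(f"k={k}: pad {pad} per side -> asymmetric, no center pixel")
```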
Therefore, in general, we would like to use smaller, odd-sized kernel filters. But 1×1 is eliminated from the list of candidate optimal filter sizes: the features it extracts are fine-grained and purely local, with no information from the neighboring pixels, so it is not really doing any useful feature extraction. Hence, 3×3 convolution filters work well in general and are often the popular choice!
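To put a rough number on that choice, here is the same illustrative setup as the earlier sketch (assumed channel counts, not figures from this article) comparing the two filter sizes discussed above:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Same illustrative setup as before: 3 input channels, 16 output maps.
small = nn.Conv2d(3, 16, kernel_size=3, padding=1)
large = nn.Conv2d(3, 16, kernel_size=5, padding=2)

print(n_params(small))  # 448  = 16*3*3*3 + 16
print(n_params(large))  # 1216 = 16*5*5*3 + 16
```

A 5×5 filter costs 25/9 ≈ 2.8 times the weights, and proportionally more multiply-accumulates per output pixel, which is the cost gap the comparison above is pointing at.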
Originally published at icecreamlabs.com on August 19, 2018.