A couple of points to keep in mind:

- We generally use a pooling layer to shrink the height and width of the image.
- To reduce the number of channels of an image, we convolve it with a 1 X 1 filter (which reduces the computation cost as well).

The Motivation Behind Inception Networks

While designing a convolutional neural network, we have to decide on the filter size. Should it be a 1 X 1 filter, a 3 X 3 filter, or a 5 X 5 filter? Inception makes all of those choices for us. Let's see how it works.

Suppose we have a 28 X 28 X 192 input volume. Instead of choosing a single filter size, or choosing between a convolution layer and a pooling layer, inception uses all of them and stacks all the outputs together.

A good question to ask here: why are we using all these filters instead of just a single filter size, say 5 X 5? Let's look at how many multiplications would arise if we used only a 5 X 5 filter (with 32 output channels) directly on our input:

Number of multiplies = 28 * 28 * 32 * 5 * 5 * 192 = 120 million!

Can you imagine how expensive performing all of these will be? Now, let's look at the computations if we first apply a 1 X 1 convolution (shrinking the volume to 16 channels) and then the 5 X 5 convolution:

Number of multiplies for the first convolution = 28 * 28 * 16 * 1 * 1 * 192 = 2.4 million
Number of multiplies for the second convolution = 28 * 28 * 32 * 5 * 5 * 16 = 10 million
Total number of multiplies = 12.4 million

A significant reduction, roughly 10x. This 1 X 1 "bottleneck" layer is the key idea behind inception.

Inception Networks

This is how an inception block looks: we run the 1 X 1, 3 X 3, and 5 X 5 convolutions (and a pooling branch) in parallel, then stack all the outputs together along the channel dimension. Also, we apply a 1 X 1 convolution before the 3 X 3 and 5 X 5 convolutions in order to reduce the computations.
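The multiply counts above can be checked with a few lines of Python. The shapes (28 X 28 spatial output, 192 input channels, 32 output channels, a 16-channel bottleneck) all come from the text; the helper name conv_multiplies is just ours for this sketch.

```python
# Count multiplications for a convolution: one multiply per output
# element per kernel weight.
def conv_multiplies(out_h, out_w, out_c, k, in_c):
    return out_h * out_w * out_c * k * k * in_c

# 5 X 5 convolution applied directly to the 28 X 28 X 192 input
direct = conv_multiplies(28, 28, 32, 5, 192)

# 1 X 1 bottleneck down to 16 channels, then 5 X 5 on the reduced volume
bottleneck = conv_multiplies(28, 28, 16, 1, 192)
then_5x5 = conv_multiplies(28, 28, 32, 5, 16)

print(direct)                 # 120,422,400  (~120 million)
print(bottleneck)             # 2,408,448    (~2.4 million)
print(bottleneck + then_5x5)  # 12,443,648   (~12.4 million)
```

Note that the bottleneck pays for itself: the cheap 1 X 1 stage costs 2.4 million multiplies but cuts the expensive 5 X 5 stage from 120 million down to 10 million.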
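The "stack all the outputs" step can be sketched with numpy shapes: every branch preserves the 28 X 28 spatial size (via padding), so the branch outputs can be concatenated along the channel axis. The per-branch channel counts below (64, 128, 32, 32) are illustrative, not from the text.

```python
import numpy as np

# Placeholder branch outputs of an inception block on a 28 X 28 X 192
# input; zeros stand in for the actual convolution/pooling results.
branch_1x1  = np.zeros((28, 28, 64))    # 1 X 1 convolution branch
branch_3x3  = np.zeros((28, 28, 128))   # 1 X 1 bottleneck, then 3 X 3
branch_5x5  = np.zeros((28, 28, 32))    # 1 X 1 bottleneck, then 5 X 5
branch_pool = np.zeros((28, 28, 32))    # pooling, then 1 X 1

# Stacking the outputs = concatenation along the channel dimension
out = np.concatenate([branch_1x1, branch_3x3, branch_5x5, branch_pool],
                     axis=-1)
print(out.shape)  # (28, 28, 256)
```

Because only the channel counts add up, each branch is free to use a different filter size, which is exactly what lets inception avoid committing to one.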