And What Is the Problem?

Patch-based CNNs are usually applied to single patches of an image, where each patch is classified separately.
This setting arises whenever the same CNN has to be executed many times on neighboring, overlapping patches of an image.
It includes feature-extraction-based tasks such as camera calibration, patch matching, optical flow estimation, and stereo matching.
In addition, there are patch-based applications not usually considered feature extraction, such as sliding-window object detection or recognition.
In all such patch-based tasks there can be a lot of redundancy between the computations of neighboring CNNs.
For example, look at the figure below.
On the left you can see a simple 1-dimensional CNN.
Starting from the bottom, each pixel contributes to only one result in the output layer without any redundancy.
In contrast, on the right, when this CNN is executed at every pixel position of an image to create dense features, many intermediate-layer results are shared between the overlapping networks.
The numbers in nodes state how often a node is shared.
The red connections show how the red node is shared.
Pooling with stride 2 halves the output resolution.
Thus, we need two pooling layers: the original one (blue) and one shifted by one pixel (green) to avoid halving the output resolution.
Fast Dense Feature Extraction

The main idea of this approach is, instead of executing our patch-based CNN Cp (which was trained on training patches P) separately for each patch in the image, to execute it efficiently on all patches P(x, y) in the input image I at once.
To keep notation consistent, let the input image I have width Iw and height Ih. We then define patches P(x, y) of width Pw and height Ph centered at each pixel position (x, y), x ∈ 0…Iw − 1, y ∈ 0…Ih − 1, of the input image I.
The output O(x, y) = Cp(P(x, y)) is a k-channel vector belonging to the (Ih, Iw, k)-dimensional output tensor O, which contains the results of Cp executed on all image patches P(x, y).
To do this, we can create a network CI that directly calculates O from I, while avoiding the redundancy that occurs when Cp is executed on each image patch independently.
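To make the setting concrete, here is a minimal sketch contrasting the naive per-patch loop with dense execution. The one-layer, pooling-free Cp below is a hypothetical stand-in for illustration; for such an ordinary network, CI is simply Cp applied to the (padded) full image, and both routes produce the same O:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical patch network C_p: a single conv + ReLU with no pooling,
# mapping a Ph x Pw patch to a k-channel descriptor (illustrative only).
Ph = Pw = 5
k = 8
Cp = nn.Sequential(nn.Conv2d(1, k, kernel_size=(Ph, Pw)), nn.ReLU())

def dense_naive(I):
    """Run C_p separately on every patch P(x, y) -- lots of redundancy."""
    Ih, Iw = I.shape[-2:]
    pad = F.pad(I, (Pw // 2, Pw // 2, Ph // 2, Ph // 2))
    O = torch.empty(Ih, Iw, k)
    for y in range(Ih):
        for x in range(Iw):
            O[y, x] = Cp(pad[..., y:y + Ph, x:x + Pw]).flatten()
    return O

def dense_fast(I):
    """For this pooling-free C_p, C_I is just C_p applied to the padded image."""
    pad = F.pad(I, (Pw // 2, Pw // 2, Ph // 2, Ph // 2))
    return Cp(pad)[0].permute(1, 2, 0)  # (Ih, Iw, k)
```

The loop re-evaluates every overlapping receptive field from scratch, while the dense version computes each intermediate value exactly once.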
The architectural differences between Cp and CI are shown in the image below.
Here, all pooling layers in the feature extractor are replaced with multi-pooling layers.

Architecture of Cp (left) and CI (right)

It is worth mentioning that CI gives the same result as executing the network Cp on every patch of the image I independently.
However, CI runs much faster as it avoids redundancy between overlapping patches.
Let us examine the steps necessary to get from Cp to CI for the two types of layers involved: ordinary layers (without pooling or striding) and abnormal layers (with pooling or striding).
Ordinary Layers

With no striding or pooling, the layers of Cp and CI are identical. This is because their output does not depend on the spatial position of the input, but only on the input values themselves.
Abnormal Layers (with pooling or striding)

In contrast to ordinary layers, striding and pooling layers must be handled explicitly.
The image below visualizes the main issue with pooling: the first patch P(x, y) requires a different 2 × 2 pooling (blue) than the second patch P(x + 1, y) (green), so the two cannot share pooling outputs.
However, the patch P(x + 2, y) can work with the original pooling (blue) again.
Overlapping positions of P(x, y) and P(x + 2, y) produce identical results and can thus be shared (bright yellow).
Generalizing this example: with s being the pooling/stride size and u and v being integers, the patches P(x, y) and P(x + su, y + sv) still share pooling outputs for the pixels common to both patches.
Patches P at different image positions (in red). Sharing between patches that use the blue pooling and those that use the green pooling is not possible.

Altogether, this creates s × s different pooling situations that have to be computed independently on the input I′ of our pooling layer, where I′ is the input to the l-th layer.
As an s × s pooling layer reduces the output size to (Iw/s, Ih/s) (with input size (Iw, Ih)), it is clear that s × s such outputs are required to still obtain an output O of spatial size (Iw, Ih).
The different pooling outputs are stacked along an extra output dimension, denoted M.
All of these pooling outputs are then treated as independent samples by subsequent layers (similar to a batch dimension).
The animation above gives a better intuition of the process: a pooling is performed for each grid offset, and the results are finally stacked along M.
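As a rough sketch of this idea (illustrative, not the repository's multiMaxPooling implementation), the s × s pooling situations can be computed by shifting the pooling grid origin and stacking the results along a new leading dimension:

```python
import torch
import torch.nn.functional as F

def multi_max_pool(x, s=2):
    """Sketch of an s x s multi-pooling layer: compute max pooling once
    for each of the s*s possible grid offsets (dy, dx) and stack the
    results along a new leading dimension M = s*s."""
    B, C, H, W = x.shape
    # Pad bottom/right with -inf so every shifted grid still yields
    # H//s x W//s complete windows.
    padded = F.pad(x, (0, s, 0, s), value=float('-inf'))
    outs = []
    for dy in range(s):
        for dx in range(s):
            view = padded[..., dy:dy + H, dx:dx + W]  # shifted grid origin
            outs.append(F.max_pool2d(view, kernel_size=s))
    # Subsequent layers treat M like an extra batch dimension.
    return torch.stack(outs, dim=0)  # (M, B, C, H // s, W // s)
```

Offset (0, 0) reproduces the ordinary pooling layer; the other s × s − 1 offsets cover the remaining pooling situations.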
Unwarping

With one multi-pooling layer, we get an output W with dimensions (M = s × s, Ih/s, Iw/s, k), which we want to unwarp to the final output O of dimensions (Ih, Iw, k).
The intuition behind the unwarping procedure is visualized in the image below for 2×2 pooling.
Here, all channels are interlaced together to generate the final output O.
On the left are the 2 × 2 = 4 output images from 2 × 2 multi-pooling; on the right, the final unwarped output O.
Direct unwarping is complex, especially with several pooling layers, which might be one reason why previous work avoided pooling layers.
However, if we look at the problem in terms of dimensions, it can be solved using only transpose and reshape operations.
Such operations are supported by most deep learning frameworks as layers.
I won’t go over the details of how the unwarping procedure works, as that is beyond the scope of this article.
Please refer to the paper for more details.
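Still, for intuition, here is a minimal sketch of the interlacing for a single s × s multi-pooling layer using only reshape and permute, assuming (for illustration) that the pooling offsets were stacked in the order m = dy · s + dx:

```python
import torch

def unwarp(Wm, s):
    """Sketch of unwarping one s x s multi-pooling layer with only
    reshape and permute, assuming stacking order m = dy * s + dx.
    Wm: (s*s, H//s, W//s, k)  ->  O: (H, W, k)."""
    M, Hs, Ws, k = Wm.shape
    out = Wm.reshape(s, s, Hs, Ws, k)      # split M into (dy, dx)
    out = out.permute(2, 0, 3, 1, 4)       # (H//s, dy, W//s, dx, k)
    return out.reshape(Hs * s, Ws * s, k)  # interlace rows, then columns
```

With several pooling layers the same pattern is applied repeatedly, which is why the full procedure is best left to the paper and the reference code.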
Experiments

The authors presented benchmark results comparing the improved network CI with the patch-based CNN Cp running on all patches of an image.
The experiments are performed on a GeForce GTX TITAN X.
As can be seen in the table below, the execution time of Cp scales (as expected) roughly linearly with the number of image pixels.
CI, on the other hand, barely takes more time for larger images, although its memory consumption increases nearly linearly.
If not enough memory is available, the input image can be split into parts, and each part can be processed individually.
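A minimal sketch of such tiling, with a single 'same'-padded convolution standing in for CI and a halo equal to half its receptive field (both assumptions made here for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for C_I: a single 'same'-padded convolution.
net = nn.Conv2d(1, 4, kernel_size=5, padding=2)

def run_in_strips(I, strip=16, halo=2):
    """Sketch: process the image in horizontal strips, each extended by a
    halo of receptive_field // 2 rows, then crop the strip outputs and
    concatenate. The halo must match the network's receptive field."""
    H = I.shape[-2]
    outs = []
    for y0 in range(0, H, strip):
        y1 = min(y0 + strip, H)
        a, b = max(0, y0 - halo), min(H, y1 + halo)
        part = net(I[..., a:b, :])                              # strip + halo
        outs.append(part[..., y0 - a : y0 - a + (y1 - y0), :])  # drop halo rows
    return torch.cat(outs, dim=-2)
```

Because each strip carries enough context rows, the concatenated result matches processing the whole image at once, while the peak memory use is bounded by the strip size.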
Inspecting the speedup column clearly shows that CI performs much faster, especially on larger images.
Speed benchmark for CI and Cp

Let’s Speed Up Our Patch-Based CNN

Here, I am going to explain how you can speed up any patch-based CNN of yours using my implementation of “Fast Dense Feature Extraction with CNNs that have Pooling or Striding Layers”.
The project structure is simple: there are two implementations, pytorch and tensorflow, and each contains the following:

- FDFE.py – implementation of all the approach’s layers and the pre- and post-process methods described in the paper
- BaseNet.py – an implementation of YOUR pre-trained CNN Cp that operates on a training patch P
- SlimNet.py – the implementation of CI
- sample_code.py – a test run

1. Implement your improved network — CI

In order to use your own pre-trained network that operates on patches, you need to implement your network in BaseNet.py accordingly:

- Duplicate the BaseNet.py model layers in their order, e.g. conv1 = list(base_net.modules())[change_this_index]
- For every MaxPool2d layer, place a multiMaxPooling layer instead, with the chosen stride value (sLn)
- Duplicate unwrapPool layers according to the number of multiMaxPooling layers in your model
- Do not remove the following layers — multiPoolPrepare, unwrapPrepare
2. Running the sample code over your improved network

Now you should run sample_code.py to make sure that the project works correctly.
The test generates a random input image I of size imH × imW and evaluates it on both Cp and CI.
The script then evaluates the differences between the two CNNs’ outputs and performs speed benchmarking.

There are two modes of operation for Cp:

- singlePatch mode – run Cp over a single pH × pW patch cropped from the input image I
- allPatches mode – run Cp over multiple patches at once; here, batch_size determines how many patches get evaluated at a time

Possible arguments — in sample_code.py there are initial parameters that can be adjusted, such as image height, image width, patch width, patch height, etc.
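For intuition, an allPatches-style batching can be sketched with unfold; this is an illustrative sketch, not the repository's sample_code.py:

```python
import torch
import torch.nn.functional as F

def all_patches(I, pH, pW, batch_size):
    """Sketch of an 'allPatches'-style mode: extract every pH x pW patch
    of I (centered at each pixel, zero-padded at the borders) with unfold
    and yield the patches in batches, ready to be fed to C_p."""
    padded = F.pad(I, (pW // 2, pW // 2, pH // 2, pH // 2))
    cols = F.unfold(padded, kernel_size=(pH, pW))    # (1, C*pH*pW, Ih*Iw)
    patches = cols.transpose(1, 2).reshape(-1, I.shape[1], pH, pW)
    for i in range(0, patches.shape[0], batch_size):
        yield patches[i:i + batch_size]
```

Even with batching, every patch is still evaluated from scratch, which is exactly the redundancy that CI removes.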
3. What Should I Expect to See?

The script outputs the following:

- The aggregated difference between the base_net (Cp) output and the slim_net (CI) output — there should not be any major difference between the two, as described above
- For Cp, the averaged evaluation time per patch
- For CI, the total evaluation time per frame, i.e. the entire input image

The expected output should look like:

Total time for C_P: 0.017114248275756836 sec
————————————————————
Averaged time for C_I per Patch without warm up: 0.0010887398617342114 sec
——- Comparison between a base_net over all patches output and slim_net ——-
aggregated difference percentage = 0.0000000000 %
maximal abs difference = 0.0000000000 at index i=0,j=0
————————————————————

And there you go! You’ve sped up your network significantly. In this example alone, we improved the running time by a factor of ~10.
Acknowledgments

A big thanks to the following individual for helping to discover and implement this method:

Arnon Kahani — a good friend and an excellent ML engineer

Conclusion

If you’re interested in the source code, it can be found in my Fast Dense Feature Extraction for CNNs GitHub repository.
As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn.
Till then, see you in the next post!