In-place Operations in PyTorch

What are they and why avoid them

Alexandra Deis · Jul 10

Today's advanced deep neural networks have millions of trainable parameters (for example, see the comparison in this paper), and trying to train them on free GPUs like Kaggle or Google Colab often leads to running out of GPU memory.
There are several simple ways to reduce the GPU memory occupied by the model, for example:

- Change the architecture of the model, or use a model with fewer trainable parameters (for example, choose DenseNet-121 over DenseNet-169). This approach can affect the model's performance metrics.
- Reduce the batch size or manually set the number of data loader workers. In this case, the model takes longer to train.
Using in-place operations in neural networks may help to avoid the downsides of approaches mentioned above while saving some GPU memory.
However, it is not recommended to use in-place operations for several reasons.
In this article I would like to:

- Describe what in-place operations are and demonstrate how they can help to save GPU memory.
- Explain why we should avoid in-place operations, or use them with great caution.
In-place Operations

"In-place operation is an operation that directly changes the content of a given linear algebra, vector, matrices (Tensor) without making a copy." — the definition is taken from this Python tutorial.
According to the definition, in-place operations don’t make a copy of the input.
That is why they can help to reduce memory usage when operating with high-dimensional data.
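In PyTorch, in-place tensor methods are marked with a trailing underscore; a minimal illustration:

```python
import torch

x = torch.ones(3)
y = x.add(1)   # out-of-place: returns a new tensor, x is unchanged
x.add_(1)      # in-place (trailing underscore): modifies x directly
print(x)       # tensor([2., 2., 2.])
print(y is x)  # False: add() allocated separate storage
```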
I want to demonstrate how in-place operations help to consume less GPU memory.
To do this, I am going to measure the memory allocated for both the out-of-place ReLU and the in-place ReLU from PyTorch, with a simple function that reports the GPU memory allocated by a call. Calling it for the out-of-place ReLU, I receive output like this:

Allocated memory: 382.0
Allocated max memory: 382.0

Then I call the in-place ReLU in the same way, and the output I receive is:

Allocated memory: 0.0
Allocated max memory: 0.0

Looks like using in-place operations helps us to save some GPU memory.
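The original code snippets did not survive; a sketch of the measurement, with a helper name (measure_memory) of my own choosing, might look like this (it requires a CUDA device, and the exact numbers depend on the tensor size):

```python
import torch
import torch.nn as nn

def measure_memory(fn, device):
    # Report how much GPU memory (in MB) fn() leaves allocated, and the peak it reached
    torch.cuda.reset_peak_memory_stats(device)
    before = torch.cuda.memory_allocated(device)
    out = fn()  # keep the result alive so its memory is counted
    allocated = (torch.cuda.memory_allocated(device) - before) / 1024 ** 2
    peak = (torch.cuda.max_memory_allocated(device) - before) / 1024 ** 2
    print(f"Allocated memory: {allocated}")
    print(f"Allocated max memory: {peak}")
    return out

if torch.cuda.is_available():
    device = torch.device("cuda")
    x = torch.randn(10_000, 10_000, device=device)  # ~381.5 MB of float32
    # Out-of-place ReLU allocates a fresh tensor for its result...
    measure_memory(lambda: nn.functional.relu(x, inplace=False), device)
    # ...while the in-place version reuses x's storage, allocating nothing
    measure_memory(lambda: nn.functional.relu(x, inplace=True), device)
```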
However, we should be extremely cautious when using in-place operations, and double-check that they are correct. In the next section, I am going to tell you why.
Downsides of In-place Operations

The major downside of in-place operations is the fact that they might overwrite values required to compute gradients, which means breaking the training procedure of the model.
That is what the official PyTorch autograd documentation says:

"Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd's aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you're operating under heavy memory pressure, you might never need to use them."
There are two main reasons that limit the applicability of in-place operations:

1. In-place operations can potentially overwrite values required to compute gradients.
2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations require changing the creator of all inputs to the Function representing this operation.
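The first point can be reproduced with a small sketch of my own: sigmoid saves its output for the backward pass, and modifying that output in place trips autograd's version check at backward time.

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = torch.sigmoid(x)  # autograd saves y to compute sigmoid's gradient
y.mul_(2)             # in-place edit overwrites the saved value

try:
    y.sum().backward()
except RuntimeError as err:
    # autograd detects the modification through tensor version counters
    print(err)
```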
The other reason for being careful with in-place operations is that their implementation is exceptionally tricky. That is why I would recommend using PyTorch's standard in-place operations (like the in-place ReLU above) instead of implementing them manually.
Let's see an example with the SiLU (or Swish-1) activation function, defined as silu(x) = x * sigmoid(x). The out-of-place implementation of SiLU simply returns input * torch.sigmoid(input). Let's try to implement an in-place SiLU using the torch.sigmoid_ in-place function. The naive attempt implements the in-place SiLU incorrectly: we can make sure of that by merely comparing the values returned by both functions. The function silu_inplace_1 actually returns sigmoid(input) * sigmoid(input), because torch.sigmoid_ overwrites the input before the multiplication happens. A working in-place implementation of SiLU using torch.sigmoid_ has to keep a copy of the value it still needs. This small example demonstrates why we should be cautious and double-check when using in-place operations.
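The original snippets were lost in extraction; a reconstruction under the article's naming (silu_inplace_1 appears in the text; the other names are my guesses) could look like this:

```python
import torch

def silu(input):
    # Out-of-place SiLU: allocates a new tensor for the result
    return input * torch.sigmoid(input)

def silu_inplace_1(input):
    # Incorrect: torch.sigmoid_ overwrites `input` before the multiply,
    # so this computes sigmoid(input) * sigmoid(input)
    return torch.sigmoid_(input) * input

def silu_inplace_2(input):
    # Working variant: keep sigmoid(input) in a temporary,
    # then multiply into `input` in place
    result = torch.sigmoid(input)
    input *= result
    return input
```

Note that even the working variant still allocates one temporary for sigmoid(input); the saving is that the final product reuses the input's storage instead of allocating a second tensor.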
Conclusion

In this article, I:

- Described in-place operations and their purpose.
- Demonstrated how in-place operations help to consume less GPU memory.
- Described the significant downsides of in-place operations: one should be very careful about using them and double-check the results.