In practice, data scientists tend to experiment with different batch sizes and settle on whatever seems to work, but that trial-and-error approach often yields training runs that are both computationally inefficient and expensive.

Gradient Noise Scale

To determine the right training batch size for a deep learning model, OpenAI introduced a simple statistic called the gradient noise scale (GNS). Conceptually, GNS quantifies the signal-to-noise ratio of the network's gradients and can be used to predict the approximate maximum useful batch size. Heuristically, GNS measures the variation in the data as seen by the model at a given stage of training. When the GNS is small, looking at a lot of data in parallel quickly becomes redundant; when it is large, the model can still learn a great deal from huge batches of data.

Another way to think about GNS is as a predictor of the shape of the compute/time tradeoff curve. In a deep learning project, technical feasibility is often determined by training time, while economic viability depends on compute cost; finding the right tradeoff between the two is key to building an effective model. At very small batch sizes, doubling the batch lets us train in half the time without using extra compute (you can run twice as many chips for half as long). At very large batch sizes, more parallelization no longer speeds up training. There is a "bend" in the curve between these two regimes, and the GNS predicts where that bend occurs.

Using GNS in practice is relatively simple. Suppose we are starting to train a specific deep learning model on a large dataset. We can parallelize training almost linearly until the batch size roughly equals the GNS. Beyond that point, there is a smooth but relatively rapid transition to a regime where further parallelism provides minimal benefit.
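To make the signal-to-noise intuition concrete, here is a minimal sketch of how the GNS can be estimated from gradient norms measured at two different batch sizes, using the unbiased estimators from OpenAI's large-batch training analysis. The simulated per-example gradients (`true_grad`, `noise_std`, `batch_grad`) are hypothetical stand-ins for what a real training loop would obtain from backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed "true" gradient G plus per-example noise.
# In a real training loop these quantities would come from backprop on
# batches of two different sizes; here we simulate them.
dim = 1000
true_grad = rng.normal(size=dim)   # the signal G
noise_std = 5.0                    # per-example gradient noise (assumed)

def batch_grad(batch_size):
    """Average of per-example gradients for one batch (simulated)."""
    noise = rng.normal(scale=noise_std, size=(batch_size, dim))
    return true_grad + noise.mean(axis=0)

def estimate_gns(b_small, b_big, n_batches=200):
    """Estimate the gradient noise scale B_noise = tr(Sigma) / |G|^2.

    Uses the fact that E[|G_B|^2] = |G|^2 + tr(Sigma)/B, so measuring
    the mean squared gradient norm at two batch sizes lets us solve
    for the signal |G|^2 and the noise tr(Sigma) separately.
    """
    g_small = np.mean([np.sum(batch_grad(b_small) ** 2)
                       for _ in range(n_batches)])
    g_big = np.mean([np.sum(batch_grad(b_big) ** 2)
                     for _ in range(n_batches)])
    # Unbiased estimates of |G|^2 (signal) and tr(Sigma) (noise).
    signal = (b_big * g_big - b_small * g_small) / (b_big - b_small)
    noise = (g_small - g_big) / (1.0 / b_small - 1.0 / b_big)
    return noise / signal

print(f"estimated GNS: {estimate_gns(32, 512):.1f}")
```

In this toy setup the estimate lands near the analytic value `dim * noise_std**2 / |G|^2`; batches much smaller than the GNS are dominated by noise, so averaging more examples in parallel still helps, while batches much larger than it mostly average already-redundant signal.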