Sometimes I run into a problem:

OOM when allocating tensor with shape (1024, 100, 160)

Here 1024 is my batch size; I don't know what the other dimensions are. If I reduce the batch size or the number of neurons in the model, it runs fine.

Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?


Since my question might seem unclear, let me put it this way: I want the largest batch size possible for my model that will fit into my GPU memory and won't crash the program.


To whoever voted to close the question as too broad: How on earth is the question too broad? There is some algorithm which selects a portion of data to put in GPU memory. It is clearly imperfect, since the data sometimes exceeds the GPU memory. Asking how the algorithm works, in order to prevent random crashes, seems quite reasonable to me.

    You can estimate the largest batch size using:

    Max batch size = available GPU memory bytes / 4 / (size of tensors + trainable parameters)
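As a rough illustration, the estimate above can be written as a small Python function. This is only a sketch of the formula as stated: the function name and parameter names are made up for this example, the divisor 4 is assumed to be the number of bytes per 32-bit value, and "size of tensors" is assumed to mean the number of tensor elements needed per sample. Real frameworks allocate additional memory (gradients, optimizer state, workspace), so the result is an upper-bound guess rather than a guarantee.

```python
def estimate_max_batch_size(available_bytes, tensor_elements_per_sample,
                            trainable_params, bytes_per_value=4):
    """Rough estimate following the formula in the answer above.

    All parameter names are hypothetical; bytes_per_value=4 assumes
    32-bit values, matching the "/ 4" in the formula.
    """
    per_sample = tensor_elements_per_sample + trainable_params
    if per_sample <= 0:
        raise ValueError("tensor size plus parameter count must be positive")
    # available bytes / 4 / (size of tensors + trainable parameters)
    return available_bytes // bytes_per_value // per_sample


# Toy numbers: 4000 bytes / 4 / (10 + 90) = 10
print(estimate_max_batch_size(4000, 10, 90))
```

The trainable parameter count can be read from `model.summary()` in Keras, as noted in the comments below.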

    • How do I get the size of the tensors and the number of trainable parameters? Aren't you missing the model size in the equation? – Andrzej Oct 10 '17 at 0:11
    • @ilan interesting - could you point to some reference? – desertnaut Oct 10 '17 at 8:52
    • @gisek the model size is actually the no of trainable parameters, which in Keras you get with model.summary() – desertnaut Oct 10 '17 at 8:53
    • @desertnaut hehe, I didn't get that "no" stands for "number". Now it makes sense :) – Andrzej Oct 13 '17 at 17:56
    • @Melike Each layer has its tensor + one or more weight matrices (usually referred to as trainable parameters). For example: if you're feeding your network with 200x200 RGB images, then the size of your input tensor is [batch size] * 3 * 200 * 200 values (* 4 to get bytes if you use 32-bit floats) – ilan Jul 5 at 11:43
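The arithmetic in the last comment can be checked with a few lines of Python. This is a sketch only: the helper name is hypothetical, and it assumes 4 bytes per value (32-bit floats), consistent with the "/ 4" in the formula above.

```python
import math


def input_tensor_bytes(batch_size, sample_shape, bytes_per_value=4):
    """Bytes needed for one input batch (hypothetical helper).

    sample_shape is the per-sample shape, e.g. (3, 200, 200) for a
    200x200 RGB image; bytes_per_value=4 assumes 32-bit floats.
    """
    return batch_size * math.prod(sample_shape) * bytes_per_value


# 32 * 3 * 200 * 200 * 4 bytes for a batch of 32 RGB images
print(input_tensor_bytes(32, (3, 200, 200)))
```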

    From the recent Deep Learning book by Goodfellow et al., chapter 8:

    Minibatch sizes are generally driven by the following factors:

    • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
    • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
    • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
    • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
    • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

    Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".

    You might also want to consult several good posts here on Stack Exchange:

    Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has drawn some objections from other respectable researchers in the deep learning community.

    Hope this helps...

    UPDATE (Dec 2017): There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.

    • It does not really answer my question. I want the largest batch size possible for my model that will fit into my GPU memory. – Andrzej Oct 9 '17 at 23:00
    • Understood. In practice, especially if you use a GPU, the powers of 2 requirement is so limiting that, even if you get an 'optimal' size of, say, 800, you never use it; what you do is start with an n (power of 2) and, if you get an OOM, try n/2, then n/4, etc. (if not, you try 2*n) - see the 4th bullet above – desertnaut Oct 10 '17 at 9:05
    • Stepping the size down whenever an error occurs is a big nuisance when you're experimenting with hyperparameters and topologies. A generic formula would be great, even if the result had to be rounded to a power of 2. – Andrzej Oct 10 '17 at 9:14
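The halve-on-OOM, double-on-success procedure described in the comments above can be automated. The sketch below is one possible way to do it, not an established API: `largest_fitting_batch_size` and `try_batch` are hypothetical names, and `try_batch(n)` is assumed to run a single training step with batch size `n` and raise one of `oom_errors` when memory runs out. With TensorFlow you would likely pass `tf.errors.ResourceExhaustedError` there; whether the process recovers cleanly after an OOM depends on the framework and is not handled here.

```python
def largest_fitting_batch_size(try_batch, start=1024, max_size=65536,
                               oom_errors=(MemoryError,)):
    """Search powers of two for the largest batch size that runs.

    try_batch(n) is a caller-supplied function (an assumption of this
    sketch) that raises one of oom_errors if batch size n does not fit.
    start should itself be a power of two.
    """
    n = start
    shrunk = False
    # Halve until one step succeeds: n, n/2, n/4, ...
    while True:
        try:
            try_batch(n)
            break
        except oom_errors:
            shrunk = True
            n //= 2
            if n < 1:
                raise RuntimeError("even a batch size of 1 does not fit")
    if shrunk:
        # 2*n is already known to fail, so n is the answer.
        return n
    # The starting size fit, so keep doubling while it still fits.
    while n * 2 <= max_size:
        try:
            try_batch(n * 2)
        except oom_errors:
            break
        n *= 2
    return n


# Stand-in for a real training step: pretend anything above 300 runs out of memory.
def fake_step(n):
    if n > 300:
        raise MemoryError

print(largest_fitting_batch_size(fake_step))
```

Running the search once per model topology and caching the result avoids repeating the trial steps during a hyperparameter sweep.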

    I ran into a similar GPU memory error, which was solved by configuring the TensorFlow session as described here:

    see: google colaboratory `ResourceExhaustedError` with GPU

    • Unfortunately, it changes nothing for a large network :( – Andrzej Feb 5 at 0:46
    • Yes. In my case Colaboratory launches with 12GB, but with the option enabled it can grow to 52GB – michael Feb 5 at 1:04
