How to scale learning rate with batch size for DDP training? #3706
-
When using the LARS optimizer, the learning rate is usually scaled linearly with the batch size. However, I am using 2 GPUs with the DDP backend and a batch size of 512 on each GPU. Should my learning rate be:
-
Just to clarify: if you use batch_size=512 with the DDP backend, each GPU will train on a batch size of 512 in Lightning. Do you want 512 on each GPU, or 256 on each GPU?
-
Hi, I want each GPU to have a batch size of 512, so the two GPUs will have a total batch size of 1024. I don't know whether I should set the learning rate based on the total batch size or the per-GPU batch size.
-
In DDP the gradients are averaged and synced across all devices before the optimizer step.
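Conceptually, something along these lines happens after `backward()` (a simplified sketch of gradient averaging, not actual DDP source code, which buckets gradients and overlaps communication with the backward pass):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module, world_size: int) -> None:
    """Simplified stand-in for what DDP does: all-reduce each gradient and
    divide by the number of processes, so every replica applies the same
    averaged gradient in its optimizer step."""
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```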
-
As far as I know, the learning rate is scaled with the batch size so that the sample variance of the gradients stays approximately constant. Since DDP averages the gradients from all devices, I think the LR should be scaled in proportion to the effective batch size, namely batch_size * num_accumulated_batches * num_gpus * num_nodes. In this case, assuming batch_size=512, num_accumulated_batches=1, num_gpus=2 and num_nodes=1, the effective batch size is 1024, so the LR should be scaled by sqrt(2) compared to a single GPU with an effective batch size of 512.
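A minimal sketch of that calculation in plain Python (the base LR, the reference batch size, and the sqrt rule are assumptions for illustration, not a Lightning API):

```python
import math

# Reference configuration the base LR was tuned for (assumed values).
base_lr = 0.1          # LR known to work at the reference batch size
base_batch_size = 512  # reference (single-GPU) effective batch size

# Current run.
batch_size = 512             # per-GPU batch size passed to the DataLoader
num_accumulated_batches = 1  # Trainer(accumulate_grad_batches=...)
num_gpus = 2
num_nodes = 1

effective_batch_size = batch_size * num_accumulated_batches * num_gpus * num_nodes

# Square-root scaling as described above; linear scaling would use the
# plain ratio instead of its square root.
scaled_lr = base_lr * math.sqrt(effective_batch_size / base_batch_size)
print(effective_batch_size, scaled_lr)  # 1024, 0.1 * sqrt(2)
```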
-
Thank you all for your answers. I'll scale the LR with the total effective batch size.
-
This is mentioned very briefly in the DDP documentation; perhaps it should also be mentioned in the TPU section of the docs, since TPU uses DDP. This was my case, and I understood it as needing to scale the batch size to match the effective learning rate, but it was hard to find confirmation of this even with several threads on the subject in various places.
-
@itsikad Hi, I read your explanation and it makes sense to me, but when I ran an experiment with Lightning DDP I got:
Which seems to show that linear scaling, rather than scaling with sqrt(), is what actually lets me scale the number of GPUs while maintaining performance. I assume I'm missing something, but can you help me understand how to reconcile your theoretical explanation with my results? I think this paper also suggests linear scaling: https://arxiv.org/abs/1706.02677
-
Since there is still confusion on this topic, here is my understanding.
The confusion arises because keeping 'batch_size' the same in your script and going multi-GPU changes the effective batch size.

**1. How to scale the 'batch_size' parameter when using multiple GPUs to keep training unmodified**

PyTorch averages the loss across the minibatch by default (reduction='mean' is the default in loss functions). Say you train on images with batch_size=B on 1 GPU, and now use DDP with N GPUs, setting batch_size=B as well. If you want DDP to do the same as the 1-GPU case, you need to set batch_size to B/N. Then each GPU processes B/N elements and averages locally across those B/N elements, and DDP then averages across the N GPUs, so altogether you average across B/N * N = B elements.

Note: all of this is different from DataParallel, which does split the B images across GPUs so that each GPU gets B/N images to process, and then sums the gradients (sum, not average like DDP). There you would need to divide the learning rate by N.

Note: if your model uses BN you'll need to use Synchronized BN, as remarked in Section 2.3 of Goyal et al., so that the statistics are computed across B examples rather than only B/N.

The confusion arises from the fact that setting batch_size=B/N with DDP doesn't scale to large N, because then each GPU sees only a few images, and GPUs run most efficiently when processing many items in parallel, i.e. large batches.

**2. How to scale the learning rate when increasing the batch size**

If you had a single gigantic GPU you would also increase your batch_size to B*N. In practice we can't do this because we choose B as large as GPU memory allows in order to maximize GPU efficiency, but imagine we had infinite GPU memory and did that. The important thing to understand is that, contrary to before, we are now modifying the optimization process and you will NOT get the same results: there will be N times fewer iterations in an epoch, so N times fewer weight updates. One way is to scale lr_new = sqrt(N) * lr, which preserves the gradient variances.

Note: in this case, if your model uses BN you should disable SyncBN so that statistics are still calculated across B examples, as in the original case (again, Goyal et al. Section 2.3). Otherwise the linear LR scaling rule may not work because the loss landscape is different.
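A minimal sketch of the two cases in Lightning terms (the values are illustrative, the sqrt rule is the one mentioned above, and the Trainer argument names follow recent Lightning releases, so they may differ from the version discussed in this thread):

```python
import math
import pytorch_lightning as pl

B = 512        # batch size that worked on a single GPU
N = 2          # number of GPUs
base_lr = 0.1  # LR tuned for the single-GPU run (illustrative value)

# Case 1: keep the optimization identical to the 1-GPU run.
# Per-GPU batch size B/N, LR unchanged, SyncBN on so BN statistics
# are still computed over B samples.
batch_size, lr, sync_bn = B // N, base_lr, True

# Case 2: grow the effective batch size to B*N.
# Per-GPU batch size stays B, LR is scaled (sqrt rule shown here; Goyal
# et al. use linear scaling, i.e. base_lr * N), SyncBN off so BN
# statistics stay over B samples per GPU.
# batch_size, lr, sync_bn = B, base_lr * math.sqrt(N), False

# `batch_size` would go to the DataLoader and `lr` to the optimizer in
# configure_optimizers(); only the distributed settings are shown here.
trainer = pl.Trainer(accelerator="gpu", devices=N, strategy="ddp",
                     sync_batchnorm=sync_bn)
```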