I think I found a slight performance issue with Hivemind. A call to `opt.step()` before the target batch size (TBS) is reached, which only accumulates gradients, is noticeably slower than native PyTorch gradient accumulation. The slowdown scales with the model's parameter count, so I presume it is not DHT-related.
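For context, here is a minimal sketch of the two accumulation patterns being compared (`model`, `loader`, `criterion`, `accum_steps`, `inner_opt`, and `opt` are placeholders, not the actual benchmark code; the hivemind loop follows the library's documented pattern of calling `opt.step()` after every micro-batch):

```python
# Baseline: native PyTorch gradient accumulation. Gradients simply add up in
# param.grad across several backward() calls; the inner optimizer steps once
# per effective batch.
for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps
    loss.backward()                      # accumulation is essentially free here
    if (i + 1) % accum_steps == 0:
        inner_opt.step()
        inner_opt.zero_grad()

# Hivemind: opt.step() is called after every micro-batch; hivemind.Optimizer
# accumulates gradients internally until target_batch_size samples have been
# collected and only then performs the actual (averaged) update.
for x, y in loader:
    opt.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()                           # pre-TBS calls only accumulate
```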
Here's the experimental "proof". The single-GPU experiment is a baseline without hivemind; the 2-, 3-, 4-, and 8-GPU runs use hivemind. The first figure shows the average `backward_s` timing, where no real difference is visible (in the 1-GPU experiment this is the call that performs gradient accumulation; for the 2-8 GPU runs, accumulation happens inside `opt.step()`).
This means that gradient accumulation is essentially free in the baseline PyTorch implementation.
The second figure shows the no-sync `opt.step()` call timings for the 2-8 GPU runs. The slowdown grows with model size, suggesting something that depends on the parameter count. Maybe a GPU-to-CPU memory copy or an on-GPU copy is happening?
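One way to check that hypothesis (not part of the original measurements) would be to wrap a single pre-TBS step in the PyTorch profiler and look for `Memcpy DtoH`/`HtoD` entries or large element-wise kernels; `model`, `opt`, `criterion`, `x`, and `y` are placeholders here:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one micro-batch plus the no-sync opt.step() to see where time goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()  # pre-TBS step: should only accumulate gradients

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```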
I also compared the actual throughput impact of this slower no-sync `opt.step()`: at worst it reaches only 48% (ConvNextLarge) and at best 78% (ResNet152) of the baseline performance. The comparison is between the normalized local hivemind throughput (i.e., without averaging) and the baseline 1-GPU throughput.
Furthermore, this is independent of the TBS: the following figure shows the same no-sync `opt.step()` timings (in this case from 2-GPU runs):
I think the relevant code is here: https://github.com/learning-at-home/hivemind/blob/master/hivemind/optim/grad_averager.py#L129-L148
I'm using the 1.1.6 library version with the following optimizer configuration:
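A minimal sketch of a `hivemind.Optimizer` setup of the kind described above; all values here are placeholders, not the actual configuration used in these runs:

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)  # `model` is a placeholder

opt = hivemind.Optimizer(
    dht=dht,
    run_id="convnext_benchmark",   # placeholder experiment name
    batch_size_per_step=32,        # local micro-batch size (placeholder)
    target_batch_size=1024,        # the TBS discussed above (placeholder)
    optimizer=inner_opt,
    use_local_updates=False,       # accumulate until TBS, then average
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)
```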