I think I found a slight performance issue with Hivemind. A call to `opt.step()` before the target batch size (TBS) is reached, which only accumulates gradients, is noticeably slower than native PyTorch gradient accumulation. The slowdown scales with the model's parameter count, so I presume it is not DHT-related.
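For context, here is a minimal sketch of the two accumulation patterns being compared (`model`, `loader`, `criterion`, `accum_steps`, `inner_opt`, and `opt` are placeholders, not the actual benchmark code; the hivemind loop follows the library's documented pattern of calling `opt.step()` after every micro-batch):

```python
# Baseline: native PyTorch gradient accumulation. Gradients simply add up in
# param.grad across several backward() calls; the inner optimizer steps once
# per effective batch.
for i, (x, y) in enumerate(loader):
    loss = criterion(model(x), y) / accum_steps
    loss.backward()                      # accumulation is essentially free here
    if (i + 1) % accum_steps == 0:
        inner_opt.step()
        inner_opt.zero_grad()

# Hivemind: opt.step() is called after every micro-batch; hivemind.Optimizer
# accumulates gradients internally until target_batch_size samples have been
# collected and only then performs the actual (averaged) update.
for x, y in loader:
    opt.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()                           # pre-TBS calls only accumulate
```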
Here's the experimental "proof". The single-GPU experiment is a baseline without hivemind; the 2-, 3-, 4-, and 8-GPU runs use hivemind. The first figure shows the average `backward_s` timing, where no real difference is visible (in the 1-GPU experiment this is the call that performs gradient accumulation; for the 2-8 GPU runs, accumulation happens inside `opt.step()`).
This means that gradient accumulation is essentially free in the baseline PyTorch implementation.
The second figure shows the no-sync `opt.step()` call timings for the 2-8 GPU runs. The slowdown grows with model size, suggesting something that depends on the parameter count. Maybe a GPU-to-CPU memory copy or an on-GPU copy is happening?
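One way to check that hypothesis (not part of the original measurements) would be to wrap a single pre-TBS step in the PyTorch profiler and look for `Memcpy DtoH`/`HtoD` entries or large element-wise kernels; `model`, `opt`, `criterion`, `x`, and `y` are placeholders here:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one micro-batch plus the no-sync opt.step() to see where time goes.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = criterion(model(x), y)
    loss.backward()
    opt.step()  # pre-TBS step: should only accumulate gradients

print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=20))
```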
I also compared the actual throughput impact of this slower no-sync `opt.step()`: at worst it reaches only 48% (ConvNextLarge) and at best 78% (ResNet152) of the baseline performance. The comparison is between the normalized local hivemind throughput (i.e., without averaging) and the baseline 1-GPU throughput.
Furthermore, this is independent of the TBS: the following figure shows the same no-sync `opt.step()` timings (in this case from 2-GPU runs):
I think the relevant code is here: https://github.com/learning-at-home/hivemind/blob/master/hivemind/optim/grad_averager.py#L129-L148
I'm using the 1.1.6 library version with the following optimizer configuration:
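A minimal sketch of a `hivemind.Optimizer` setup of the kind described above; all values here are placeholders, not the actual configuration used in these runs:

```python
import torch
import hivemind

dht = hivemind.DHT(start=True)
inner_opt = torch.optim.SGD(model.parameters(), lr=0.1)  # `model` is a placeholder

opt = hivemind.Optimizer(
    dht=dht,
    run_id="convnext_benchmark",   # placeholder experiment name
    batch_size_per_step=32,        # local micro-batch size (placeholder)
    target_batch_size=1024,        # the TBS discussed above (placeholder)
    optimizer=inner_opt,
    use_local_updates=False,       # accumulate until TBS, then average
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)
```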