
[Feature] Add gradient accumulation #292

Open
XinDongol opened this issue May 1, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

XinDongol commented May 1, 2024

Gradient accumulation (micro steps) could be very useful when we want a large effective batch size but have a limited number of GPUs.
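
(For reference, a minimal sketch of plain gradient accumulation with no FSDP-specific handling; num_microbatches and the surrounding model/optimizer/data_iterator setup are illustrative, not torchtitan's actual training loop.)

num_microbatches = 4  # effective batch size = per-step batch size * num_microbatches
optimizer.zero_grad()
for _ in range(num_microbatches):
    input_ids, labels = next(data_iterator)
    pred = model(input_ids)
    # scale the loss so the accumulated gradients equal the full-batch average
    loss = loss_fn(pred, labels) / num_microbatches
    loss.backward()  # gradients accumulate into param.grad across micro steps
optimizer.step()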

wanchaol (Contributor) commented May 1, 2024

@XinDongol do you mean microbatching or pipeline parallel?

lessw2020 (Contributor) commented

@awgu - is there a context manager or similar option in FSDP2 that would support gradient accumulation and thus enable this in torchtitan? I know we talked about this for HSDP, but I'm not sure about generic FSDP2.

awgu (Contributor) commented May 1, 2024

I am guessing this is asking for normal microbatching. There are similar APIs for FSDP2 that can control communication during gradient accumulation.

We migrated the no_sync() context manager to the method module.set_requires_gradient_sync(bool), so it can simply be placed at the top of the training loop as module.set_requires_gradient_sync(is_last_microbatch). Note, however, that for memory-constrained cases we typically prefer to just proceed as normal and reduce-scatter every microbatch.
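
(A rough sketch of the migration, assuming fsdp_model is an FSDP1 FullyShardedDataParallel module in the first loop and an FSDP2 fully_shard-ed module in the second; num_microbatches and data_iterator are illustrative.)

import contextlib

# FSDP1 style: suppress gradient communication with the no_sync() context manager,
# entering it for every microbatch except the last
for i in range(num_microbatches):
    input_ids, labels = next(data_iterator)
    ctx = fsdp_model.no_sync() if i < num_microbatches - 1 else contextlib.nullcontext()
    with ctx:
        (loss_fn(fsdp_model(input_ids), labels) / num_microbatches).backward()

# FSDP2 style: set the flag once at the top of each micro step
for i in range(num_microbatches):
    input_ids, labels = next(data_iterator)
    fsdp_model.set_requires_gradient_sync(i == num_microbatches - 1)
    (loss_fn(fsdp_model(input_ids), labels) / num_microbatches).backward()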

XinDongol (Author) commented May 1, 2024

Thanks for the update.
@wanchaol Yes, I am talking about microbatching.

torchtitan/train.py, lines 291 to 294 (at commit 58b1169):

with loss_parallel_ctx():
    pred = model(input_ids)
    loss = loss_fn(pred, labels)
    loss.backward()

@awgu is it sufficient to make the following change? Thanks.
from (current):

with loss_parallel_ctx():
    pred = model(input_ids)
    loss = loss_fn(pred, labels)
    loss.backward()

to:

for microbatch_idx in range(microbatch):
    batch = next(data_iterator)
    input_ids, labels = batch
    model.set_requires_gradient_sync(microbatch_idx == (microbatch - 1))
    with loss_parallel_ctx():
        pred = model(input_ids)
        loss = loss_fn(pred, labels) / microbatch
        loss.backward()

awgu (Contributor) commented May 1, 2024

@XinDongol I think that is sufficient.

If you want to avoid the reduce-scatter in backward, then what you have is right. Note, however, that this means gradients are left unsharded through backward, which may use too much memory depending on the workload.

If you still want to reduce-scatter in backward, you can simply remove that model.set_requires_gradient_sync line (effectively leaving it at the default of True).
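
(I.e., a sketch of the default-communication variant: drop the set_requires_gradient_sync call, keep the loss scaling, and let reduce-scatter run in every backward so gradients stay sharded.)

for microbatch_idx in range(microbatch):
    input_ids, labels = next(data_iterator)
    with loss_parallel_ctx():
        pred = model(input_ids)
        loss = loss_fn(pred, labels) / microbatch  # still scale by the number of micro steps
        loss.backward()  # reduce-scatter runs in each backward (default requires_gradient_sync=True)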

tianyu-l added the enhancement (New feature or request) label on May 3, 2024