
[RFC][FSDP2] Added register_fsdp_forward_method for user fwd methods #125394

Closed

Conversation

@awgu (Contributor) commented May 2, 2024

Stack from ghstack (oldest at bottom):

FSDP only runs its pre/post-forward hooks on `nn.Module.forward`. This means that if the user runs a custom method meant as a forward pass, then FSDP will not all-gather the parameters. Examples include HuggingFace models' `generate()` (#123962, #100069) and others (#109385).

This PR adds a monkey-patching API `register_fsdp_forward_method(module: nn.Module, method_name: str)` to allow FSDP pre/post-forward hooks to run on the given method. The function is a no-op if the passed-in module is not an FSDP module, so the register function can be called even if the FSDP wrapping changes.
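For illustration, a minimal usage sketch under the assumptions of this PR (the `Encoder`/`forward_features` names are made up for the example, the import path may differ by release, and process-group initialization is omitted):

```python
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard, register_fsdp_forward_method

class Encoder(nn.Module):  # hypothetical module for illustration
    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(16, 16)

    def forward(self, x):
        return self.proj(x)

    def forward_features(self, x):  # custom forward-like method
        return self.proj(x)

model = Encoder()
fully_shard(model)  # FSDP2 wrapping (requires an initialized process group)
# Without the call below, model.forward_features(...) would run on sharded
# parameters; with it, FSDP's pre/post-forward hooks run around the method.
register_fsdp_forward_method(model, "forward_features")
```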

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@pytorch-bot commented May 2, 2024

🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/125394.

✅ As of commit 4428b8c with merge base b03fb49, you can merge normally (3 unrelated failures, flagged as FLAKY: likely due to flakiness present on trunk).

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@awgu (Contributor, Author) commented May 2, 2024

From mosaicml/composer:

```python
# Note: We need to use the FSDP.summon_full_params context manager here because the generate function
# does not seem to gather the weights for the LM head. This solution works because the tied weights of the LM head
# are in the root FSDP module, and are summoned by the below context manager. See https://github.com/pytorch/pytorch/issues/100069
# for more info.
# Note: We use recurse=False here so that we only summon full params for the LM head, not the entire model.
with FSDP.summon_full_params(self.model, writeback=False, recurse=False):
    return self.model.generate(input_ids=input_ids, pad_token_id=pad_token_id, **kwargs)
```

We should be able to replace this with:

```python
register_fsdp_forward_method(self.model, "generate")  # call once at init time
...
return self.model.generate(input_ids=input_ids, pad_token_id=pad_token_id, **kwargs)
```
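With `generate` registered, FSDP's pre-forward hook all-gathers the root module's parameters (including the tied LM head weights, which per the comment above live on the root FSDP module) for the duration of the call, so the `summon_full_params` workaround should no longer be needed.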

```diff
@@ -314,3 +315,35 @@ def wait(self):
         self._fsdp_param_group.wait_for_unshard()
         # Avoid keeping a reference
         self._fsdp_param_group = None
+
+
+def register_fsdp_forward_method(module: nn.Module, method_name: str) -> None:
```
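The function body is collapsed in this diff view. As a rough conceptual sketch only (not the PR's actual code; `_get_module_fsdp_state`, `_pre_forward`, and `_post_forward` are stand-in names), the monkey patch wraps the named method so FSDP's pre/post-forward logic runs around it:

```python
import functools
import types

import torch.nn as nn

def _get_module_fsdp_state(module: nn.Module):
    # Stand-in: return this module's FSDP state object if it is FSDP-managed,
    # else None. The real accessor is internal to PyTorch.
    return getattr(module, "_fsdp_state", None)  # hypothetical attribute

def register_fsdp_forward_method(module: nn.Module, method_name: str) -> None:
    state = _get_module_fsdp_state(module)
    if state is None:
        return  # no-op on non-FSDP modules, per the PR description
    orig_method = getattr(module, method_name)

    @functools.wraps(orig_method)
    def wrapped_method(self, *args, **kwargs):
        # Stand-ins for FSDP's hooks: all-gather before, reshard after
        args, kwargs = state._pre_forward(self, args, kwargs)
        out = orig_method(*args, **kwargs)
        return state._post_forward(self, args, out)

    # Bind on the instance so only this one module's method is patched
    setattr(module, method_name, types.MethodType(wrapped_method, module))
```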
@awgu (Contributor, Author) commented on the new function:

cc: @Skylion007 if you have any opinions on this

A collaborator replied:

Nice! This is fantastic. Let me ping some Composer folks and see if they have any more detailed feedback on this PR. :)

@weifengpy (Contributor) left a comment

Nice! I did not know it only takes a few lines of code to support a user-defined forward.

Is this making the assumption that the user-defined forward (e.g. `forward_features`) won't call hooks from `nn.Module`? Otherwise we would get `fsdp_hook(forward_features(fsdp_hook))`.

@awgu (Contributor, Author) commented May 2, 2024

> Is this making the assumption that the user-defined forward (e.g. forward_features) won't call hooks from nn.Module? Otherwise we would get fsdp_hook(forward_features(fsdp_hook))

Since the user-defined forward method (e.g. `forward_features`) is not `nn.Module.forward`, the forward hooks registered on the module would not run for that method anyway, so I think this is not a concern.

Note that this only adds FSDP hooks to the user-defined method for that one particular module. Any nested submodules will run forward normally, so a nested FSDP submodule will just work as normal.
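A toy demonstration of this point (plain `nn.Module`, no FSDP; names are made up for the example) showing that module hooks are dispatched in `__call__` and therefore do not fire for a directly called custom method:

```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.lin = nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x)

    def forward_features(self, x):  # custom forward-like method
        return self.lin(x)

m = M()
m.register_forward_hook(lambda mod, args, out: print("forward hook ran"))
x = torch.randn(2, 4)
m(x)                   # prints "forward hook ran" (hooks run via __call__)
m.forward_features(x)  # prints nothing: hooks are not dispatched here
```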

@weifengpy (Contributor) replied:

> Since the user-defined forward method (e.g. forward_features) is not nn.Module.forward, the forward hooks registered on the module would not run for that method anyway [...]

`forward_features` is under the user's control? I guess we are ignoring the chance that the user calls the module's forward hooks explicitly inside `forward_features`?

@awgu (Contributor, Author) commented May 2, 2024

> forward_features is under the user's control? I guess we are ignoring the chance that the user calls the module's forward hooks explicitly inside forward_features?

That is a good point. We are assuming that the user is not calling the hooks themselves in `forward_features`.

This is not much of a concern to me because (1) users calling the hooks themselves seems unlikely, and (2) FSDP wants to prepend its pre-forward hook anyway. (The post-forward hook being prepended might be an issue though.)

At least for the use cases we have seen in practice, I think this is okay, but your point is definitely valid.

@weifengpy (Contributor) replied:

> ...user-defined method for that one particular module. Any nested submodules will run forward normally

Is the particular module usually the root module, as in `model.generate()`?

@awgu (Contributor, Author) commented May 2, 2024

> Is the particular module usually the root module, as in model.generate()?

Yes, I have mainly seen it for the root module's `.generate()`. The vision transformer example was not the root though 🤔.

@wanchaol (Contributor) left a comment

This SGTM!

@awgu (Contributor, Author) commented May 3, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) replied:

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).


@gaotianyu1350 commented:

Hi @awgu, thanks for the patch! I wonder how this can be used together with `torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel`? It seems that this is only compatible with the `torch.distributed._composable` stuff, which I don't quite understand... Thanks!

@awgu (Contributor, Author) commented May 13, 2024

@gaotianyu1350 Sorry, this does not apply to `torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel`. The current workaround for that is to use `summon_full_params(recurse=False)`.
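For reference, a hedged sketch of that FSDP1 workaround, mirroring the Composer snippet quoted earlier (`model`, `input_ids`, and the tied-weights layout are assumptions of the example):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes `model` is an FSDP1-wrapped module whose tied LM head weights live
# on the root FSDP module; recurse=False summons only the root's parameters.
with FSDP.summon_full_params(model, writeback=False, recurse=False):
    output_ids = model.generate(input_ids=input_ids)
```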
