
[BUG] #827

Open · chrisgao7 opened this issue May 14, 2024 · 0 comments
Issue Title
Megatron-LM: ZeRO-1 with Distributed Optimizer Showing No Overlap of Communication and Computation

Issue Description
We are experiencing an issue with Megatron-LM where enabling ZeRO-1 (--overlap-grad-reduce --overlap-param-gather) along with the distributed optimizer (--use-distributed-optimizer) does not result in the expected overlap of communication and computation during training. Despite increasing CUDA_DEVICE_MAX_CONNECTIONS, we still observe serial execution of the communication and computation steps.

Steps to Reproduce
1. Set up Megatron-LM training with ZeRO-1 overlap enabled (--overlap-grad-reduce --overlap-param-gather).
2. Enable the distributed optimizer (--use-distributed-optimizer).
3. Increase CUDA_DEVICE_MAX_CONNECTIONS to a higher value.
4. Start the training process and observe the execution flow (see the launch sketch below).
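
For concreteness, a minimal launch sketch corresponding to the steps above. The torchrun invocation, script name (pretrain_gpt.py), process count, and the CUDA_DEVICE_MAX_CONNECTIONS value are illustrative assumptions, not the exact command used for this report; only the three overlap-related flags and the environment variable are the ones under discussion.

    # Step 3: CUDA_DEVICE_MAX_CONNECTIONS increased to a higher value (32 is illustrative).
    export CUDA_DEVICE_MAX_CONNECTIONS=32

    # Steps 1-2: enable the distributed optimizer and communication/computation overlap.
    # Model, data, and parallelism arguments are omitted from this sketch.
    torchrun --nproc_per_node=8 pretrain_gpt.py \
        --use-distributed-optimizer \
        --overlap-grad-reduce \
        --overlap-param-gather
        # ... remaining training arguments omitted
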
Expected Behavior
We expect to see overlap of communication and computation during training, as enabled by ZeRO-1 and the distributed optimizer.

Actual Behavior
Communication and computation steps are executed serially, without any overlap.

Environment
Megatron-LM version: cafda95
PyTorch version: 2.1.0a0+32f93b1
CUDA version: 12.2
GPU models and configuration: H800

Timeline
[Screenshot attached in the original issue: 企业微信截图_17156892613507 ("WeCom screenshot"), showing the training timeline]
