[BUG] #827
Issue Title
Megatron-LM: Zero-1 with Distributed Optimizer Showing No Overlap in Communication and Computation
Issue Description
We are experiencing an issue in Megatron-LM where enabling ZeRO-1-style sharding via the distributed optimizer (--use-distributed-optimizer) together with the overlap flags (--overlap-grad-reduce --overlap-param-gather) does not produce the expected overlap of communication and computation during training. Even after increasing CUDA_DEVICE_MAX_CONNECTIONS, we still observe serial execution of the communication and computation steps.
Steps to Reproduce
1. Set up Megatron-LM training with the overlap flags enabled (--overlap-grad-reduce --overlap-param-gather).
2. Enable the distributed optimizer (--use-distributed-optimizer).
3. Increase CUDA_DEVICE_MAX_CONNECTIONS to a higher value.
4. Start the training process and observe the execution flow (see the launch sketch below).
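For reference, a minimal sketch of the launch we use. The entry point, process count, and the CUDA_DEVICE_MAX_CONNECTIONS value shown are placeholders standing in for our actual configuration; only the optimizer- and overlap-related flags matter here:

```bash
# Sketch of the reproduction launch; model, data, and parallelism
# arguments are omitted as they are not relevant to the overlap issue.
export CUDA_DEVICE_MAX_CONNECTIONS=32   # hypothetical higher value we tried; no effect observed

torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    ...  # remaining model and data arguments omitted
```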
Expected Behavior
We expect gradient reduction and parameter gathering to overlap with computation during training, which is precisely what these flags are intended to enable.
Actual Behavior
Communication and computation steps are executed serially, without any overlap.
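A sketch of how the serialization can be observed, assuming Nsight Systems is available (the output name and trace options are illustrative): in the resulting timeline, the NCCL communication kernels and the compute kernels run back-to-back rather than concurrently.

```bash
# Capture a CUDA timeline of the same launch as above; inspecting the
# trace shows communication and computation executing serially.
nsys profile -t cuda,nvtx -o megatron_overlap_trace \
    torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer --overlap-grad-reduce --overlap-param-gather \
    ...  # same arguments as the reproduction launch
```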
Environment
Megatron-LM version: cafda95
PyTorch version: 2.1.0a0+32f93b1
CUDA version: 12.2
GPU models and configuration: H800