[BUG] #827
Issue Title
Megatron-LM: Zero-1 with Distributed Optimizer Showing No Overlap in Communication and Computation
Issue Description
We are experiencing an issue in Megatron-LM where enabling ZeRO-1-style sharding via the distributed optimizer (--use-distributed-optimizer) together with the overlap flags (--overlap-grad-reduce --overlap-param-gather) does not produce the expected overlap of communication and computation during training. Even after increasing CUDA_DEVICE_MAX_CONNECTIONS, we still observe serial execution of the communication and computation steps.
Steps to Reproduce
1. Set up Megatron-LM training with the overlap flags enabled (--overlap-grad-reduce --overlap-param-gather).
2. Enable the distributed optimizer (--use-distributed-optimizer).
3. Increase CUDA_DEVICE_MAX_CONNECTIONS to a higher value.
4. Start the training process and observe the execution flow (see the launch sketch below).
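For reference, a minimal sketch of the launch we use. The entry point, process count, and the CUDA_DEVICE_MAX_CONNECTIONS value shown are placeholders standing in for our actual configuration; only the optimizer- and overlap-related flags matter here:

```bash
# Sketch of the reproduction launch; model, data, and parallelism
# arguments are omitted as they are not relevant to the overlap issue.
export CUDA_DEVICE_MAX_CONNECTIONS=32   # hypothetical higher value we tried; no effect observed

torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    ...  # remaining model and data arguments omitted
```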
Expected Behavior
We expect gradient reduction and parameter gathering to overlap with computation during training, which is precisely what these flags are intended to enable.
Actual Behavior
Communication and computation steps are executed serially, without any overlap.
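A sketch of how the serialization can be observed, assuming Nsight Systems is available (the output name and trace options are illustrative): in the resulting timeline, the NCCL communication kernels and the compute kernels run back-to-back rather than concurrently.

```bash
# Capture a CUDA timeline of the same launch as above; inspecting the
# trace shows communication and computation executing serially.
nsys profile -t cuda,nvtx -o megatron_overlap_trace \
    torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer --overlap-grad-reduce --overlap-param-gather \
    ...  # same arguments as the reproduction launch
```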
Environment
Megatron-LM version: cafda95
PyTorch version: 2.1.0a0+32f93b1
CUDA version: 12.2
GPU models and configuration: H800