🐛[BUG]: Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
#691
Open
wlu1998 opened this issue on Oct 16, 2024 · 7 comments
When I train GraphCast on a single node with multiple GPUs, I encounter the following error message:
[rank3]:[E1016 03:33:45.317536384 ProcessGroupNCCL.cpp:568] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600011 milliseconds before timing out.
[rank3]:[E1016 03:33:45.319844411 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 3] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319862434 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 3] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank3]:[E1016 03:33:45.319870659 ProcessGroupNCCL.cpp:582] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E1016 03:33:45.380539738 ProcessGroupNCCL.cpp:568] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[rank2]:[E1016 03:33:45.382855328 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 2] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382875465 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 2] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank2]:[E1016 03:33:45.382884031 ProcessGroupNCCL.cpp:582] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1016 03:33:45.400176821 ProcessGroupNCCL.cpp:568] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600095 milliseconds before timing out.
[rank1]:[E1016 03:33:45.401444623 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401455763 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 1] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank1]:[E1016 03:33:45.401460371 ProcessGroupNCCL.cpp:582] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:34:00.349118707 ProcessGroupNCCL.cpp:568] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=ALLTOALL, NumelIn=50507776, NumelOut=50507776, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E1016 03:34:00.351482167 ProcessGroupNCCL.cpp:1583] [PG 0 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351502204 ProcessGroupNCCL.cpp:1628] [PG 0 Rank 0] Timeout at NCCL work: 50, last enqueued NCCL work: 50, last completed NCCL work: 49.
[rank0]:[E1016 03:34:00.351511431 ProcessGroupNCCL.cpp:582] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1016 03:52:46.582729551 ProcessGroupNCCL.cpp:1304] [PG 0 Rank 0] First PG on this rank that detected no heartbeat of its watchdog.
[rank0]:[E1016 03:52:46.582797502 ProcessGroupNCCL.cpp:1342] [PG 0 Rank 0] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=0
[rank0]:[F1016 04:02:46.583303966 ProcessGroupNCCL.cpp:1168] [PG 0 Rank 0] [PG 0 Rank 0] ProcessGroupNCCL's watchdog got stuck for 600 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api, or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang. workMetaList_.size() = 0
I added the following environment variables:
export NCCL_IB_DISABLE=1
export TORCH_NCCL_ENABLE_MONITORING=0
but they did not solve the problem.
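(For reference, a hedged debugging sketch rather than a confirmed fix: a few additional environment variables that are commonly used to narrow down NCCL hangs like this one; the timeout value below is illustrative.)
# Print NCCL transport/topology selection on each rank, which usually shows where the collective stalls.
export NCCL_DEBUG=INFO
# Disable peer-to-peer transport; P2P problems (e.g. IOMMU/ACS settings) are a common cause of single-node multi-GPU hangs.
export NCCL_P2P_DISABLE=1
# Give the watchdog heartbeat monitor more time before it terminates the process (the knob named in the log above);
# note that this does not change the 600000 ms collective timeout itself.
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800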
For the default configs, you need a GPU with at least 80GB of memory.
Also, GraphCast only supports distributed data parallelism, so the per-GPU memory requirement does not decrease when you run on multiple GPUs.
I suggest trying to get this working on a single GPU for now.
You can reduce the size of the model and its memory overhead by changing configs such as processor_layers, hidden_dim, and mesh_level: https://github.com/NVIDIA/modulus/blob/main/examples/weather/graphcast/conf/config.yaml#L29
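(As an illustration only, a minimal sketch of such a reduction, assuming train_graphcast.py exposes these options as top-level Hydra keys as the linked config.yaml suggests; the values are examples, not recommended settings.)
# Hypothetical Hydra-style overrides; smaller values reduce memory use at the cost of model capacity.
python train_graphcast.py processor_layers=8 hidden_dim=256 mesh_level=5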
My setup is 4 × 24 GB GPUs, so I can only run with multiple GPUs. And I want to solve the issue of "Some NCCL operations have failed or timed out".
Thank you for the suggestion, I will try it. Actually, I have already tried switching from the annual data to the daily data.
Version
24.07
On which installation method(s) does this occur?
Docker
Describe the issue
See the problem description and full error log above.
Minimum reproducible example
No response
Relevant log output
No response
Environment details
docker run -dit --restart always --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash
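(A hedged aside: NGC container documentation generally recommends enabling host IPC for multi-GPU PyTorch workloads; a variant of the command above with only --ipc=host added would be:)
docker run -dit --restart always --ipc=host --shm-size 64G --ulimit memlock=-1 --ulimit stack=67108864 --runtime nvidia -v /data/modulus:/data/modulus --name modulus --gpus all nvcr.io/nvidia/modulus/modulus:24.07 /bin/bash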