I tried to compile the MS-AMP optimizer with the new Torch 2.2:
cd msamp/optim
pip install -v .
but got this error:
File "/scratch/brr/MS-AMP/msamp/optim/setup.py", line 7, in <module>
from torch.utils import cpp_extension
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.
How to reproduce it?:
Running this code in Python reproduces the error:
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
Log message or snapshot?:
See above
Additional information:
My best guess is that this is caused by MS-AMP being pinned to an external old version of libnccl (2.17.1), while PyTorch 2.2 seems to depend on a newer version (2.19.3).
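One way to check this guess without importing torch is to ask the NCCL library directly. The sketch below is a minimal diagnostic, not part of MS-AMP or PyTorch: it loads NCCL through `ctypes`, calls NCCL's `ncclGetVersion`, and checks whether the `ncclCommRegister` symbol is exported. The helper names `decode_nccl_version` and `probe_nccl` are hypothetical, and the default library name `libnccl.so.2` is an assumption. The library found this way depends on the loader search path (for example `LD_LIBRARY_PATH`), so it may not be the exact copy that `libtorch_cuda.so` resolves.

```python
import ctypes


def decode_nccl_version(code):
    """Turn NCCL's integer version code into (major, minor, patch)."""
    if code < 10000:
        # Releases up to 2.8.x encode as major*1000 + minor*100 + patch.
        return code // 1000, (code % 1000) // 100, code % 100
    # Later releases encode as major*10000 + minor*100 + patch.
    return code // 10000, (code % 10000) // 100, code % 100


def probe_nccl(path="libnccl.so.2"):
    """Describe the NCCL library the loader finds, or return None.

    `path` is an assumed default; pass a full path to test a specific copy.
    """
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None  # no NCCL library could be loaded under this name
    lib.ncclGetVersion.argtypes = [ctypes.POINTER(ctypes.c_int)]
    lib.ncclGetVersion.restype = ctypes.c_int
    code = ctypes.c_int(0)
    status = lib.ncclGetVersion(ctypes.byref(code))
    if status != 0:  # ncclSuccess is 0
        return None
    return {
        "version": decode_nccl_version(code.value),
        # The symbol that the ImportError reports as undefined.
        "has_ncclCommRegister": hasattr(lib, "ncclCommRegister"),
    }


if __name__ == "__main__":
    print(probe_nccl())
```

If the reported version is 2.17.1 and `has_ncclCommRegister` is `False`, that would be consistent with an older libnccl shadowing the one PyTorch 2.2 expects.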
We haven't tested MS-AMP with PyTorch 2.2. Currently we only support PyTorch 1.14 and 2.1. We recommend using our Docker image or nvcr.io/nvidia/pytorch:23.10-py3. We also plan to upgrade MSCCL to the latest version.