Optimizer compilation fails with PyTorch 2.2 #158

Open
rosario-purple opened this issue Feb 2, 2024 · 2 comments

Comments

@rosario-purple

What's the issue, what's expected?:

I tried to build the MS-AMP optimizer extension against the newly released PyTorch 2.2:

cd msamp/optim
pip install -v .

but got this error:

    File "/scratch/brr/MS-AMP/msamp/optim/setup.py", line 7, in <module>
      from torch.utils import cpp_extension
    File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
      from torch._C import *  # noqa: F403
  ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

How to reproduce it?:

Importing torch in a Python session is enough to reproduce the error:

>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister

Log message or snapshot?:

See above

Additional information:

My best guess is that MS-AMP pins an older external version of libnccl (2.17.1), while PyTorch 2.2 appears to depend on a newer one (2.19.3). If the older library is the one found at load time, that would explain why libtorch_cuda.so cannot resolve ncclCommRegister.
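One way to test this guess without importing torch (which fails) is to load the NCCL shared library directly and check whether it exports the missing symbol. The following is a minimal sketch, not part of MS-AMP or PyTorch: the helper names are hypothetical, the library name `libnccl.so.2` is an assumption about what the dynamic loader resolves on this machine, and the version decoding assumes the encoding used by NCCL 2.9 and later.

```python
import ctypes


def decode_nccl_version(code):
    """Split an NCCL version code such as 21903 into (major, minor, patch).

    Assumes the encoding used by NCCL 2.9 and later:
    major * 10000 + minor * 100 + patch.
    """
    return code // 10000, (code % 10000) // 100, code % 100


def exports_symbol(lib_path, symbol):
    """Return True or False if the library loads, or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    # ctypes raises AttributeError for unknown symbols, so hasattr works here.
    return hasattr(lib, symbol)


def loaded_nccl_version(lib_path):
    """Query the library through ncclGetVersion; return None on any failure."""
    try:
        lib = ctypes.CDLL(lib_path)
    except OSError:
        return None
    if not hasattr(lib, "ncclGetVersion"):
        return None
    version = ctypes.c_int(0)
    status = lib.ncclGetVersion(ctypes.byref(version))
    if status != 0:  # 0 is ncclSuccess
        return None
    return decode_nccl_version(version.value)


if __name__ == "__main__":
    # Assumption: this is the name the dynamic loader resolves on this system.
    path = "libnccl.so.2"
    print("NCCL version:", loaded_nccl_version(path))
    print("exports ncclCommRegister:", exports_symbol(path, "ncclCommRegister"))
```

If the library that gets loaded reports a version older than the one PyTorch 2.2 was built against and does not export `ncclCommRegister`, that would be consistent with the explanation above. Passing an explicit path instead of `libnccl.so.2` lets the same check be run against each copy of the library found on the system.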

@tocean
Contributor

tocean commented Feb 7, 2024

We haven't tested MS-AMP with PyTorch 2.2. Currently we only support PyTorch 1.14 and 2.1. We recommend using our Docker image or nvcr.io/nvidia/pytorch:23.10-py3. We also plan to upgrade MSCCL to the latest version.

@tocean tocean closed this as completed Aug 13, 2024
@tocean tocean reopened this Aug 13, 2024
@tocean
Contributor

tocean commented Aug 14, 2024

Can you share the complete steps to reproduce this issue?
