Cannot run it on windows #14

Open
frankl1 opened this issue Feb 14, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@frankl1

frankl1 commented Feb 14, 2024

Hi,

I wanted to try this implementation after reading the paper. I installed all the dependencies in a Conda env on a Windows PC. However, I get the following error when I run the experiment:

$ python experiment.py -d tic-tac-toe -bs 32 -s 1@16 -e401 -lrde 200 -lr 0.002 -ki 0 -wd 0.0001 --print_rule -i 0
C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
  warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
[W socket.cpp:697] [c10d] The client socket has failed to connect to [A2207000547.china.huawei.com]:47339 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 174, in <module>
    train_main(rrl_args)
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 167, in train_main
    mp.spawn(train_model, nprocs=args.gpus, args=(args,))
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 241, in spawn       
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 197, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 158, in join        
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\multiprocessing\spawn.py", line 68, in _wrap        
    fn(i, *args)
  File "C:\Users\m00827298\Codes\RRL\experiment.py", line 57, in train_model
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper    
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\m00827298\AppData\Local\miniconda3\envs\rrl\Lib\site-packages\torch\distributed\rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistNetworkError: Unknown error
@12wang3
Owner

12wang3 commented Feb 19, 2024

I am not very familiar with running PyTorch in a Windows environment. Based on the error message "Attempted to get default timeout for nccl backend, but NCCL support is not compiled", I suspect the reason might be that NCCL support is not compiled into your PyTorch installation.
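For anyone hitting the same warning, here is a minimal sketch of how to check which `torch.distributed` backends a given PyTorch build reports as available. The helper name `describe_backends` is hypothetical and not part of this repository. The calls it uses (`torch.cuda.is_available`, `torch.distributed.is_available`, `is_nccl_available`, `is_gloo_available`) are standard PyTorch functions.

```python
import importlib.util


def describe_backends():
    """Report which torch.distributed backends this Python environment offers.

    Hypothetical helper for diagnosis only; it is not part of the RRL code.
    """
    # Avoid an ImportError if PyTorch is missing from the active environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch
    import torch.distributed as dist

    info = {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        "distributed_available": dist.is_available(),
    }
    # The per-backend checks only exist when the distributed package is built.
    if dist.is_available():
        info["nccl"] = dist.is_nccl_available()
        info["gloo"] = dist.is_gloo_available()
    return info


if __name__ == "__main__":
    for key, value in describe_backends().items():
        print(f"{key}: {value}")
```

If `nccl` is reported as `False`, a call to `dist.init_process_group(backend='nccl', ...)` like the one in the traceback cannot succeed on that installation.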

@frankl1
Author

frankl1 commented Feb 21, 2024

NCCL seems to be tied to NVIDIA GPUs, and I don't have an NVIDIA GPU in my PC, so I guess that is why I get this warning. Isn't it possible to run the code using only the CPU?

12wang3 added the enhancement (New feature or request) label Mar 17, 2024
@12wang3
Owner

12wang3 commented Mar 17, 2024

CPU is not supported at present. I will add a CPU version in the future. Running on a GPU is still recommended, though, since training on a CPU may be slow.
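Until a CPU version exists, the following is only a rough sketch of how the process-group setup could fall back to the `gloo` backend when NCCL is unavailable. It is not the repository's method. The names `pick_backend` and `init_distributed` are hypothetical, and the `127.0.0.1` address and port `29500` are assumed defaults for a single-machine run. The rest of the code may still assume CUDA tensors, so this change alone may not be enough to train on a CPU.

```python
import os


def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Choose 'nccl' only when both CUDA and NCCL are usable, else 'gloo'."""
    return "nccl" if (cuda_available and nccl_available) else "gloo"


def init_distributed(rank: int, world_size: int) -> str:
    """Initialise torch.distributed with a backend the local build supports.

    Hypothetical replacement for a hard-coded backend='nccl' call.
    """
    import torch
    import torch.distributed as dist

    if not dist.is_available():
        raise RuntimeError("This PyTorch build has no distributed support.")

    # init_method='env://' reads MASTER_ADDR and MASTER_PORT from the
    # environment. These values are assumptions for a single-machine run and
    # are only applied if the variables are not already set.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    backend = pick_backend(torch.cuda.is_available(), dist.is_nccl_available())
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )
    return backend
```

Keeping the backend choice in a small pure function makes it easy to test without a GPU. Any model or tensor that is moved to a CUDA device elsewhere in the code would still need a matching CPU path.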

@wanmaxiaobai

Has your issue been resolved?

@frankl1
Author

frankl1 commented Apr 29, 2024

Thanks for asking. I will give it another try when I get a GPU.
