You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[E108 20:04:34.467642512 socket.cpp:957] [c10d] The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
result = agent.run()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in run
result = self._invoke_run(role)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 829, in _invoke_run
self._initialize_workers(self._worker_group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 652, in _initialize_workers
self._rendezvous(worker_group)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
result = f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 489, in _rendezvous
rdzv_info = spec.rdzv_handler.next_rendezvous()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 66, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
torch.distributed.DistNetworkError: The client socket has timed out after 900s while trying to connect to (33.145.124.128, 29500).
Reminder
System Info
与环境无关
Reproduction
使用的config文件如下:
启动命令如下:
报错信息如下:
我看过torch这部分的源码,有一个timeout的参数,但是关键在于如何在不修改源码的情况下从外部传入这个参数呢?
麻烦作者大大解答
Others
No response
The text was updated successfully, but these errors were encountered: