Describe the bug
I used an overlay network to connect containers on two hosts, and configured passwordless SSH along with the relevant /etc/hosts and hostfile. However, I was unable to start training with the command `deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py`. After checking deepspeed.ai, I found that I can start training using the "Launching without passwordless SSH" method with the command `deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=10.0.1.13 test.py`. I would like to know what is causing this issue.
These are the logs from my training run and some of my configuration.
`root@903c1e9c351c:/home/user/code# deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
[2024-12-16 07:28:11,223] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 470, in main
    subprocess.check_call(safe_ssh_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'PasswordAuthentication=no', 'manager', 'hostname']' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/deepspeed", line 6, in
    main()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 472, in main
    raise RuntimeError(
RuntimeError: Using hostfile at hostfile but host=manager was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.`
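
The traceback shows the launcher running `ssh -o PasswordAuthentication=no manager hostname` with stdout and stderr sent to DEVNULL, so the underlying SSH error never reaches the terminal. Exit status 255 is what `ssh` returns when the connection or authentication itself fails. A minimal sketch for rerunning the same probe with stderr visible follows; the helper names `build_ssh_probe` and `probe_host` and the timeout value are illustrative assumptions, not part of DeepSpeed.

```python
import subprocess


def build_ssh_probe(host):
    """Return the probe command that appears in the traceback for `host`."""
    return ["ssh", "-o", "PasswordAuthentication=no", host, "hostname"]


def probe_host(host, timeout=15):
    """Run the probe and return (exit_code, stderr_text).

    Unlike the launcher call in the traceback, stderr is captured so the
    reason for a failure (refused connection, denied key, ...) is visible.
    The 124/127 codes below are arbitrary markers chosen for this sketch.
    """
    try:
        result = subprocess.run(
            build_ssh_probe(host),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return 127, "ssh client not found in this container"
    except subprocess.TimeoutExpired:
        return 124, "ssh probe timed out"
    return result.returncode, result.stderr.strip()


# Example usage inside the container that runs the launcher:
# for host in ("manager", "worker"):
#     print(host, probe_host(host))
```

The host that failed here is `manager`, which /etc/hosts in this report maps to 10.0.1.13, the same address `ifconfig` shows for the container running the launcher. The probe is therefore worth running against the local container as well as `worker`.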
Thanks for raising this question. I would suggest still using the deepspeed launcher with SSH enabled. To make it work across containers on different hosts, you should do the following (assuming 2 containers located on 2 separate hosts):
1. Make sure you strictly follow the Docker overlay network setup here. One final thing to check: say you have two nodes, where host1 is the swarm host and host2 has joined host1's swarm network. After you launch the container on host2, you should see the test-net network from the tutorial when you run docker network ls on host2.
2. After setting up the public keys across the two containers (i.e. authorized_keys), try to SSH from container1 to container2 (and vice versa) to see if it works. If this part does not work, it is highly likely that:
2.1 The access permissions (i.e. chmod) on the .ssh folder or authorized_keys are wrong.
2.2 /etc/ssh/sshd_config is not set up correctly. It should contain
Port 22
PermitRootLogin yes
PubkeyAuthentication yes
and you should then run service ssh restart on both containers.
3. Once the above 2 steps work correctly, you should be able to use the deepspeed launcher with SSH enabled.
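
Checks 2.1 and 2.2 can be sketched in Python as below. This is only an illustration under stated assumptions: the function names are made up for this example, the permission test only flags group- or world-writable paths (the usual recommendation is 700 on .ssh and 600 on authorized_keys), and the config parser ignores `Match` blocks and `Include` directives.

```python
import os
import stat


def mode_of(path):
    """Return the permission bits of `path`, e.g. 0o700."""
    return stat.S_IMODE(os.stat(path).st_mode)


def too_permissive(mode):
    """True if group or others have write permission in `mode`."""
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def parse_sshd_config(text):
    """Map lowercased keyword -> value from sshd_config-style text.

    Keywords are case-insensitive and the first value seen for a keyword
    is kept. Match blocks and Include directives are NOT handled here.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        settings.setdefault(parts[0].lower(), value)
    return settings


def missing_settings(settings):
    """List the settings from step 2.2 that are absent or different."""
    wanted = {
        "port": "22",
        "permitrootlogin": "yes",
        "pubkeyauthentication": "yes",
    }
    return [key for key, value in wanted.items() if settings.get(key) != value]


# Example usage inside each container (paths assume the root user):
# for path in ("/root/.ssh", "/root/.ssh/authorized_keys"):
#     print(path, oct(mode_of(path)), too_permissive(mode_of(path)))
# with open("/etc/ssh/sshd_config") as handle:
#     print(missing_settings(parse_sshd_config(handle.read())))
```

A setting reported as missing may simply be commented out and left at its compiled-in default. Running `sshd -T` prints the effective configuration and is the more reliable check when it is available.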
Configuration on the manager container:
root@903c1e9c351c:/home/user/code# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.1.13 903c1e9c351c
10.0.1.13 manager
10.0.1.15 worker
root@903c1e9c351c:/home/user/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
        inet 10.0.1.13 netmask 255.255.255.0 broadcast 10.0.1.255
        ether 02:42:0a:00:01:0d txqueuelen 0 (Ethernet)
        RX packets 512 bytes 78880 (78.8 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 471 bytes 79480 (79.4 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Contents of hostfile:
manager slots=1
worker slots=1
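
As a quick consistency check, the names in the hostfile can be compared against /etc/hosts to confirm that every host the launcher will SSH to has an address. The sketch below assumes the `hostname slots=N` hostfile layout shown above; the function names are illustrative and not part of DeepSpeed.

```python
def parse_etc_hosts(text):
    """Map each hostname in /etc/hosts-style text to its first listed IP."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)
    return table


def parse_hostfile(text):
    """Map each hostname in a `hostname slots=N` hostfile to N."""
    slots = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, slot_spec = line.split()
        key, _, count = slot_spec.partition("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {raw!r}")
        slots[host] = int(count)
    return slots


def unresolved_hosts(hostfile_slots, hosts_table):
    """Return hostfile names that have no entry in the hosts table."""
    return [host for host in hostfile_slots if host not in hosts_table]


# Example usage inside the container:
# with open("/etc/hosts") as h, open("hostfile") as f:
#     print(unresolved_hosts(parse_hostfile(f.read()), parse_etc_hosts(h.read())))
```

With the files from this report, both `manager` and `worker` resolve (to 10.0.1.13 and 10.0.1.15), so name resolution is unlikely to be the cause. That leaves the SSH service and key setup inside the containers as the more likely area to inspect.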