[BUG] Cannot use --hostfile to start multi-node training in Docker. #6875

Open
Ind1x1 opened this issue Dec 16, 2024 · 1 comment
Labels
bug Something isn't working training

Comments


Ind1x1 commented Dec 16, 2024

Describe the bug
I used an overlay network to connect containers on two hosts, and configured passwordless SSH along with the relevant /etc/hosts entries and a hostfile. However, I was unable to start training with the command deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py. After checking deepspeed.ai, I found that I can start training using the "Launching without passwordless SSH" method with the command
deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=10.0.1.13 test.py
I would like to know what is causing this issue.
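For reference, the no-SSH workaround has to be started once on every node, each with its own --node_rank. A minimal sketch that prints the per-node commands, assuming rank 0 runs in the manager container and rank 1 in the worker container; the helper name no_ssh_commands is hypothetical and not part of DeepSpeed:

```python
import shlex


def no_ssh_commands(num_nodes, master_addr, script, hostfile="hostfile"):
    """Return one launch command per node rank (hypothetical helper).

    The flags are the ones quoted in the report above; only the
    --node_rank value changes from node to node.
    """
    commands = []
    for rank in range(num_nodes):
        argv = [
            "deepspeed",
            f"--hostfile={hostfile}",
            "--no_ssh",
            f"--node_rank={rank}",
            f"--master_addr={master_addr}",
            script,
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Run the printed command for rank N inside the container for node N.
    for rank, cmd in enumerate(no_ssh_commands(2, "10.0.1.13", "test.py")):
        print(f"node {rank}: {cmd}")
```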

Here are the log from the failed launch and the relevant configuration.
`root@903c1e9c351c:/home/user/code# deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 1 test.py
[2024-12-16 07:28:11,223] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 470, in main
    subprocess.check_call(safe_ssh_cmd, stderr=subprocess.DEVNULL, stdout=subprocess.DEVNULL)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'PasswordAuthentication=no', 'manager', 'hostname']' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/deepspeed", line 6, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/launcher/runner.py", line 472, in main
    raise RuntimeError(
RuntimeError: Using hostfile at hostfile but host=manager was not reachable via ssh. If you are running with a single node please remove hostfile or setup passwordless ssh.`
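The traceback above shows that the launcher runs ssh -o PasswordAuthentication=no manager hostname with both output streams discarded, so the real SSH error never appears. Exit status 255 is what ssh returns when ssh itself fails (connection or authentication error) rather than the remote command. A rough sketch that repeats the same check for each host while leaving stderr visible; the function names are hypothetical and this is not DeepSpeed code:

```python
import subprocess


def ssh_check_cmd(host):
    """Build the same argv that appears in the CalledProcessError above."""
    return ["ssh", "-o", "PasswordAuthentication=no", host, "hostname"]


def unreachable_hosts(hosts, run=subprocess.call):
    """Run the check for each host and return {host: exit_code} for failures.

    With the default runner, ssh's stderr is not redirected, so the
    underlying error message is printed to the terminal.
    """
    failed = {}
    for host in hosts:
        code = run(ssh_check_cmd(host))
        if code != 0:
            failed[host] = code
    return failed


# Example, to be run inside the container that launches training:
#   print(unreachable_hosts(["manager", "worker"]))
```

Note that the check also targets the launching node itself (manager here), so the container must be able to SSH into itself by that name.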

root@903c1e9c351c:/home/user/code# cat /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
10.0.1.13 903c1e9c351c
10.0.1.13 manager
10.0.1.15 worker
root@903c1e9c351c:/home/user/code# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
        inet 10.0.1.13 netmask 255.255.255.0 broadcast 10.0.1.255
        ether 02:42:0a:00:01:0d txqueuelen 0 (Ethernet)
        RX packets 512 bytes 78880 (78.8 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 471 bytes 79480 (79.4 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

manager slots=1
worker slots=1
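Each hostfile entry is a hostname followed by slots=N, one entry per line. A small sketch of reading that format, useful for collecting the hostnames that must be reachable over SSH; parse_hostfile is a hypothetical helper written for illustration and is not DeepSpeed's own parser:

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {hostname: slots} dict.

    Blank lines are skipped. Anything else that does not match the
    expected shape raises ValueError instead of being ignored.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"bad hostfile line: {raw!r}")
        name, slot_spec = parts
        key, _, value = slot_spec.partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"bad hostfile line: {raw!r}")
        if name in hosts:
            raise ValueError(f"duplicate host: {name!r}")
        hosts[name] = int(value)
    return hosts


if __name__ == "__main__":
    # The two entries from the report, one per line.
    print(parse_hostfile("manager slots=1\nworker slots=1\n"))
```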

@Ind1x1 Ind1x1 added bug Something isn't working training labels Dec 16, 2024
Member

GuanhuaWang commented Dec 18, 2024

Hi @Ind1x1

Thanks for raising this question. I would still suggest using the deepspeed launcher with SSH enabled. To make it work across containers on different hosts, do the following (assuming 2 containers located on 2 separate hosts):

  1. Make sure you strictly follow the Docker overlay network setup tutorial. One final thing to check: say you have two nodes, where host1 is the swarm host and host2 has joined host1's swarm network. After you launch the container on host2, the test-net network from the tutorial should show up when you run docker network ls on host2.
  2. After setting up the public keys across the two containers (i.e. authorized_keys), try to ssh from container1 to container2 (and vice versa) to see if it works. If it does not, it is highly likely that:
    2.1 the access permissions (i.e. chmod) on the .ssh folder or on authorized_keys are wrong, or
    2.2 /etc/ssh/sshd_config is not set up correctly; it should contain
port 22
PermitRootLogin yes
PubkeyAuthentication yes 

Then run service ssh restart on both containers.

  3. Once the two steps above work, you should be able to use the deepspeed launcher with SSH enabled.
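The checks in 2.1 and 2.2 can be scripted inside each container. The following is a rough sketch under stated assumptions: all function names are hypothetical, the permission test (no group/other write bits on .ssh and authorized_keys) is modeled on what OpenSSH's StrictModes option rejects, and the config reader only looks at plain "Keyword value" lines, ignoring Match blocks and Include files.

```python
import os
import stat

# Settings quoted in the comment above, keyed by lowercased keyword
# (sshd_config keywords are case-insensitive).
WANTED = {"port": "22", "permitrootlogin": "yes", "pubkeyauthentication": "yes"}


def is_too_open(mode):
    """True if the permission bits let group or other write."""
    return bool(stat.S_IMODE(mode) & 0o022)


def loose_ssh_paths(home):
    """Return (path, octal mode) for .ssh entries writable by group/other."""
    candidates = [
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    loose = []
    for path in candidates:
        try:
            mode = os.stat(path).st_mode
        except FileNotFoundError:
            continue  # missing files are not reported by this sketch
        if is_too_open(mode):
            loose.append((path, oct(stat.S_IMODE(mode))))
    return loose


def explicit_settings(config_text):
    """Collect uncommented 'Keyword value' lines; the first occurrence wins."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def differing_settings(config_text, wanted=WANTED):
    """Return wanted keywords whose explicit value differs.

    A value of None means the keyword is not set explicitly, so sshd's
    built-in default applies; that is not necessarily an error.
    """
    found = explicit_settings(config_text)
    return {k: found.get(k) for k, v in wanted.items() if found.get(k) != v}


# Example, to be run as root inside each container:
#   print(loose_ssh_paths(os.path.expanduser("~")))
#   with open("/etc/ssh/sshd_config") as f:
#       print(differing_settings(f.read()))
```

The conventional fix for anything loose_ssh_paths reports is chmod 700 on .ssh and chmod 600 on authorized_keys, followed by the ssh service restart mentioned above if sshd_config was changed.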
