
Errors when running mpi programs #12520

Open
rafelamer opened this issue May 3, 2024 · 5 comments

@rafelamer

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

The installed Open MPI version is 5.0.2.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

It was installed on Fedora 40 hosts with the command

shell$ dnf install openmpi openmpi-devel

I don't know if it is relevant, but in Fedora 40 the Open MPI library is linked against libfabric.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
    Fedora 40
  • Computer hardware:
    Shared vCPU on hetzner.com
  • Network type:
    All the nodes have lo and eth0 interfaces

Details of the problem

I cannot run an MPI program on a 3-node cluster with IP addresses 195.201.223.246, 162.55.213.49, and 88.198.157.233. When I run

shell$ mpirun -np 16 --hostfile ~/hosts ./mpi02

I get errors of the form

mce-eseiaat.com:rank0:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=45575 UDP=54793) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank3:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=37449 UDP=41668) on a different subnet 162.55.213.49/32
mce-eseiaat.com:rank6:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=54837 UDP=47899) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank9:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=45297 UDP=39015) on a different subnet 162.55.213.49/32
mce-eseiaat.com:rank12:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=88.198.157.233/32 TCP=35391 UDP=33260) on a different subnet 88.198.157.233/32
mce-eseiaat.com:rank15:  Trying to connect from eth0 port 1 (subnet 195.201.223.246/32) to a node (IP=162.55.213.49/32 TCP=51183 UDP=52527) on a different subnet 162.55.213.49/32
worker2.mce-eseiaat.com:rank2:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=44503 UDP=49948) on a different subnet 195.201.223.246/32
worker2.mce-eseiaat.com:rank11:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=88.198.157.233/32 TCP=51443 UDP=36464) on a different subnet 88.198.157.233/32
worker2.mce-eseiaat.com:rank14:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=33317 UDP=40668) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank1:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=32781 UDP=53663) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank7:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=47383 UDP=51499) on a different subnet 195.201.223.246/32
worker2.mce-eseiaat.com:rank8:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=195.201.223.246/32 TCP=40591 UDP=38371) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank13:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=195.201.223.246/32 TCP=56965 UDP=38971) on a different subnet 195.201.223.246/32
worker1.mce-eseiaat.com:rank10:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=162.55.213.49/32 TCP=59499 UDP=38225) on a different subnet 162.55.213.49/32
worker2.mce-eseiaat.com:rank5:  Trying to connect from eth0 port 1 (subnet 162.55.213.49/32) to a node (IP=88.198.157.233/32 TCP=39797 UDP=49164) on a different subnet 88.198.157.233/32
worker1.mce-eseiaat.com:rank4:  Trying to connect from eth0 port 1 (subnet 88.198.157.233/32) to a node (IP=162.55.213.49/32 TCP=60669 UDP=38137) on a different subnet 162.55.213.49/32

The contents of the hosts file are

mce-eseiaat.com slots=8
worker1.mce-eseiaat.com slots=4
worker2.mce-eseiaat.com slots=4
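If the /32 prefixes in the error messages reflect the actual interface configuration (an assumption worth verifying), then libfabric sees every peer as being on a different subnet, which would explain the refused connections. A quick check, to be run on each node:

```shell
# Show the IPv4 address and prefix length on eth0; a /32 prefix means
# the host considers no other address to be on its local subnet:
ip -o -4 addr show dev eth0
```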

Best regards,
Rafel Amer

@wenduwan
Contributor

wenduwan commented May 3, 2024

Is mpi02 on a shared NFS volume? It would be helpful to double-check the linking:

ldd mpi02

> I don't know if it is relevant, in Fedora 40 the openmpi library is linked to libfabric

We can rule out libfabric with additional MCA parameters:

mpirun -np 16 --mca pml ob1 --mca btl tcp,self --hostfile ~/hosts ./mpi02

This prevents libfabric from being used.
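As a sketch (assuming a standard Open MPI 5.x installation), you can confirm that the ob1 and tcp components are available, and make the selection persistent via Open MPI's `OMPI_MCA_*` environment variables instead of repeating the flags:

```shell
# List the PML and BTL components this Open MPI build provides;
# ob1 and tcp should appear in the output:
ompi_info | grep -E "MCA (pml|btl)"

# Equivalent to the --mca flags above; mpirun picks these up automatically:
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self
mpirun -np 16 --hostfile ~/hosts ./mpi02
```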

@rafelamer
Author

Hi,

with the command

mpirun -np 16 --mca pml ob1 --mca btl tcp,self --hostfile ~/hosts ./mpi02

it works fine, so it seems that the problem is related to libfabric.

Thanks,
Rafel Amer

@wenduwan
Contributor

wenduwan commented May 3, 2024

Thanks for checking. Just to clarify, do you intend to use libfabric at all?

I wonder how libfabric is configured on your system; we can move the discussion to the libfabric community if you'd like.

$ dnf list installed | grep libfabric
$ dnf info <libfabric package name>

@rafelamer
Author

OK, I will subscribe to the Libfabric-users mailing list and then make a post.

Best regards,
Rafel Amer

@wenduwan
Contributor

wenduwan commented May 3, 2024

The libfabric community would need more information to investigate the issue.

As a starting point, you can turn on the relevant verbose options in mpirun:

--mca btl_ofi_verbose 1 -x FI_LOG_LEVEL=info
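Assembled into a full command (the redirect to a `debug.log` file is just an illustrative choice for capturing the output), this might look like:

```shell
# Run with libfabric active (no pml/btl restriction) so the ofi BTL and
# libfabric emit their diagnostics; collect stderr for the report:
mpirun -np 16 --mca btl_ofi_verbose 1 -x FI_LOG_LEVEL=info \
       --hostfile ~/hosts ./mpi02 2> debug.log
```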
