Issues running with MPI #56
I'm also trying to build (and running into issues):
My cabana install directory:
Did I forget to build something? |
For the build issues, you just need a newer version of Cabana (0.6.1 or up-to-date master). I just opened a PR to give a clear error at configuration.
For the run, I went ahead and updated the wiki page. Exactly as you guessed, you can just add |
Thank you! A quick follow-up question (I'm not great at debugging MPI). I can confirm that I can ping the other host and can ssh into it from my launcher, but I'm getting an error. Here are details:
Here are my hosts
# cat hostlist.txt
metricset-sample-l-0-0.ms.default.svc.cluster.local
metricset-sample-w-0-0.ms.default.svc.cluster.local
Ping works to the worker (w) node
# ping metricset-sample-w-0-0.ms.default.svc.cluster.local
PING metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16) 56(84) bytes of data.
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=1 ttl=63 time=0.097 ms
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=2 ttl=63 time=0.058 ms
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=3 ttl=63 time=0.050 ms
^C
mpirun spits out this error
# mpirun --hostfile ./hostlist.txt --allow-run-as-root -N 2 ./DamBreak 0.05 2 0 0.001 1.0 50 serial
ssh: Could not resolve hostname metricset-sample-w-0-0: Temporary failure in name resolution
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
ssh to the other host works too!
root@metricset-sample-l-0-0:/opt/exaMPM/build/examples# ssh metricset-sample-w-0-0.ms.default.svc.cluster.local
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.2.0-37-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
The MPI I'm using (maybe the wrong one or version?) Thanks for the help!
mpirun (Open MPI) 4.1.2
Usage: mpirun [OPTION]... [PROGRAM]...
Start the given program using Open RTE
Thanks for your help! |
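One detail in the output above may be worth checking: the ssh error names the short host `metricset-sample-w-0-0`, while the ping and ssh tests used the fully qualified name from hostlist.txt. The following is a minimal sketch for testing whether both forms resolve from the launcher. It assumes `getent` is available (as on glibc-based images such as the Ubuntu one shown); `short_name` and `check_hosts` are hypothetical helper names, not part of ExaMPM or Open MPI.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not from ExaMPM or Open MPI).

# Print the first label of a hostname, e.g. "a.b.c" -> "a".
short_name() {
    printf '%s\n' "${1%%.*}"
}

# For each host in a hostfile, report whether the full name and the
# short name resolve on this machine. Assumes getent is installed.
check_hosts() {
    while IFS= read -r line; do
        # Skip blank lines and comment lines.
        case "$line" in ''|'#'*) continue ;; esac
        host="${line%% *}"              # drop any trailing "slots=N" field
        short="$(short_name "$host")"
        # Check the short form only when it differs from the full name.
        if [ "$short" = "$host" ]; then
            names="$host"
        else
            names="$host $short"
        fi
        for name in $names; do
            if getent hosts "$name" >/dev/null 2>&1; then
                echo "$name: resolves"
            else
                echo "$name: does NOT resolve"
            fi
        done
    done < "$1"
}

# Usage (run on the launcher node):
#   check_hosts ./hostlist.txt
```

If only the fully qualified names resolve, Open MPI's `orte_keep_fqdn_hostnames` MCA parameter (for example, `mpirun --mca orte_keep_fqdn_hostnames 1 ...`) may be worth trying. This is a suggestion to verify against `ompi_info --all` on your install, not a confirmed fix for this issue.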
Looks to be an MPI configuration issue (that I don't think I can help with), but I can at least confirm that the version of MPI is something we test against regularly |
This is probably my stopping point for working on it then - I'm not sure what the problem above is (and I'm still inexperienced with MPI). For context, I was going to add it to the metrics operator (https://github.com/converged-computing/metrics-operator) and use it for converged computing experiments on Kubernetes, but I'll skip over it and move on to the next. Thanks! |
Now that I see you have a lammps case, maybe using exactly the same MPI call as what they use would make a difference here? Just a thought |
Thanks for the suggestion! The lammps metrics container uses mpich and the one here is openmpi, so I don't think we could do that. I did try mpich too (with the same command) and got a non-working result. |
Hi! I'm new to using this app, and was wondering if you have an example for running with mpirun (or similar?) I'm looking at the docs here: https://github.com/ECP-copa/ExaMPM/wiki/Run
Thank you! And apologies if this is an overly simple question (e.g., just put mpirun in front of that :P )