MPI boundary issue #38
I have tried to reproduce those errors several times, but it always behaves correctly. I used this batch script:

```bash
#!/bin/bash
srun -n8 -S16 --exclusive --cpus-per-task=1 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest /autofs/nccs-svm1_home1/kc4/Software/build_crusher/ExaMPM/examples/DamBreak 0.01 2 3 0.00004 20.0 2500 hip
```
From Mark on Slack:
I've been using ExaMPM (DamBreak) for performance and profiling on one Crusher node and continue to see what look like communication errors leading to spurious new velocities at processor boundaries, which then lead to numerical blow-up and crashes (almost always in `g2p->scatter->packBuffer`). This only occurs for problems over about 50^3 cells and 4 or more MPI ranks. I test with:

```bash
srun -N1 -n8 -S16 --exclusive -t30:00 --cpus-per-task=1 --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest ./DamBreak 0.01 2 3 0.00004 10.0 2500 hip
```

The issue does not go away with wider halos, an evenly divisible ny, different Y boundary conditions (periodic, slip, noslip), or different versions of Cabana (0.5.0, head). It seems to be suppressed somewhat with fewer particles per cell and by using the `AMD_SANITIZE_KERNEL` and `AMD_SANITIZE_COPY` variables, but it never goes away. For a while it seemed to always happen 2 or 3 time steps after a Silo write, but it still occurs without any Silo writes. It can occur anywhere between steps 5000 and 100000.