Negative FOM when Running Large MPI Counts #12
Comments
@nmhamster are you using master? If so, I'm not sure how this output makes sense, since the overall element count and the elapsed time are the same values being printed out in the current code. The FOM is calculated from these, but it's not clear where the bug would come from. Note there was a bug related to overflow in an older version of LULESH; if that is what you are hitting, I can help you fix it if you need to use that version for some reason. If you are using master, can you give me the inputs you are running and the full output? I can try to recreate it if I can get access to a big enough resource quickly, or try to recreate it on a smaller node count with the same global problem size.
OK, I am using the 2.0.3 release. I changed the output code at the end and it corrects the FOM (but note: I didn't change the number of elements printed out earlier, which is also overflowing). Fixed output example:
Changed the
I can submit a pull request once I get these fixed.
@nmhamster these are fixed in master. Do you need a tagged release? If so, I think the easiest solution would be for me to tag a new release (there are no open issues) and for you to move to it. I'm open to other ideas, but it's not worth your time fixing this.
@nmhamster after some thought, I think a new tagged release is overdue. I need to do a bit of performance testing to confirm there is no significant regression, but otherwise the code should be fine.
We are scaling LULESH to large node counts (around 2000 nodes) with 8 MPI ranks per node and a problem size of 90. The FOMs reported come out negative, which shouldn't happen.