Weird nonlinear convergence in parallel #284

Open
1 task done
mrp089 opened this issue Oct 4, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@mrp089
Member

mrp089 commented Oct 4, 2024

Description

While upgrading macOS (#279), we discovered that the Newtonian fluid test fails on three procs but passes on one and four. These are the test outputs:

test_newtonian[1]
 NS 1-1  1.390e-01  [0 1.000e+00 1.000e+00 9.108e-13]  [114 -277 19]
 NS 1-2  2.670e-01  [-82 7.699e-05 7.699e-05 2.604e-09]  [89 -198 13]
 NS 1-3  3.870e-01  [-176 1.458e-09 1.458e-09 1.312e-04]  [69 -89 9]
 NS 1-4s 5.050e-01  [-254 1.945e-13 1.945e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged

test_newtonian[3]
 NS 1-1  7.600e-02  [0 1.000e+00 1.000e+00 8.948e-13]  [114 -277 17]
 NS 1-2  1.760e-01  [-82 7.699e-05 7.699e-05 2.886e-09]  [201 -197 37]
 NS 1-3  2.610e-01  [-68 3.774e-04 3.774e-04 6.289e-10]  [155 -212 27]
 NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]

test_newtonian[4]
 NS 1-1  5.900e-02  [0 1.000e+00 1.000e+00 9.066e-13]  [114 -277 19]
 NS 1-2  1.140e-01  [-82 7.698e-05 7.698e-05 2.893e-09]  [87 -197 13]
 NS 1-3  1.660e-01  [-176 1.444e-09 1.444e-09 1.292e-04]  [69 -90 10]
 NS 1-4s 2.190e-01  [-254 1.912e-13 1.912e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged

One and four procs reach the nonlinear tolerance of 1e-11 within the 4 Newton iterations specified when setting up the test (as documented in "Create a new test"). For three procs, the residuals (Ri/R1) of the first two Newton iterations, 1-1 and 1-2, match those of the one- and four-proc runs.

However, the three-proc case shows much worse nonlinear convergence for Newton iterations 1-3 and 1-4, ending at Ri/R1 of about 1e-6 instead of about 2e-13. The linear convergence is identical for the first Newton iteration, where the initial solution is identical.
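To make the comparison above mechanical, here is a minimal sketch that parses the outputs and reports the first Newton iteration where two runs disagree. The regex is an assumption based only on the lines pasted in this issue, and `residuals`, `first_divergence`, and the 5% threshold are hypothetical names and values, not part of the solver or the test suite.

```python
import re

# Assumed layout of one iteration line, copied from the outputs in this issue:
#  NS 1-3  3.870e-01  [-176 1.458e-09 1.458e-09 1.312e-04]  [69 -89 9]
LINE = re.compile(
    r"^\s*\w+\s+(?P<step>\d+)-(?P<it>\d+)s?\s+\S+\s+"
    r"\[\s*-?\d+\s+(?P<ri_r1>\S+)\s"
)

def residuals(log):
    """Map (time step, Newton iteration) -> Ri/R1 for every line that parses."""
    out = {}
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            out[(int(m["step"]), int(m["it"]))] = float(m["ri_r1"])
    return out

def first_divergence(a, b, rel_tol=0.05):
    """First (step, iteration) where Ri/R1 differs by more than rel_tol (hypothetical threshold)."""
    for key in sorted(a.keys() & b.keys()):
        if abs(a[key] - b[key]) > rel_tol * max(abs(a[key]), abs(b[key])):
            return key
    return None

ONE_PROC = """
 NS 1-1  1.390e-01  [0 1.000e+00 1.000e+00 9.108e-13]  [114 -277 19]
 NS 1-2  2.670e-01  [-82 7.699e-05 7.699e-05 2.604e-09]  [89 -198 13]
 NS 1-3  3.870e-01  [-176 1.458e-09 1.458e-09 1.312e-04]  [69 -89 9]
 NS 1-4s 5.050e-01  [-254 1.945e-13 1.945e-13 1.000e+00]  !0 0 0!
"""
THREE_PROCS = """
 NS 1-1  7.600e-02  [0 1.000e+00 1.000e+00 8.948e-13]  [114 -277 17]
 NS 1-2  1.760e-01  [-82 7.699e-05 7.699e-05 2.886e-09]  [201 -197 37]
 NS 1-3  2.610e-01  [-68 3.774e-04 3.774e-04 6.289e-10]  [155 -212 27]
 NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]
"""
FOUR_PROCS = """
 NS 1-1  5.900e-02  [0 1.000e+00 1.000e+00 9.066e-13]  [114 -277 19]
 NS 1-2  1.140e-01  [-82 7.698e-05 7.698e-05 2.893e-09]  [87 -197 13]
 NS 1-3  1.660e-01  [-176 1.444e-09 1.444e-09 1.292e-04]  [69 -90 10]
 NS 1-4s 2.190e-01  [-254 1.912e-13 1.912e-13 1.000e+00]  !0 0 0!
"""

print(first_divergence(residuals(ONE_PROC), residuals(THREE_PROCS)))
print(first_divergence(residuals(ONE_PROC), residuals(FOUR_PROCS)))
```

With the data above, one vs. three procs first disagree at iteration 1-3, while one vs. four procs stay within the threshold for all four iterations.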

Reproduction

In many test executions, a result with Ri/R1 ~ 1e-6 was apparently still good enough to pass until this macos-latest pipeline came along. For example, this pipeline (and potentially others) has the convergence problem but still passed the integration test.

Expected behavior

Convergence behavior should be identical on one, three, and four procs, except for the minor differences visible between one and four.

Additional context

I suspect this could be a parallel bug in the linearization (since the result still passes the integration test with bad convergence). It shouldn't be the linear solver since the first Newton iteration (where the solution is still identical) has identical linear performance.

A first debugging step could be looking through recent pipelines and seeing if this problem appears consistently. Then, depending on what past pipelines show, it would make sense to isolate it by physics type and other properties.
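As a rough sketch of that first debugging step, the snippet below flags time steps whose last Newton iteration is still above the nonlinear tolerance. It assumes the pipeline logs have already been downloaded as text; `unconverged_steps`, the regex, and the `logs` dictionary are hypothetical and only reflect the line format shown in this issue.

```python
import re

# Assumed line format (from the outputs in this issue); captures step, iteration, Ri/R1.
ITER = re.compile(r"^\s*\w+\s+(\d+)-(\d+)s?\s+\S+\s+\[\s*-?\d+\s+(\S+)\s")

def unconverged_steps(log, tol=1e-11):
    """Return {time step: final Ri/R1} for steps whose last iteration is above tol."""
    last = {}
    for line in log.splitlines():
        m = ITER.match(line)
        if m:
            # Later iterations of the same step overwrite earlier ones.
            last[int(m.group(1))] = float(m.group(3))
    return {step: r for step, r in last.items() if r > tol}

# Stand-in for a collection of pipeline logs; the labels are made up for illustration.
logs = {
    "3 procs": """
 NS 1-3  2.610e-01  [-68 3.774e-04 3.774e-04 6.289e-10]  [155 -212 27]
 NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]
""",
    "4 procs": """
 NS 1-3  1.660e-01  [-176 1.444e-09 1.444e-09 1.292e-04]  [69 -90 10]
 NS 1-4s 2.190e-01  [-254 1.912e-13 1.912e-13 1.000e+00]  !0 0 0!
""",
}

flagged = {name: unconverged_steps(text) for name, text in logs.items()}
for name, steps in flagged.items():
    print(name, steps or "ok")
```

Running this over many past logs would show whether the slow convergence appears consistently or only for certain proc counts.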

Code of Conduct

  • I agree to follow this project's Code of Conduct and Contributing Guidelines
@mrp089 mrp089 added the bug Something isn't working label Oct 4, 2024
@mrp089
Member Author

mrp089 commented Oct 4, 2024

I'll assign @aabrown100-git since he was interested in looking into this (thank you!!). I'm happy to have a conversation here and figure out what may be causing this!

@aabrown100-git
Collaborator

My working theory was some kind of undefined behavior due to an uninitialized tangent contribution. However, I looked through fluid.cpp and couldn't find any evidence of that. Moreover, I ran the Newtonian test case multiple times for 100 timesteps with 1 proc (and did the same with 3 procs), and the convergence history was identical between subsequent runs. If there were some kind of uninitialized array issue, I would expect the convergence history to change each time.

I'm not sure it's a parallel bug, because I saw poor convergence with 1 proc in the 3rd timestep:

---------------------------------------------------------------------
 Eq     N-i     T       dB  Ri/R1   Ri/R0    R/Ri     lsIt   dB  %t
---------------------------------------------------------------------
 NS 1-1  4.800e-02  [0 1.000e+00 1.000e+00 7.438e-13]  [115 -279 56]
 NS 1-2  8.600e-02  [-82 7.702e-05 7.702e-05 3.112e-09]  [87 -196 47]
 NS 1-3  1.210e-01  [-176 1.462e-09 1.462e-09 1.309e-04]  [69 -89 29]
 NS 1-4s 1.500e-01  [-254 1.978e-13 1.978e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged
 NS 2-1  2.210e-01  [0 1.000e+00 1.307e+01 7.156e-13]  [113 -280 34]
 NS 2-2  3.520e-01  [-99 1.056e-05 1.380e-04 1.488e-09]  [256 -37 83]
 NS 2-3  3.830e-01  [-194 1.805e-10 2.359e-09 1.000e-04]  [67 -92 35]
 NS 2-4s 4.130e-01  [-274 1.839e-14 2.404e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged
 NS 3-1  5.130e-01  [0 1.000e+00 9.815e+00 9.743e-13]  [171 -277 56]
 NS 3-2  6.420e-01  [-39 1.051e-02 1.032e-01 2.247e-12]  [227 -268 81]
 NS 3-3  6.970e-01  [-17 1.261e-01 1.238e+00 8.131e-13]  [105 -278 62]
 NS 3-4s 7.690e-01  [-107 4.132e-06 4.056e-05 6.220e-09]  [147 -189 61]
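
One way to spot steps like the third one automatically is to look for Newton iterations where Ri/R1 grows instead of shrinking. The sketch below does that for the log above; the regex and `residual_increases` are hypothetical helpers written against this output format only, not an existing tool.

```python
import re

# Assumed line format (from the log above); captures step, iteration, Ri/R1.
ITER = re.compile(r"^\s*\w+\s+(\d+)-(\d+)s?\s+\S+\s+\[\s*-?\d+\s+(\S+)\s")

def residual_increases(log):
    """List (step, iteration, previous Ri/R1, current Ri/R1) where the residual grew within a step."""
    hits = []
    prev = None  # (step, Ri/R1) of the previously parsed line
    for line in log.splitlines():
        m = ITER.match(line)
        if not m:
            continue  # header and separator lines do not match
        step, it, r = int(m.group(1)), int(m.group(2)), float(m.group(3))
        if prev is not None and prev[0] == step and r > prev[1]:
            hits.append((step, it, prev[1], r))
        prev = (step, r)
    return hits

LOG = """
 NS 1-1  4.800e-02  [0 1.000e+00 1.000e+00 7.438e-13]  [115 -279 56]
 NS 1-2  8.600e-02  [-82 7.702e-05 7.702e-05 3.112e-09]  [87 -196 47]
 NS 1-3  1.210e-01  [-176 1.462e-09 1.462e-09 1.309e-04]  [69 -89 29]
 NS 1-4s 1.500e-01  [-254 1.978e-13 1.978e-13 1.000e+00]  !0 0 0!
 NS 2-1  2.210e-01  [0 1.000e+00 1.307e+01 7.156e-13]  [113 -280 34]
 NS 2-2  3.520e-01  [-99 1.056e-05 1.380e-04 1.488e-09]  [256 -37 83]
 NS 2-3  3.830e-01  [-194 1.805e-10 2.359e-09 1.000e-04]  [67 -92 35]
 NS 2-4s 4.130e-01  [-274 1.839e-14 2.404e-13 1.000e+00]  !0 0 0!
 NS 3-1  5.130e-01  [0 1.000e+00 9.815e+00 9.743e-13]  [171 -277 56]
 NS 3-2  6.420e-01  [-39 1.051e-02 1.032e-01 2.247e-12]  [227 -268 81]
 NS 3-3  6.970e-01  [-17 1.261e-01 1.238e+00 8.131e-13]  [105 -278 62]
 NS 3-4s 7.690e-01  [-107 4.132e-06 4.056e-05 6.220e-09]  [147 -189 61]
"""

for step, it, before, after in residual_increases(LOG):
    print(f"step {step}, iteration {it}: Ri/R1 rose from {before:.3e} to {after:.3e}")
```

For this log, the only increase is in timestep 3, between iterations 3-2 and 3-3.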
