Weird nonlinear convergence in parallel #284

mrp089 · 2024-10-04T12:26:20Z

Description

While upgrading macOS (#279), we discovered that the Newtonian fluid test failed on three procs but passed on one and four. These are the test outputs:

test_newtonian[1]
 NS 1-1  1.390e-01  [0 1.000e+00 1.000e+00 9.108e-13]  [114 -277 19]
 NS 1-2  2.670e-01  [-82 7.699e-05 7.699e-05 2.604e-09]  [89 -198 13]
 NS 1-3  3.870e-01  [-176 1.458e-09 1.458e-09 1.312e-04]  [69 -89 9]
 NS 1-4s 5.050e-01  [-254 1.945e-13 1.945e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged

test_newtonian[3]
 NS 1-1  7.600e-02  [0 1.000e+00 1.000e+00 8.948e-13]  [114 -277 17]
 NS 1-2  1.760e-01  [-82 7.699e-05 7.699e-05 2.886e-09]  [201 -197 37]
 NS 1-3  2.610e-01  [-68 3.774e-04 3.774e-04 6.289e-10]  [155 -212 27]
 NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]

test_newtonian[4]
 NS 1-1  5.900e-02  [0 1.000e+00 1.000e+00 9.066e-13]  [114 -277 19]
 NS 1-2  1.140e-01  [-82 7.698e-05 7.698e-05 2.893e-09]  [87 -197 13]
 NS 1-3  1.660e-01  [-176 1.444e-09 1.444e-09 1.292e-04]  [69 -90 10]
 NS 1-4s 2.190e-01  [-254 1.912e-13 1.912e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged

One and four procs reach the nonlinear tolerance of 1e-11 within the 4 Newton iterations specified when setting up the test (as documented in "Create a new test"). For three procs, it looks like the first two Newton iterations, 1-1 and 1-2, match those on one and four procs.

However, the three-procs case exhibits much worse nonlinear convergence for steps 1-3 and 1-4. The linear convergence is identical for the first Newton iteration, where the initial solution is identical.

Reproduction

It looks like in many test executions, the result with Ri/R1 ~ 1e-6 was still good enough to pass until this macos-latest pipeline came along. For example, this pipeline (and potentially others) has the convergence problem but still passed the integration test.

Expected behavior

Convergence behavior should be identical on one, three, and four procs, except for the minor differences visible between one and four.

Additional context

I suspect this could be a parallel bug in the linearization (since the result still passes the integration test with bad convergence). It shouldn't be the linear solver since the first Newton iteration (where the solution is still identical) has identical linear performance.

A first debugging step could be looking through recent pipelines and seeing if this problem appears consistently. Then, depending on what past pipelines show, it would make sense to isolate it by physics type and other properties.
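A minimal sketch of how past pipeline logs could be scanned for this problem. It assumes the log lines follow the format shown in the test outputs above, and that Ri/R1 (the second value in the first bracket) is the quantity compared against the 1e-11 tolerance. The function name `unconverged_steps` and the way the log text is obtained are hypothetical.

```python
import re

# One Newton iteration line, e.g.
#  NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]
ROW = re.compile(
    r"^\s*(?P<eq>\w+)\s+(?P<step>\d+)-(?P<it>\d+)s?\s+\S+\s+"
    r"\[\s*(?P<db>-?\d+)\s+(?P<ri_r1>\S+)\s+(?P<ri_r0>\S+)\s+(?P<r_ri>[^\s\]]+)\s*\]"
)

def unconverged_steps(log_text, tol=1e-11):
    """Return {time step: final Ri/R1} for steps whose last iteration is above tol."""
    last = {}
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m:
            # Later iterations of the same time step overwrite earlier ones.
            last[int(m["step"])] = float(m["ri_r1"])
    return {step: r for step, r in last.items() if r > tol}

# Three-proc output copied from the description above.
log_3_procs = """
 NS 1-1  7.600e-02  [0 1.000e+00 1.000e+00 8.948e-13]  [114 -277 17]
 NS 1-2  1.760e-01  [-82 7.699e-05 7.699e-05 2.886e-09]  [201 -197 37]
 NS 1-3  2.610e-01  [-68 3.774e-04 3.774e-04 6.289e-10]  [155 -212 27]
 NS 1-4s 3.370e-01  [-119 1.037e-06 1.037e-06 1.886e-07]  [80 -155 11]
"""
print(unconverged_steps(log_3_procs))
```

Running this over the logs of several recent pipelines would show whether the poor convergence appears consistently, independent of whether the integration test passed.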

Code of Conduct

I agree to follow this project's Code of Conduct and Contributing Guidelines


mrp089 · 2024-10-04T12:27:12Z

I'll assign @aabrown100-git since he was interested in looking into this (thank you!!). I'm happy to have a conversation here and figure out what may be causing this!

aabrown100-git · 2024-10-04T17:15:27Z

My working theory was some kind of undefined behavior due to a tangent contribution being uninitialized. However, I looked through fluid.cpp and couldn't find any evidence of that. Moreover, I ran the Newtonian test case multiple times for 100 timesteps with 1-proc (also did the same with 3-procs), and the convergence history was identical between subsequent runs. If there were some kind of uninitialized array issue, I would expect the convergence history to change each time.
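The run-to-run comparison described above could be automated roughly as follows. This is only a sketch: it assumes the log format shown in this thread, compares only the residual columns in the first bracket (wall time and the %t column vary between runs for unrelated reasons), and the helper names are hypothetical.

```python
import re

# Captures the equation name, the "step-iteration" tag, and the first bracket.
ITER_RE = re.compile(r"^\s*(\w+)\s+(\d+-\d+)s?\s+\S+\s+\[([^\]]*)\]")

def residual_history(log_text):
    """Map (equation, 'step-iter') -> tuple of residual-column strings."""
    hist = {}
    for line in log_text.splitlines():
        m = ITER_RE.match(line)
        if m:
            hist[(m.group(1), m.group(2))] = tuple(m.group(3).split())
    return hist

def first_difference(log_a, log_b):
    """Return the first iteration whose residual columns differ, or None."""
    a, b = residual_history(log_a), residual_history(log_b)
    keys = list(a) + [k for k in b if k not in a]  # preserves log order
    for key in keys:
        if a.get(key) != b.get(key):
            return key
    return None
```

If `first_difference` returns `None` for repeated runs on the same number of procs, the convergence history is reproducible, which argues against an uninitialized-memory explanation.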

I'm not sure it's a parallel bug, because I saw poor convergence with 1-proc in the 3rd timestep:

---------------------------------------------------------------------
 Eq     N-i     T       dB  Ri/R1   Ri/R0    R/Ri     lsIt   dB  %t
---------------------------------------------------------------------
 NS 1-1  4.800e-02  [0 1.000e+00 1.000e+00 7.438e-13]  [115 -279 56]
 NS 1-2  8.600e-02  [-82 7.702e-05 7.702e-05 3.112e-09]  [87 -196 47]
 NS 1-3  1.210e-01  [-176 1.462e-09 1.462e-09 1.309e-04]  [69 -89 29]
 NS 1-4s 1.500e-01  [-254 1.978e-13 1.978e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged
 NS 2-1  2.210e-01  [0 1.000e+00 1.307e+01 7.156e-13]  [113 -280 34]
 NS 2-2  3.520e-01  [-99 1.056e-05 1.380e-04 1.488e-09]  [256 -37 83]
 NS 2-3  3.830e-01  [-194 1.805e-10 2.359e-09 1.000e-04]  [67 -92 35]
 NS 2-4s 4.130e-01  [-274 1.839e-14 2.404e-13 1.000e+00]  !0 0 0!  WARNING: The linear system solution has not converged
 NS 3-1  5.130e-01  [0 1.000e+00 9.815e+00 9.743e-13]  [171 -277 56]
 NS 3-2  6.420e-01  [-39 1.051e-02 1.032e-01 2.247e-12]  [227 -268 81]
 NS 3-3  6.970e-01  [-17 1.261e-01 1.238e+00 8.131e-13]  [105 -278 62]
 NS 3-4s 7.690e-01  [-107 4.132e-06 4.056e-05 6.220e-09]  [147 -189 61]
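To make the pattern in these logs easier to spot, here is a small sketch that computes the residual reduction factor between consecutive Newton iterations and reports iterations where the residual grew. The Ri/R1 values are copied from the outputs in this thread; the function names are hypothetical.

```python
def reduction_factors(residuals):
    """Ratio of each residual to the previous one (below 1 means progress)."""
    return [b / a for a, b in zip(residuals, residuals[1:])]

def stalled_iterations(residuals):
    """1-based Newton iteration numbers whose residual did not decrease."""
    return [i + 2 for i, f in enumerate(reduction_factors(residuals)) if f >= 1.0]

# Ri/R1 per Newton iteration, taken from the logs in this issue.
histories = {
    "1 proc, step 1": [1.000e+00, 7.699e-05, 1.458e-09, 1.945e-13],
    "3 procs, step 1": [1.000e+00, 7.699e-05, 3.774e-04, 1.037e-06],
    "1 proc, step 3": [1.000e+00, 1.051e-02, 1.261e-01, 4.132e-06],
}

for label, ri_r1 in histories.items():
    print(label, stalled_iterations(ri_r1))
```

In both poorly converging cases the residual increases at the third Newton iteration, on three procs in step 1 and on one proc in step 3.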

mrp089 added the bug label Oct 4, 2024

mrp089 mentioned this issue Oct 4, 2024

Upgrading macOS version in the test_macos.yml (addresses #278) #279

Merged

1 task

mrp089 assigned aabrown100-git Oct 4, 2024
