Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(common): adds retry mechanism for build script npm ci calls #11451

Merged
merged 9 commits into from
Jun 6, 2024

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented May 15, 2024

Fixes #10350.

I've tested the new reattempt_if_failing utility function on its own with the following two calls:

  • reattempt_if_failing echo "this should pass"
    • passes immediately, no retries
  • reattempt_if_failing cd "not-a-directory"
    • fails every time (cd: invalid path) as intended, properly delays between attempts

@keymanapp-test-bot skip

@jahorton jahorton requested a review from mcdurdin as a code owner May 15, 2024 05:14
@keymanapp-test-bot keymanapp-test-bot bot added this to the A18S2 milestone May 15, 2024
Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. Can we add function documentation at the front, similar to that found in builder.inc.sh, e.g.

#
# Returns `0` if first parameter is in the array passed as second parameter,
# where the array may contain globs.
#
# ### Parameters
#
# * 1: `item` item to search for in array
# * 2: `array` bash array, e.g. `array=(one two three)`
#
# ### Example
#
# ```bash
# array=(foo bar it*)
# if _builder_item_in_glob_array "item" "${array[@]}"; then ...; fi
# ```
#
_builder_item_in_glob_array() {

resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
@jahorton jahorton requested a review from mcdurdin May 15, 2024 07:28
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
resources/shellHelperFunctions.sh Outdated Show resolved Hide resolved
@mcdurdin
Copy link
Member

Just discovered https://docs.npmjs.com/cli/v6/using-npm/config#fetch-retries and its friends:

fetch-retries
Default: 2
Type: Number
The "retries" config for the retry module to use when fetching packages from the registry.

fetch-retry-factor
Default: 10
Type: Number
The "factor" config for the retry module to use when fetching packages.

fetch-retry-mintimeout
Default: 10000 (10 seconds)
Type: Number
The "minTimeout" config for the retry module to use when fetching packages.

fetch-retry-maxtimeout
Default: 60000 (1 minute)
Type: Number
The "maxTimeout" config for the retry module to use when fetching packages.

maxsockets
Default: 50
Type: Number
The maximum number of connections to use per origin (protocol/host/port combination). Passed to the http Agent used to make the request.


We might wish to consider these because they'll be much cleaner than retrying the whole npm ci?

@mcdurdin
Copy link
Member

Also noting that the problem is not resolved with this PR, see https://build.palaso.org/buildConfiguration/Keyman_Common_KPAPI_TestPullRequests_macOS/463600

17:42:47   [common/tools/hextobin] ## configure starting...
17:42:48   npm WARN EBADENGINE Unsupported engine {
17:42:48   npm WARN EBADENGINE   package: undefined,
17:42:48   npm WARN EBADENGINE   required: { node: '^18.x' },
17:42:48   npm WARN EBADENGINE   current: { node: 'v21.7.3', npm: '10.5.1' }
17:42:48   npm WARN EBADENGINE }
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-node-xml2js.git
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
17:42:49   npm WARN deprecated @npmcli/[email protected]: This functionality has been moved to @npmcli/fs
17:43:17   npm ERR! code 1
17:43:17   npm ERR! git dep preparation failed
17:43:17   npm ERR! command /opt/homebrew/Cellar/node/21.7.3/bin/node /opt/homebrew/lib/node_modules/npm/bin/npm-cli.js install --force --cache=/Users/marc_durdin/.npm --prefer-offline=false --prefer-online=false --offline=false --no-progress --no-save --no-audit --include=dev --include=peer --include=optional --no-package-lock-only --no-dry-run
17:43:17   npm ERR! npm WARN using --force Recommended protections disabled.
17:43:17   npm ERR! npm ERR! code ECONNRESET
17:43:17   npm ERR! npm ERR! errno ECONNRESET
17:43:17   npm ERR! npm ERR! network Invalid response body while trying to fetch https://registry.npmjs.org/@parcel%2fgraph: aborted
17:43:17   npm ERR! npm ERR! network This is a problem related to network connectivity.
17:43:17   npm ERR! npm ERR! network In most cases you are behind a proxy or have bad network settings.
17:43:17   npm ERR! npm ERR! network
17:43:17   npm ERR! npm ERR! network If you are behind a proxy, please make sure that the
17:43:17   npm ERR! npm ERR! network 'proxy' config is set properly.  See: 'npm help config'
17:43:17   npm ERR!
17:43:17   npm ERR! npm ERR! A complete log of this run can be found in: /Users/marc_durdin/.npm/_logs/2024-05-16T10_42_51_434Z-debug-0.log
17:43:17   
17:43:17   npm ERR! A complete log of this run can be found in: /Users/marc_durdin/.npm/_logs/2024-05-16T10_42_47_808Z-debug-0.log
17:43:17   [common/tools/hextobin] Expected output: 'node_modules'.

That last line is suspicious ... is the retry code working correctly?

@mcdurdin mcdurdin modified the milestones: A18S2, A18S3 May 24, 2024
Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my previous comment -- I don't think this is working correctly

@jahorton
Copy link
Contributor Author

jahorton commented May 29, 2024

Just discovered https://docs.npmjs.com/cli/v6/using-npm/config#fetch-retries and its friends:

[...]

Following up on the quoted comment... viewing its link shows the following:

The "retries" config for the retry module to use when fetching packages from the registry.

Based on that, I believe that those settings affect npm's use of the retry package as seen at https://www.npmjs.com/package/retry.

factor: The exponential factor to use. Default is 2.

[...]

The formula used to calculate the individual timeouts is:

Math.min(random * minTimeout * Math.pow(factor, attempt), maxTimeout)

@jahorton
Copy link
Contributor Author

jahorton commented May 29, 2024

Inspecting that build log, I don't think npm's retry module even kicked in. That, or there are custom defaults set on the agent that provide less retries than npm's default.

17:42:47   [common/tools/hextobin] ## configure starting...
17:42:48   npm WARN EBADENGINE Unsupported engine {
17:42:48   npm WARN EBADENGINE   package: undefined,
17:42:48   npm WARN EBADENGINE   required: { node: '^18.x' },
17:42:48   npm WARN EBADENGINE   current: { node: 'v21.7.3', npm: '10.5.1' }
17:42:48   npm WARN EBADENGINE }
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-node-xml2js.git
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
17:42:48   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
17:42:49   npm WARN deprecated @npmcli/[email protected]: This functionality has been moved to @npmcli/fs
17:43:17   npm ERR! code 1
17:43:17   npm ERR! git dep preparation failed
17:43:17   npm ERR! command /opt/homebrew/Cellar/node/21.7.3/bin/node /opt/homebrew/lib/node_modules/npm/bin/npm-cli.js install --force --cache=/Users/marc_durdin/.npm --prefer-offline=false --prefer-online=false --offline=false --no-progress --no-save --no-audit --include=dev --include=peer --include=optional --no-package-lock-only --no-dry-run
17:43:17   npm ERR! npm WARN using --force Recommended protections disabled.
17:43:17   npm ERR! npm ERR! code ECONNRESET
17:43:17   npm ERR! npm ERR! errno ECONNRESET
17:43:17   npm ERR! npm ERR! network Invalid response body while trying to fetch https://registry.npmjs.org/@parcel%2fgraph: aborted
[...]

Note the elapsed time: approximately 30 seconds in total.

The retry parameterization quoted above raises questions.

  • Default retry-factor of 2: asserts that after an initial failure, two retry attempts were made.
  • fetch-retry-factor of 10: asserts that the second retry's timeouts should be multiplied by 10.
  • fetch-retry-mintimeout of 10000 (10 seconds)

Shouldn't there be at least 110 seconds between initial npm call and the final, reported failure? mintimeout x factor seem to suggest this. retry's documentation says that the "min" and "max" timeouts are supposed to be a delay between tries, rather than limits for the retry's elapsed time. So... why did it it take only about 30 seconds to fail? Is there a custom .npmrc already on the build agents that's set more aggressively?

@mcdurdin
Copy link
Member

Is there a custom .npmrc already on the build agents that's set more aggressively?

No.

@mcdurdin
Copy link
Member

mcdurdin commented May 30, 2024

https://build.palaso.org/buildConfiguration/Keyman_Test_Common_Windows/466721 is another example where npm retry may help -- in this case, not network related, but local fs related (possibly security software?):

13:39:39   npm WARN cleanup     [Error: EPERM: operation not permitted, rmdir 'C:\BuildAgent\work\99b311828f4ee7c\keyman\node_modules\@microsoft\tsdoc-config\lib\__tests__'] {
13:39:39   npm WARN cleanup       errno: -4048,
13:39:39   npm WARN cleanup       code: 'EPERM',
13:39:39   npm WARN cleanup       syscall: 'rmdir',
13:39:39   npm WARN cleanup       path: 'C:\\BuildAgent\\work\\99b311828f4ee7c\\keyman\\node_modules\\@microsoft\\tsdoc-config\\lib\\__tests__'
13:39:39   npm WARN cleanup     }
13:39:39   npm WARN cleanup   ]
13:39:39   npm WARN cleanup ]
13:39:39   npm ERR! code 1
13:39:39   npm ERR! git dep preparation failed
13:39:39   npm ERR! command C:\Program Files\nodejs\node.exe C:\Users\bob\global_node\node_modules\npm\bin\npm-cli.js install --force --cache=C:\Users\bob\AppData\Local\npm-cache --prefer-offline=false --prefer-online=false --offline=false --no-progress --no-save --no-audit --include=dev --include=peer --include=optional --no-package-lock-only --no-dry-run
13:39:39   npm ERR! npm WARN using --force Recommended protections disabled.
13:39:39   npm ERR! npm ERR! code EBUSY
13:39:39   npm ERR! npm ERR! syscall open
13:39:39   npm ERR! npm ERR! path C:\Users\bob\AppData\Local\npm-cache\_cacache\index-v5\77\5e\7e30208cca13688147d134de1109e20935222521f11e1ceef8001daae2ed
13:39:39   npm ERR! npm ERR! errno EBUSY
13:39:39   npm ERR! npm ERR! Invalid response body while trying to fetch https://registry.npmjs.org/ansi-colors: EBUSY: resource busy or locked, open 'C:\Users\bob\AppData\Local\npm-cache\_cacache\index-v5\77\5e\7e30208cca13688147d134de1109e20935222521f11e1ceef8001daae2ed'
13:39:39   npm ERR!
13:39:39   npm ERR! npm ERR! A complete log of this run can be found in: C:\Users\bob\AppData\Local\npm-cache\_logs\2024-05-29T06_36_26_705Z-debug-0.log

and https://build.palaso.org/buildConfiguration/Keyman_Developer_Test/466684 also

@jahorton
Copy link
Contributor Author

jahorton commented May 31, 2024

Trying it out locally with a temporary script...

function inner_test ( ) {
  npm install @keymanapp/totally-not-a-package-that-is-distributed-so-it-should-make-an-error
}

try_multiple_times inner_test

I get the following if I disconnect from the internet mid-install:

$ ./temp.sh
npm ERR! code E404
npm ERR! 404 Not Found - GET https://registry.npmjs.org/@keymanapp%2ftotally-not-a-package-that-is-distributed-so-it-should-make-an-error - Not found
npm ERR! 404
npm ERR! 404  '@keymanapp/totally-not-a-package-that-is-distributed-so-it-should-make-an-error@*' is not in this registry.
npm ERR! 404
npm ERR! 404 Note that you can also install from a
npm ERR! 404 tarball, folder, http url, or git url.

npm ERR! A complete log of this run can be found in: C:\Users\User\AppData\Local\npm-cache\_logs\2024-05-31T07_17_09_049Z-debug-0.log
Delaying 39 seconds before attempt 2: `inner_test`
npm ERR! code ENOTFOUND
npm ERR! syscall getaddrinfo
npm ERR! errno ENOTFOUND
npm ERR! network request to https://registry.npmjs.org/@keymanapp%2ftotally-not-a-package-that-is-distributed-so-it-should-make-an-error failed, reason: getaddrinfo ENOTFOUND registry.npmjs.org
npm ERR! network This is a problem related to network connectivity.
npm ERR! network In most cases you are behind a proxy or have bad network settings.
npm ERR! network
npm ERR! network If you are behind a proxy, please make sure that the
npm ERR! network 'proxy' config is set properly.  See: 'npm help config'

npm ERR! A complete log of this run can be found in: C:\Users\User\AppData\Local\npm-cache\_logs\2024-05-31T07_17_50_553Z-debug-0.log
Delaying 77 seconds before attempt 3: `inner_test`
[...]

It's not an ECONNRESET, but we are getting a similar message. Of course, disconnecting from the internet would be "bad network settings".

Also of note: the retry script totally worked with npm's error outputs in this case - on both my Windows and macOS machines.

@jahorton
Copy link
Contributor Author

Ah, inserting the invalid npm install command in place of the npm ci call did the trick - it appears that the surrounding pushd/popd were affecting things. Come to think of it, if the npm ci call failed, it wouldn't have gotten to do the popd afterward.

@jahorton
Copy link
Contributor Author

jahorton commented May 31, 2024

Well, the good news is... the retry mechanism definitely works now.

https://build.palaso.org/buildConfiguration/Keyman_Test_Common_Web/467970?buildTab=log&focusLine=170&logView=linear&linesState=102

Might need a spot of cleanup after failed attempts, though.

15:12:49   [common/web/keyman-version] Delaying 52 seconds before attempt 2: `npm ci`
15:13:43   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-node-xml2js.git
15:13:43   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
15:13:43   npm WARN skipping integrity check for git dependency ssh://[email protected]/keymanapp/dependency-restructure.git
15:13:50   npm WARN deprecated @npmcli/[email protected]: This functionality has been moved to @npmcli/fs
15:14:54   [common/web/keyman-version] Delaying 55 seconds before attempt 3: `npm ci`
15:14:54   /var/lib/TeamCity/work/99b311828f4ee7c/keyman/resources/shellHelperFunctions.sh: line 196: 3974199 Killed                  "$@"
15:15:51   npm ERR! code ENOTEMPTY
15:15:51   npm ERR! syscall rename
15:15:51   npm ERR! path /var/lib/TeamCity/work/99b311828f4ee7c/keyman/web/node_modules/typescript
15:15:51   npm ERR! dest /var/lib/TeamCity/work/99b311828f4ee7c/keyman/web/node_modules/.typescript-FshlhPnn
15:15:51   npm ERR! errno -39
15:15:51   npm ERR! ENOTEMPTY: directory not empty, rename '/var/lib/TeamCity/work/99b311828f4ee7c/keyman/web/node_modules/typescript' -> '/var/lib/TeamCity/work/99b311828f4ee7c/keyman/web/node_modules/.typescript-FshlhPnn'
15:15:51   
15:15:51   npm ERR! A complete log of this run can be found in: /home/bob/.npm/_logs/2024-05-31T08_15_49_718Z-debug-0.log

And... wait, what's this about a web/ specific node_modules/typescript? Sure enough, the package.json for web/ is oddly showing the older version that we dropped. Will fix that in the 🔩 PR set (new commit on #11464).

@jahorton
Copy link
Contributor Author

jahorton commented Jun 3, 2024

Noted in discussion: it appears that our Linux BAs may be hitting out-of-memory due to VM restrictions that are then triggering the active npm ci call to be killed. That could easily leave the file-system updates halfway, which would then lead to the ENOTEMPTY error that followed in the prior log.

This was noticed via cross-reference with https://build.palaso.org/buildConfiguration/Keyman_Test_Common_Linux/467596?buildTab=log&linesState=86&logView=flowAware&focusLine=102, which reported an error code of 137, which is indicative of memory issues.

Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted in discussion: it appears that our Linux BAs may be hitting out-of-memory due to VM restrictions that are then triggering the active npm ci call to be killed. That could easily leave the file-system updates halfway, which would then lead to the ENOTEMPTY error that followed in the prior log.

The exit code for OOM kill is 137 (https://stackoverflow.com/questions/53245385/npm-gets-killed-or-errno-137) so if we captured that we could do a cleanup of node_modules by wrapping the npm ci into yet another function.

sleep $wait_length
fi

if ! "$@"; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ! "$@"; then
if ! "$@"; then
builder_echo "Command failed with error $?"

If would be really helpful to capture the error code here and report it. (Need to test that the exit code is not already lost; if it is then we'll need a slightly more complicated solution).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should've checked it locally first:

[web/src/app/browser] Command failed with error 0

@jahorton jahorton requested a review from mcdurdin June 3, 2024 08:19
Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jahorton jahorton merged commit e9ccd6f into master Jun 6, 2024
26 checks passed
@jahorton jahorton deleted the chore/common/retry-npm-ci branch June 6, 2024 01:10
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 18.0.50-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

chore(common): add retry checks around npm install (and any other npm network activities)
3 participants