Using this package to run tests with github actions is buggy #40

Closed
joewagner opened this issue Jun 6, 2023 · 11 comments · May be fixed by #30
Labels: bug (Something isn't working), linear (Sync issue with linear)

@joewagner
Contributor

What happens

Running the tests on a developer machine always works as expected, but when they run as part of CI via GitHub Actions they fail about half the time.

How to reproduce

For an example, see: https://github.com/tablelandnetwork/jeti/actions/runs/5191892282/jobs/9360322039
You will see that the network logs are turned on and the tests start running before the Validator has created the healthbot table. Some of the tests pass in this case, but the tests that create tables end up failing because the Validator's transaction receipt polling is aborted once the polling timeout is reached.
To explore why this is happening, I had the test setup wait a full 60 seconds after the local-tableland network signaled that it was ready; the same intermittent failures still occurred.
I have a few theories about why this is happening:

  • The Validator process is failing before the Node.js process error listener has been attached correctly.
  • The Validator startup process is still in progress when the local-tableland parent process signals that the network is ready.
  • The Validator is starting, but is being overwhelmed by polling requests and smart contract events.

What is expected

This package should work correctly when used for CI via GitHub Actions.

Additional context

The parent process signals that the network is ready by inspecting the Validator process's stdout and waiting for a specific message. That message is currently the string "processing height". This is brittle at best, and we should consider having the Validator log a specific message when it considers itself fully online.
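
A rough sketch of that stdout-watching approach, for illustration only (the command, arguments, and helper name are hypothetical, not the actual local-tableland implementation):

// readiness-sketch.ts -- illustrative only
import { spawn } from "node:child_process";

// Resolve once the child process prints a given marker string on stdout.
// Note: a naive includes() check can miss a marker that is split across
// stdout chunks, which is part of why string matching is brittle.
function waitForStdoutMarker(
  cmd: string,
  args: string[],
  marker: string
): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args);
    child.stdout?.on("data", (chunk: Buffer) => {
      if (chunk.toString().includes(marker)) resolve();
    });
    child.on("error", reject);
    child.on("exit", (code) => {
      if (code !== 0) reject(new Error(`process exited with code ${code}`));
    });
  });
}

// Hypothetical usage: treat "processing height" as the readiness signal.
// await waitForStdoutMarker("tableland-validator", ["--chain", "local"], "processing height");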

@joewagner
Contributor Author

After some debugging, it seems this problem is caused by an unknown issue that prevents the registry contract from being deployed.

@joewagner
Contributor Author

@dtbuchholz unfortunately it looks like tablelandnetwork/local-tableland#433 did not fix this issue. It still looks like the deploy process is silently crashing, or never starting.
Here's a CI run with logging: https://github.com/tablelandnetwork/js-validator/actions/runs/5684229078/job/15406459448

@dtbuchholz
Contributor

@joewagner dang...this is frustrating! I can try tweaking some things next week and maybe configure a subset of separate tests to work with act for local debugging (last time I tried it, Docker wasn't working properly with the tests).

A couple of things came up through some ChatGPT convos. First, maybe the cache is out of date or corrupted? Since things start properly, I doubt this is the root cause, but figured I'd mention it. We could temporarily try flushing it by adding a version suffix to the cache key, like:

key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}-v2

The other point was on concurrency. Could there be an issue with shared resources? I'd also not expect this to be an issue since it's just a single validator running. But it might help if, for example, there were a way to catch the actual validator error that happens when you run two LT instances with the same config on different registry ports.

Last thought: when I was debugging, I threw in a shelljs echo command with lsof -i tcp:8545, during registry and validator process shutdown IIRC. Most of the time there were 3 PIDs logged, but every once in a while I'd unpredictably see 10-15. I don't recall any impact, though, so just sharing the observation.
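
A minimal sketch of that kind of port check, assuming shelljs is installed (illustrative only, not the exact command I used):

// port-check-sketch.ts -- illustrative only
import shell from "shelljs";

// Log which processes currently hold TCP port 8545 (hardhat's default).
const result = shell.exec("lsof -i tcp:8545", { silent: true });
shell.echo(result.stdout);

// Count open handles (lsof prints one header line, then one line per handle).
const lines = result.stdout.trim().split("\n");
shell.echo(`open handles on :8545: ${Math.max(lines.length - 1, 0)}`);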

Other than that, I wonder if there's something unknown going on with hardhat. A last-ditch effort could be to replace it with forge, which I've only heard great things about for DX.

@joewagner
Contributor Author

joewagner commented Jul 28, 2023

Definitely all good thoughts.
Just to reiterate what I've found in debugging:
The symptom is specifically the deploy process, i.e. npx hardhat run scripts/deploy.ts. When a GitHub Actions failure occurs, the deploy process (which is a synchronous child_process) logs the following:

[Contract Deploy] 
[Contract Deploy] Downloading compiler 0.8.19
[Contract Deploy] 

Then the process exits without an error, which is obviously not super helpful. The hardhat node starts without any issue, and the Validator starts and connects to hardhat. But since the contract doesn't exist, the Validator can't be of much use, and the tests fail (mostly with polling timeouts).

The main takeaway is that since hardhat is starting and the Validator connects to it, I don't think the issue has to do with the port already being in use. It could still be related, but it doesn't seem to be the problem in my tests.
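
As a rough illustration of surfacing a silent deploy failure (not the actual test setup; the script path and log prefix are taken from the log above), a synchronous spawn could check the exit status explicitly:

// deploy-check-sketch.ts -- illustrative only
import { spawnSync } from "node:child_process";

// Run the deploy script synchronously and surface its exit status, so a
// crash that produces no output is not silently swallowed.
const result = spawnSync("npx", ["hardhat", "run", "scripts/deploy.ts"], {
  encoding: "utf8",
});

console.log("[Contract Deploy]", result.stdout);
if (result.error || result.status !== 0) {
  console.error("[Contract Deploy] stderr:", result.stderr);
  throw new Error(`deploy exited with status ${result.status}`);
}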

@dtbuchholz
Contributor

@joewagner would CircleCI help with debugging this? I have some free credits that I can get if it's relevant.

@joewagner
Contributor Author

joewagner commented Aug 11, 2023

@dtbuchholz Maybe!? That definitely looks promising. I found an Action that starts an SSH server on the machine running the workflows; it seems like it would be helpful too: https://github.com/marketplace/actions/debugging-with-ssh
I used it to fix an unrelated issue with the monorepo.

@dtbuchholz
Contributor

Hmm, interesting. Kk, I'll get the credits and check that action out, too.

@joewagner
Contributor Author

I was able to get more information on the registry deploy process error.
After adding #22, the following is now being logged:

Couldn't download compiler version 0.8.19+commit.7dd6d404: Checksum verification failed.
@tableland/sdk: Please check your internet connection and try again.
@tableland/sdk: If this error persists, run "npx hardhat clean --global".
@tableland/sdk: HardhatError: HH503: Couldn't download compiler version 0.8.19+commit.7dd6d404: Checksum verification failed.

@dtbuchholz
Contributor

@joewagner no way! If the issue is due to proxying, we could try one of these out (via here):

// hardhat.config.ts
const { ProxyAgent, setGlobalDispatcher } = require("undici");
const proxyAgent = new ProxyAgent('http://127.0.0.1:7890'); // change to yours
setGlobalDispatcher(proxyAgent);

or via env vars:

export HTTP_PROXY=<username>:<password>@<ip_address>:<ip_port>
export HTTPS_PROXY=<username>:<password>@<ip_address>:<ip_port>

@joewagner
Contributor Author

if the issue is due to proxying, we could try one of these out (via here):

It looks like the error is HardhatError: HH503, which is described here. The suggested solution is to run hardhat clean --global. I'll try that in a PR; unfortunately, we can't really be sure it will fix anything since the failure is intermittent...
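
A sketch of the kind of workaround that could go in such a PR, purely illustrative and under the assumption that retrying plus cleaning the global compiler cache is acceptable (it would mask the root cause rather than fix it):

// deploy-retry-sketch.ts -- illustrative only; a workaround, not a fix
import { spawnSync } from "node:child_process";

function run(cmd: string, args: string[]): number {
  const { status } = spawnSync(cmd, args, { stdio: "inherit" });
  return status ?? 1;
}

// Retry the deploy a few times, clearing hardhat's global compiler cache
// between attempts, as the HH503 error message suggests.
const maxAttempts = 3;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  if (run("npx", ["hardhat", "run", "scripts/deploy.ts"]) === 0) break;
  if (attempt === maxAttempts) {
    throw new Error("contract deploy failed after retries");
  }
  console.warn(`deploy attempt ${attempt} failed; cleaning compiler cache and retrying`);
  run("npx", ["hardhat", "clean", "--global"]);
}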

@dtbuchholz
Contributor

(moving to #96 for Linear syncing purposes)

@dtbuchholz closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 21, 2023