Verify Bazel CI test changes are working as expected #2691
Running on develop: https://github.com/oppia/oppia-android/actions/runs/559499045 (strange JVM crash failure, but tests otherwise look operational).
Manual run kicked off: https://github.com/oppia/oppia-android/actions/runs/560109387. Seems to be working, but I'll double-check once it finishes before checking off the above.
For the fork test, we'll need to wait for a new PR to be created before we can check it.
Ugh, looks like the check might be failing when no targets are computed: https://github.com/oppia/oppia-android/actions/runs/560112155.
I think we can conclude that manually triggering the workflows does work, and that the no-affected-targets check is NOT working. I'm fairly confident that once the underlying issue is fixed, the conclusion task should pass & thus work when it's required, so I may preemptively check that off once #2696 is verified as working & merged in.
/cc @fsharpasharp in case you're interested in following along on this.
Another issue: #2695.
Confirming forks fail at least due to the key management piece: #2690. There's probably a check missing here.
Per https://github.com/oppia/oppia-android/pull/2690/checks?check_run_id=1885145971 it looks like the fork check is failing unexpectedly. Those actions weren't supposed to run, which is why they're failing.
Interestingly, https://github.com/oppia/oppia-android/pull/2320/checks?check_run_id=1886178465 ran successfully, so the computed affected tests bit is not yet reliable.
#2648 seems to show that Bazel file changes are causing the compute-affected-targets step to hang: https://github.com/oppia/oppia-android/runs/1925361182. Adding this as another TODO.
#2775 has another example of the affected targets issue: https://github.com/oppia/oppia-android/pull/2775/checks?check_run_id=2036846344:
Running the script locally did not yield issues, yet we occasionally see failures like the above in CI. It seems to be a CI-specific issue, so I'm adding additional diagnostics to try to figure out why this is happening.
Forks are now verified as working. Some tweaks have been made, but both forks & non-forks have been working well for the last few weeks.
Also, another update: CI should no longer hang on the compute affected tests issue since the empty matrix issue was fixed in #2696.
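For context on the empty-matrix failure mode: a common way to avoid it in GitHub Actions is to guard the matrix job on the computed output so it's skipped when no targets are found. A minimal sketch, assuming hypothetical job, step, and script names (not necessarily what #2696 actually did):

```yaml
jobs:
  compute_affected_tests:
    runs-on: ubuntu-latest
    outputs:
      # Hypothetical output: a JSON array of affected test targets.
      matrix: ${{ steps.compute.outputs.matrix }}
    steps:
      - id: compute
        run: echo "matrix=$(./compute_affected_tests.sh)" >> "$GITHUB_OUTPUT"
  bazel_test:
    needs: compute_affected_tests
    # Skip the job entirely when no targets were computed, rather than
    # failing on an empty matrix.
    if: ${{ needs.compute_affected_tests.outputs.matrix != '' && needs.compute_affected_tests.outputs.matrix != '[]' }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test-target: ${{ fromJson(needs.compute_affected_tests.outputs.matrix) }}
    steps:
      - run: bazel test ${{ matrix.test-target }}
```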
CI runtime went down substantially with #3035, making it closer to viable to enable the tests as required checks.
#2844 also needs to be resolved before we can enable Bazel tests, and it seems like recent changes might be working.
One of the last remaining major issues is sometimes the Bazel query output has formatting issues, resulting in invalid test targets like:
I've debugged this a bit and found that it's due to the output from Bazel query, but I've never been able to reproduce it locally. We might need a hacky workaround: detect when a test target has this formatting and manually split it up via Bash.
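As a sketch of that hacky workaround (the function name is hypothetical, and this assumes the malformed output is two targets fused onto one line): every valid target in query output starts with `//`, so splitting on that prefix recovers the individual targets.

```shell
# Hypothetical workaround sketch: Bazel query occasionally emits two targets
# fused on one line, e.g. "//app:AppTest//domain:DomainTest". Insert a
# newline before each "//" occurrence, then drop the empty leading line.
split_fused_targets() {
  printf '%s\n' "$1" | sed 's|//|\n//|g' | sed '/^$/d'
}

# A fused line splits into two targets; a normal line passes through as-is.
split_fused_targets '//app:AppTest//domain:DomainTest'
split_fused_targets '//app:AppTest'
```

Note this relies on GNU sed's support for `\n` in the replacement, which holds on the Linux CI runners but not on BSD sed.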
Some updates:
…3374)
* Migrate compute affected tests script to Kotlin. No tests for the script yet.
* Add BUILD.bazel docstrings.
* Enable linters for scripts code. Fix ktlint issues.
* Add tests for the compute affected tests script. Fix some typos & a realized bug in the script.
* Check out develop first so that it exists in CI. See comment in PR for context.
* Reorganize & refactor, and improve tests. This splits up the test & script into multiple utility packages (for common & testing utilities), reorganizes the test to be in a proper javatests/ & java/ arrangement, and adds tests for checking the error conditions of the test.
* Add tests for executeCommand(). This also refactors that function to be part of a class so that its operating conditions are configurable (namely the process timeout).
* Add tests for BazelClient.
* Add tests for GitClient.
* Add tests for TestBazelWorkspace & documentation. Also, fix a few small issues in the utility.
* Add tests for TestGitRepository.
* Add support for changing the relative develop branch. Also, fix some changed behaviors that broke the affected tests script.
* Fix linter error found in CI.
* Address reviewer comments.
* Remove unnecessary extra checkout from CI.
* Disable test that can't pass in CI. See #2691 for context.
* Add new tests for .bzl files. The new test is meant to determine whether changes to .bzl files can also trigger a hang in CI in the same way that WORKSPACE files do.
* Add missing argument to invocation example.
* Add debug lines for investigation.
* Attempt new test for investigation. This test introduces bzl file indirection closer to #3340 to see if that's the scenario that causes CI to hang.
* Move fix from test branch. This change is meant to reduce the filesystem & processing cost of computing tests affected by BUILD/bzl file changes, to reduce the chance of it hanging in CI environments. This also disables a test that's known to hang in CI environments to see if it fixes the issue.
Now that batching is in, we're seeing #2844 and #3759 crop up. Also, https://github.com/oppia/oppia-android/pull/3849/checks?check_run_id=3719505383 had basically all of its Bazel workflows fail due to a download flake from Google's CDN. We've seen this before, but never as badly as in that PR. bazelbuild/bazel#12010 is a Bazel-specific issue that had similar reports, and they introduced a flag to improve retries. I'm not sure that will work in cases like the one above since every Bazel workflow failed (which essentially means there were already a bunch of retries). Usually we see just one or two fail, so this flag might work for those cases. Hopefully the all-workflows-fail cases aren't as common.
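For reference, I believe the retry flag that came out of the bazelbuild/bazel#12010 discussion is `--experimental_repository_downloader_retries` (the exact name and default may vary across Bazel versions, so verify against the version in use). It could be enabled repo-wide via `.bazelrc`:

```
# .bazelrc sketch: retry repository downloads on transient failures.
# Flag name is an assumption from the bazelbuild/bazel#12010 discussion;
# check it against the project's Bazel version before relying on it.
build --experimental_repository_downloader_retries=5
```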
So it looks like some recent PRs have revealed that compute_affected_tests can sometimes time out. While I think that's believable (since CI runs much slower), I also noticed certain PRs (like #4886) can actually cause a permanent hang due to a bug I discovered in the
Now that #4092 is in, and despite some of the unchecked items in the issue's main message body, I actually don't think there's much more to do in the context of this thread. We've now been using the compute affected tests flow for 3.5 years (with a bunch of updates to improve its performance and robustness). It's now operating quite well, I think: I have high confidence that it does the right thing in most places and has very few false negatives. I think I can confidently say that this is verified as working now.
From #1904: