-
Notifications
You must be signed in to change notification settings - Fork 521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix part of #2691: Mitigate affected tests script CI hang problem #3416
Fix part of #2691: Mitigate affected tests script CI hang problem #3416
Conversation
The new test is meant to determine whether changes to .bzl files can also trigger a hang in CI in the same way that WORKSPACE files do.
NB: I'm keeping this in draft state to investigate the underlying problem. We discovered that while the new Kotlin script introduced in #3374 mitigated the formatting issue that prevented CI from running in #3340, it didn't stop a new issue of the script hanging. The current working theory is that the rbuildfiles routine has issues in CI when trying to find the dependencies for the WORKSPACE file, however #3340 doesn't change WORKSPACE. I'm now thinking that any changes to bzl files might have the same effect, so I've introduced a new test where we change a bzl file to determine if it also causes CI to hang. |
Hmm. Unfortunately, the tests passed. Still thinking about this. |
This test introduces bzl file indirection closer to #3340 to see if that's the scenario that causes CI to hang.
This change is meant to reduce the filesystem & processing cost for computing tests affected by BUILD/bzl file changes to reduce the chance of it hanging in CI environments. This also disables a test that's known to hang in CI environments to see if it fixes the issue.
The WORKSPACE test is now passing with the mitigation. :) That introduces a lot more confidence that this fix could work longer term. |
@anandwana001 & @Sparsh1212 PTAL. Akshay: note that this is blocking Sparsh's and Abhay's downstream PRs since their PR's tests can't be verified in CI due to the compute affected tests CI issue. |
LGTM! |
Unassigning @Sparsh1212 since they have already approved the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As all ci test as passing, LGTM
Fix part of #2691.
Explanation
Introduces a mitigation for the continued issue of certain PRs causing the ComputeAffectedTests target to hang. There's a bunch of context of different situations where this was observed in #2961 and connected PRs, but principally the issue seems to be that sibling queries that return large numbers of results can cause Bazel to hang in CI environments. My theory is that this has something to do with both fewer environment resources & a difference in the filesystem used in CI.
The mitigation tries to reduce the chance the error happens in two ways:
Note that (1) probably means the query is overall slower in non-CI environments, but it lets us do (2) which is probably the actual fix. For some files, the siblings call alone would expand to >1400 results. With filtering, this goes down to <100 for the same target. There's one important caveat to note here: in special cases where a non-test, non-library target is affected we will not find corresponding tests. This seems fairly unlikely given that this entire mechanism is only the secondary means of finding affected tests, anyway.
Regarding testing, one new test was introduced to verify a scenario where bzl file changes affect tests. This test passed without the fix in CI, but it's still a nice test to have. Further, the WORKSPACE test that previously consistently hung is now passing in CI with this mitigation which provides a fairly strong indication that this fix should work for us at least in the medium term. Finally, #3417 demonstrated the fix works in a real situation where an originating PR's (#3340) tests would not be computed properly in CI.
Checklist