Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix part of #2691: Mitigate affected tests script CI hang problem #3416

Merged

Conversation

BenHenning
Copy link
Member

@BenHenning BenHenning commented Jul 6, 2021

Fix part of #2691.

Explanation

Introduces a mitigation for the continued issue of certain PRs causing the ComputeAffectedTests target to hang. There's a bunch of context of different situations where this was observed in #2961 and connected PRs, but principally the issue seems to be that sibling queries that return large numbers of results can cause Bazel to hang in CI environments. My theory is that this has something to do with both fewer environment resources & a difference in the filesystem used in CI.

The mitigation tries to reduce the chance the error happens in two ways:

  1. By splitting up the query such that we first compute the list of rbuildfile deps, then for each dependency we calculate siblings before combining the list together for the final rdeps & test computation (moving from 1 query to 2+N for N related build files).
  2. By introducing filtering on the siblings query to only apply to tests and Android libraries.

Note that (1) probably means the query is overall slower in non-CI environments, but it lets us do (2) which is probably the actual fix. For some files, the siblings call alone would expand to >1400 results. With filtering, this goes down to <100 for the same target. There's one important caveat to note here: in special cases where a non-test, non-library target is affected we will not find corresponding tests. This seems fairly unlikely given that this entire mechanism is only the secondary means of finding affected tests, anyway.

Regarding testing, one new test was introduced to verify a scenario where bzl file changes affect tests. This test passed without the fix in CI, but it's still a nice test to have. Further, the WORKSPACE test that previously consistently hung is now passing in CI with this mitigation which provides a fairly strong indication that this fix should work for us at least in the medium term. Finally, #3417 demonstrated the fix works in a real situation where an originating PR's (#3340) tests would not be computed properly in CI.

Checklist

  • The PR title starts with "Fix #bugnum: ", followed by a short, clear summary of the changes. (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".)
  • The PR explanation includes the words "Fixes #bugnum: ..." (or "Fixes part of #bugnum" if the PR only partially fixes an issue).
  • The PR follows the style guide.
  • The PR does not contain any unnecessary auto-generated code from Android Studio.
  • The PR is made from a branch that's not called "develop".
  • The PR is made from a branch that is up-to-date with "develop".
  • The PR's branch is based on "develop" and not on any other branch.
  • The PR is assigned to an appropriate reviewer in both the Assignees and the Reviewers sections.

The new test is meant to determine whether changes to .bzl files can
also trigger a hang in CI in the same way that WORKSPACE files do.
@BenHenning
Copy link
Member Author

NB: I'm keeping this in draft state to investigate the underlying problem. We discovered that while the new Kotlin script introduced in #3374 mitigated the formatting issue that prevented CI from running in #3340, it didn't stop a new issue of the script hanging. The current working theory is that the rbuildfiles routine has issues in CI when trying to find the dependencies for the WORKSPACE file, however #3340 doesn't change WORKSPACE. I'm now thinking that any changes to bzl files might have the same effect, so I've introduced a new test where we change a bzl file to determine if it also causes CI to hang.

@BenHenning
Copy link
Member Author

Hmm. Unfortunately, the tests passed. Still thinking about this.

This test introduces bzl file indirection closer to #3340 to see if
that's the scenario that causes CI to hang.
This change is meant to reduce the filesystem & processing cost for
computing tests affected by BUILD/bzl file changes to reduce the chance
of it hanging in CI environments.

This also disables a test that's known to hang in CI environments to see
if it fixes the issue.
@BenHenning
Copy link
Member Author

The WORKSPACE test is now passing with the mitigation. :) That introduces a lot more confidence that this fix could work longer term.

@BenHenning BenHenning changed the title Fix part of #2691: Mitigate affected tests script rbuildfiles issues Fix part of #2691: Mitigate affected tests script CI hang problem Jul 8, 2021
@BenHenning BenHenning marked this pull request as ready for review July 8, 2021 09:37
@BenHenning
Copy link
Member Author

BenHenning commented Jul 8, 2021

@anandwana001 & @Sparsh1212 PTAL.

Akshay: note that this is blocking Sparsh's and Abhay's downstream PRs since their PR's tests can't be verified in CI due to the compute affected tests CI issue.

@Sparsh1212
Copy link
Contributor

LGTM!

@oppiabot
Copy link

oppiabot bot commented Jul 8, 2021

Unassigning @Sparsh1212 since they have already approved the PR.

Copy link
Contributor

@anandwana001 anandwana001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As all ci test as passing, LGTM

@anandwana001 anandwana001 merged commit d99fd17 into develop Jul 8, 2021
@anandwana001 anandwana001 deleted the mitigate-affected-tests-script-rbuildfiles-issues branch July 8, 2021 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants