Issue with completed tasks hanging and showing as in-progress #147007

Open
ricardoamador opened this issue Apr 18, 2024 · 9 comments · May be fixed by flutter/cocoon#3675
Labels
P1 High-priority issues at the top of the work list team-infra Owned by Infrastructure team triaged-infra Triaged by Infrastructure team

Comments

@ricardoamador
Contributor

Type of Request

bug

Infrastructure Environment

Cocoon Scheduler and Cocoon Dashboard

What is happening?

In the packages repository, a particular task has run multiple times even though it had previously passed. The task Mac_arm64 ios_platform_tests_shard_5 stable (https://ci.chromium.org/ui/p/flutter/builders/luci.flutter.prod/Mac_arm64%20ios_platform_tests_shard_5%20stable) for commit 0e3809d995b66af0b54b91d7e2412cf413b8717b shows many runs, including passing ones, yet the task kept getting rerun. See below:
image

Something similar happened a couple of days ago, when a task ran 46 times, well beyond the retry limit.
As a follow-up: while recovering tasks, I noticed that for commit d39830e40c07f71c7086128b45cffeb66be19488 the test Mac_arm64 ios_platform_tests_shard_2 master ran 43 times. I thought we limited reruns to 3?

Thread link here: https://chat.google.com/room/AAAAaqs_Mg0/HZmppnXFtP8/HZmppnXFtP8?cls=10

Steps to reproduce

Step 1:
Step 2:
..
Step n:

Expected results

I expect to see X when Y is finished.

@ricardoamador ricardoamador added team-infra Owned by Infrastructure team P1 High-priority issues at the top of the work list labels Apr 18, 2024
@ricardoamador
Contributor Author

A bit more information from when I was looking to recover the task: Firestore only has the task for that commit with attempt 3 appended to the task entry in the datastore:

image

Nothing appended beyond that:

image

And in the datastore, the last run was tracked as the third failed attempt but was left as "In progress" (the screenshot shows succeeded because I had recovered it based on the newest run):

image

But you can see that the datastore is tracking the current number of attempts, which it shows as 15.
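
The mismatch described above (Firestore stopping at attempt 3 while the datastore counter reads 15) can be written down as a small consistency check. This is only a Python sketch for illustration, not Cocoon code: the `TaskRecord` shape, the `find_inconsistencies` helper, and the `MAX_ATTEMPTS` value of 3 (taken from the "limited reruns to 3 times" remark earlier in this issue) are all assumptions.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the retry limit is 3, as suggested earlier in this issue.
MAX_ATTEMPTS = 3


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical, simplified view of one task as seen by one store."""
    name: str
    attempts: int
    status: str  # e.g. "In progress", "Succeeded", "Failed"


def find_inconsistencies(firestore: TaskRecord,
                         datastore: TaskRecord,
                         max_attempts: int = MAX_ATTEMPTS) -> List[str]:
    """Return human-readable descriptions of disagreements between stores."""
    problems = []
    if firestore.attempts != datastore.attempts:
        problems.append(
            f"attempt mismatch: firestore={firestore.attempts}, "
            f"datastore={datastore.attempts}"
        )
    if datastore.attempts > max_attempts:
        problems.append(
            f"datastore attempts {datastore.attempts} exceed limit {max_attempts}"
        )
    if firestore.status != datastore.status:
        problems.append(
            f"status mismatch: firestore={firestore.status!r}, "
            f"datastore={datastore.status!r}"
        )
    return problems


# Values mirror the numbers reported in this comment; statuses are illustrative.
task = "Mac_arm64 ios_platform_tests_shard_5 stable"
fs = TaskRecord(name=task, attempts=3, status="In progress")
ds = TaskRecord(name=task, attempts=15, status="In progress")
for problem in find_inconsistencies(fs, ds):
    print(problem)
```

A check along these lines would flag the task as soon as the two counters drift apart, rather than after dozens of reruns.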

@yusuf-goog yusuf-goog added the triaged-infra Triaged by Infrastructure team label Apr 18, 2024
@keyonghan
Contributor

This seems to be due to missing logic for handling the current_attempt tag when a post-submit check run is rerun via Re-run in the GitHub UI. https://github.com/flutter/cocoon/blob/main/app_dart/lib/src/service/luci_build_service.dart#L413

When rerunning from the GitHub UI, current_attempt is reset to 1: https://github.com/flutter/cocoon/blob/main/app_dart/lib/src/service/luci_build_service.dart#L616, which causes confusion on the Firestore side.
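
To make the failure mode concrete, here is a minimal Python sketch of the difference between resetting the attempt counter and carrying it forward. It is not Cocoon's Dart implementation: the function names and the representation of build tags as a plain string-to-string mapping are assumptions for illustration; only the `current_attempt` tag name comes from the comment above.

```python
from typing import Mapping, Optional

ATTEMPT_TAG = "current_attempt"  # tag name taken from the comment above


def next_attempt_reset(previous_tags: Optional[Mapping[str, str]]) -> int:
    """Behaviour as described for GitHub UI reruns: always restart at 1."""
    return 1


def next_attempt_from_tags(previous_tags: Optional[Mapping[str, str]]) -> int:
    """Hypothetical alternative: continue counting from the previous build."""
    if not previous_tags:
        return 1
    raw = previous_tags.get(ATTEMPT_TAG)
    try:
        previous = int(raw) if raw is not None else 0
    except ValueError:
        # Treat a malformed tag as "no previous attempt recorded".
        previous = 0
    return max(previous, 0) + 1


tags = {ATTEMPT_TAG: "3"}
print(next_attempt_reset(tags))      # 1
print(next_attempt_from_tags(tags))  # 4
```

With the reset variant, every rerun looks like a first attempt, so a store keyed on the attempt number never sees attempt 4 and a retry limit based on that number would never be reached.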

@ricardoamador
Contributor Author

This happened again here:
image

@stuartmorgan
Contributor

FWIW, the GitHub UI was the only UI to manage tasks in flutter/packages for many years, so trying to retrain everyone to never use that UI—especially when we still have the release task that can only be run/re-run from there—is going to be non-trivial. If we could make it work instead, that would be helpful.

@ricardoamador
Contributor Author

ricardoamador commented Apr 19, 2024

@stuartmorgan nah I don't think this is a matter of retraining but just a bug on an edge case caused by a migration to a new datastore.

especially when we still have the release task that can only be run/re-run from there

This is peculiar, are you saying you are running release tasks from the github UI during presubmit? Can you add more context here?

@stuartmorgan
Contributor

I don't think this is a matter of retraining

I was referring to this comment in an issue dup'd to this one.

especially when we still have the release task that can only be run/re-run from there

This is peculiar, are you saying you are running release tasks from the github UI during presubmit? Can you add more context here?

No, I'm saying that the post-submit GitHub Actions task called release, which is responsible for actually publishing all of the packages in flutter/packages and is thus a critical part of our CI and gardening responsibilities, is only visible—and thus re-runnable—in the GitHub UI, not in the Flutter dashboard.

@ricardoamador
Contributor Author

@stuartmorgan okay, thanks for clarifying.

@keyonghan
Contributor

No, I'm saying that the post-submit GitHub Actions task called release, which is responsible for actually publishing all of the packages in flutter/packages and is thus a critical part of our CI and gardening responsibilities, is only visible—and thus re-runnable—in the GitHub UI, not in the Flutter dashboard.

The rerun I referred to in #147033 (comment) was for LUCI (postsubmit) check run only. The GitHub action will not be affected and can be rerun as usual.

To be clear, the LUCI check run (rerun) connects to the Cocoon backend to reschedule/update new builds, and is experiencing some issues. These issues can be worked around by calling the Cocoon reset-prod-task API directly.

Anyway, I will give flutter/cocoon#3675 a high priority this week for a fix.

@stuartmorgan
Contributor

The rerun I referred to in #147033 (comment) was for LUCI (postsubmit) check run only. The GitHub action will not be affected and can be rerun as usual.

I understand that, but what I was saying is that everyone on the ecosystem gardener rotation:

  • has muscle memory to retry failing post-submits from the GitHub UI, because it's how everything worked in that tree for many years, and
  • still has to interact with that UI when there are failures (because failures in any post-submit test will cause release to fail, by design, and so release has to be re-run with other tests).

The combination of those two things makes it harder in practice to stop using it for LUCI tests than it is in theory.

4 participants