[Monitoring] Enrich UNTRIAGED_TESTCASE_AGE metric to track testcases stuck in analyze #4547
TESTCASE_UPLOAD_TRIAGE_DURATION = monitor.CumulativeDistributionMetric(
'uploaded_testcase_analysis/triage_duration_secs',
TESTCASE_TRIAGE_DURATION = monitor.CumulativeDistributionMetric(
'testcase_analysis/triage_duration_secs',
Let's rename this to triage_duration_hours (since the unit is hours)
lgtm
# File the bug first and then create filed bug metadata.
if not _file_issue(testcase, issue_tracker, throttler):
  _emit_untriaged_testcase_age_metric(testcase, PENDING_FILING)
  _increment_untriaged_testcase_count(testcase.job_type, PENDING_FILING)
There's a lot of stuff we're doing here for metrics. Is any of it tracked in tests?
Motivation
We currently have no way to tell whether the analyze task executed successfully. The TESTCASE_UPLOAD_TRIAGE_DURATION metric from #4364 only tracks duration for tasks that finished.
This PR adds an analyze_pending field to the Testcase entity in datastore. It defaults to False, is set to True for manually uploaded testcases, and is reset to False once the analyze task's postprocess step runs.
It also emits the UNTRIAGED_TESTCASE_AGE metric from #4381 with a status label, so we can tell at which step a testcase is stuck and alert if analyze takes longer to finish than expected.
The alert itself could be, for instance: P50 age of untriaged testcases (status=analyze_pending) > 3h.
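For reviewers, the intended lifecycle can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: `UploadedTestcase`, `mark_uploaded`, `mark_analyzed`, `untriaged_status`, `untriaged_age_hours`, and `should_alert` are hypothetical names, the status string values are assumed, and the real alert would be configured in the monitoring backend rather than computed in Python.

```python
import statistics
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# Status label values are assumptions; only the names appear in the PR.
ANALYZE_PENDING = 'analyze_pending'
PENDING_FILING = 'pending_filing'


@dataclass
class UploadedTestcase:
  """Hypothetical stand-in for the datastore Testcase entity."""
  created: datetime
  analyze_pending: bool = False  # False by default, per the description.
  bug_filed: bool = False


def mark_uploaded(testcase: UploadedTestcase) -> None:
  """Manual upload: analyze has been scheduled but has not finished."""
  testcase.analyze_pending = True


def mark_analyzed(testcase: UploadedTestcase) -> None:
  """Analyze postprocess ran: the testcase is no longer stuck in analyze."""
  testcase.analyze_pending = False


def untriaged_status(testcase: UploadedTestcase) -> Optional[str]:
  """Returns the status label for the age metric, or None if triaged."""
  if testcase.analyze_pending:
    return ANALYZE_PENDING
  if not testcase.bug_filed:
    return PENDING_FILING
  return None


def untriaged_age_hours(testcase: UploadedTestcase, now: datetime) -> float:
  """Age of the testcase in hours, the value the metric would record."""
  return (now - testcase.created).total_seconds() / 3600


def should_alert(ages_hours: Sequence[float],
                 threshold_hours: float = 3.0) -> bool:
  """True if the P50 (median) age exceeds the threshold."""
  if not ages_hours:
    return False
  return statistics.median(ages_hours) > threshold_hours


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
testcase = UploadedTestcase(created=now - timedelta(hours=4))
mark_uploaded(testcase)
print(untriaged_status(testcase), untriaged_age_hours(testcase, now))
```

A testcase whose analyze task never reaches postprocess keeps `analyze_pending` set, so its age keeps growing under the `analyze_pending` label, which is what the proposed alert would catch.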
Also, this retroactively addresses comments from #4481: