Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose paused and retired workers separately in prometheus #8613

Merged
merged 4 commits into from
Jul 18, 2024

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Apr 11, 2024

Closes #xxxx

  • Tests added / passed
  • Passes pre-commit run --all-files

cc @ntabris for the grafana dashboards

@phofl phofl requested a review from fjetter as a code owner April 11, 2024 13:14
@ntabris
Copy link
Contributor

ntabris commented Apr 11, 2024

Having paused and retiring and paused_or_retiring makes things more complicated for me, because various things would need to include that iff paused and retiring are not present (and not include if they are, otherwise we'd be double counting).

Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

@fjetter
Copy link
Member

fjetter commented Apr 11, 2024

Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

@ntabris
Copy link
Contributor

ntabris commented Apr 11, 2024

I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

I said "kinda". It messes up anything that assumes states other than "connected" are exclusive, or that (eg) a chart of all states other than "connected" would make sense.

Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).

@fjetter
Copy link
Member

fjetter commented Apr 11, 2024

Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).

Right, that makes sense. It's a shame we didn't introduce the split metrics earlier. For now, we're mostly interested in the paused metric because this may highlight a problematic cluster behavior (too small in size) while the retiring signal is a little bit of noise and is usually harmless.

So I don't have a solution for the "nice chart" problem but when using this as a tag, the retired metric as a standalone thing already makes sense.

Copy link
Contributor

github-actions bot commented Apr 11, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ±0      29 suites  ±0   11h 52m 24s ⏱️ - 2m 15s
 4 087 tests ±0   3 970 ✅ +1    112 💤 ±0  4 ❌  - 2  1 🔥 +1 
55 287 runs  +1  52 844 ✅ +4  2 438 💤  - 1  4 ❌  - 3  1 🔥 +1 

For more details on these failures and errors, see this check.

Results for commit dfc84cb. ± Comparison against base commit d68a5d9.

♻️ This comment has been updated with latest results.

@phofl
Copy link
Collaborator Author

phofl commented Apr 11, 2024

I'd be fine with removing the paused_or_retiring metric, my main concern was backward compatibility.

We mostly care about paused as @fjetter said

@hendrikmakait
Copy link
Member

FWIW, I'm +1 for removing paused_or_retiring. We don't have a good story for backward compatibility yet, and this change seems like a net improvement.

@hendrikmakait
Copy link
Member

@phofl: Should we move this forward by removing paused_or_retiring?

@phofl
Copy link
Collaborator Author

phofl commented Jul 17, 2024

Yep lets do that

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @phofl! One out-of-scope comment that doesn't need to be addressed here as it's not a regression.

worker_states.add_metric(
["paused_or_retiring"], len(self.server.workers) - len(self.server.running)
["retiring"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope: Technically, this definition is slightly off because there's a brief delay between registering a worker on the scheduler (with the worker being in the init state) and updating its state to running on the scheduler.

Comment on lines +53 to +55
paused_workers = len(
[w for w in self.server.workers.values() if w.status == Status.paused]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect looping over this once every poll should be fine, we can address this later if it's found to become a problem with very large clusters.

@phofl phofl merged commit 30c0d29 into dask:main Jul 18, 2024
29 of 36 checks passed
@phofl phofl deleted the prometheus_paused_workers branch July 18, 2024 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants