Prometheus: fix offline worker metrics #1135

Open · wants to merge 17 commits into master
Conversation

Tomasz-Kluczkowski
Contributor

@Tomasz-Kluczkowski Tomasz-Kluczkowski commented Aug 9, 2021

@mher Fully working code (tested with Prometheus/Grafana) is now pushed. All metrics for offline workers are removed, and Grafana eventually runs out of data for its graphs and removes all the plots for non-existent workers.

Tests added.
I added my own comments to guide the reviewer.

Please let me know if this works.
It would be nice if someone actually ran this and observed the graphs in Grafana.

Log output when running this after the Celery worker is stopped:

[I 210815 23:22:43 web:2239] 200 GET /metrics (127.0.0.1) 3.92ms
[I 210815 23:22:58 web:2239] 200 GET /metrics (127.0.0.1) 3.41ms
[I 210815 23:23:12 web:2239] 200 GET /dashboard?json=1&_=1629066192482 (::1) 0.53ms
[I 210815 23:23:13 web:2239] 200 GET /metrics (127.0.0.1) 3.85ms
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-started', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-received', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-succeeded', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'task-failed', 'tasks.add') for metric counter:flower_events
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric histogram:flower_task_runtime_seconds
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric gauge:flower_task_prefetch_time_seconds
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560', 'tasks.add') for metric gauge:flower_worker_prefetched_tasks
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560',) for metric gauge:flower_worker_online
[D 210815 23:23:28 prometheus_metrics:57] Removed label set: ('worker1-concurrency-1@XPS-15-9560',) for metric gauge:flower_worker_number_of_currently_executing_tasks
[I 210815 23:23:28 web:2239] 200 GET /metrics (127.0.0.1) 6.30ms

… transient prometheus labels.

This is just a sketch.
Verified that a label is removed and the graph eventually disappears from Prometheus (once the old data is no longer in range).

Requires moving common logic into a separate EventsService.
…hboard.

Moved the code that generates the current-state dashboard into the EventsState object.
Started using static typing in these files.
Moved decision-making and filtering completely into the EventsState object.
Now we need one call to get the current, online workers' data.
The amount of metrics code, and its placement inside the EventsState object, was wrong, making testing very hard.
Moved all metrics-related code out of events.py into prometheus_metrics.py.

Add methods to extract label sets that belong to offline workers and remove them from the metric.
No matter how hard I tried not to access any private attribute of the metric itself, I ended up needing it somewhere.
For now I am just using metric._labelnames, as it is highly unlikely to change.
If we have offline workers, we will remove their metrics.
Now the metrics-related code is in prometheus_metrics.py.
The EventsState object is responsible for gathering offline/online worker data, filtering it for the dashboard view, and delegating the removal of metrics for offline workers to the PrometheusMetrics object.
The mock path needed changing after the code responsible for option-based filtering was moved into the EventsState object.
if worker is None or worker not in offline_workers:
    continue

label_sets_for_offline_workers.add(
    tuple(labels[label_name] for label_name in metric._labelnames)
)
Contributor Author
It is important to generate the label values in the correct order, i.e. the order in ._labelnames; otherwise we will not remove the label set from the metric (a KeyError will be raised, which we catch and pass). That's why I access this private attribute here. It is unlikely they will change the implementation and remove this attribute... I think it is a small enough smell to let it exist for now.

If I find the time, I would like to make it a read-only property of a metric in the prometheus_client project itself; then I will amend this here, or even better, add a method to prometheus_client to get the current label sets as well.
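The ordering concern can be illustrated with a minimal stand-in. FakeMetric below is hypothetical (not flower's or prometheus_client's actual code); it only mimics the relevant behaviour, namely that each child metric is keyed by a tuple of label values in _labelnames order.

```python
# Minimal sketch: prometheus_client keys each child metric by a tuple of
# label VALUES in _labelnames order, so the tuple passed to remove()
# must follow that same order.
class FakeMetric:
    """Hypothetical stand-in mimicking the relevant prometheus_client behaviour."""

    def __init__(self, labelnames):
        self._labelnames = tuple(labelnames)
        self._metrics = {}

    def labels(self, **labels):
        key = tuple(labels[name] for name in self._labelnames)
        self._metrics[key] = 0

    def remove(self, *label_values):
        # raises KeyError when the values are not in _labelnames order
        del self._metrics[tuple(label_values)]


metric = FakeMetric(['worker', 'type', 'task'])
metric.labels(worker='worker1@host', type='task-started', task='tasks.add')

# sample.labels as a dict; its own ordering is irrelevant because we
# rebuild the tuple by walking metric._labelnames
sample_labels = {'task': 'tasks.add', 'worker': 'worker1@host', 'type': 'task-started'}
label_set = tuple(sample_labels[name] for name in metric._labelnames)
metric.remove(*label_set)  # succeeds; a differently ordered tuple would raise KeyError
```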

offline_workers=offline_workers
)

def get_online_workers(self) -> Dict[str, Any]:
Contributor Author

You probably noticed I started introducing typing into the project.
The benefits are:

  • any modern IDE gives you nice autocompletion
  • easier reading and understanding of inputs and outputs, especially across multiple methods/objects

try:
    metric.remove(*label_set)
    logger.debug('Removed label set: %s for metric %s', label_set, metric)
except KeyError:
    # the label set may already have been removed elsewhere
    pass
Contributor Author

Since the metric could already be removed in another thread, it's best to just pass
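A sketch of why swallowing the KeyError is safe. The helper and the plain dict below are hypothetical stand-ins: the dict plays the role of the metric's internal children mapping, and `del` stands in for metric.remove(*label_set), which raises KeyError for an unknown label set.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('prometheus_metrics')


def remove_label_set(store, label_set):
    """Hypothetical helper: `store` stands in for the metric's children dict,
    `del` for metric.remove(*label_set)."""
    try:
        del store[label_set]
        logger.debug('Removed label set: %s', label_set)
        return True
    except KeyError:
        # another thread may have removed it already; nothing left to do
        return False


store = {('worker1@XPS-15-9560',): 0.0}
first = remove_label_set(store, ('worker1@XPS-15-9560',))   # removed
second = remove_label_set(store, ('worker1@XPS-15-9560',))  # already gone, no error
```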

@Tomasz-Kluczkowski
Contributor Author

@mher I pushed more code - please have a quick look and let me know if this solution is acceptable; then I will add all the missing tests.
Everything was tested manually multiple times with actual Prometheus/Grafana/Celery/Flower running.

for sampled_metric in sampled_metrics:
    for sample in sampled_metric.samples:
        labels = sample.labels
        worker = labels.get(LabelNames.WORKER)
Contributor Author

I prefer to use enums instead of strings to avoid typo problems, hence I used the LabelNames enum in the metric definitions and use it here to make sure we access the correct one. If you think it's too much, we can always go back to strings.
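The idea can be sketched as follows (the member names and values here are assumptions for illustration, not necessarily flower's exact enum): mixing in str makes each member compare and hash like a plain string, so it works directly as a dict lookup key.

```python
from enum import Enum


# Sketch: a str-mixin enum gives named constants that still behave as
# strings, so labels.get(LabelNames.WORKER) == labels.get('worker').
class LabelNames(str, Enum):
    WORKER = 'worker'
    TYPE = 'type'
    TASK = 'task'


labels = {'worker': 'worker1@host', 'type': 'task-started', 'task': 'tasks.add'}
worker = labels.get(LabelNames.WORKER)  # same result as labels.get('worker')
```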

)

@property
def transient_metrics(self) -> List[MetricWrapperBase]:
Contributor Author

If we think this should be configurable, maybe we could make it a second PR and add it as an additional feature. For now I think it makes total sense to just wipe any existing records from metrics for label sets containing offline workers. Otherwise the graphs are polluted with readings from several days ago.

@Tomasz-Kluczkowski Tomasz-Kluczkowski changed the title WIP: Prometheus fix offline worker metrics Prometheus: fix offline worker metrics Aug 15, 2021
def _get_label_sets_for_offline_workers(
    metric: MetricWrapperBase, offline_workers: Set[str]
) -> Set[Tuple[str, ...]]:
    sampled_metrics = metric.collect()
Contributor Author

.collect() gets a Metric object with samples. Each sample has a labels dictionary mapping label name to label value. We are interested in those values, and in checking whether the one under the 'worker' key is in our offline_workers set.

Not the prettiest way of getting this data, but collect() is the only common, public method of all the metric types we use that provides this kind of info.

I hope I can at some point bake it into the metrics themselves.
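The extraction described above can be sketched end to end with lightweight stand-ins. The Sample and Metric namedtuples here are assumptions mimicking the shape of what prometheus_client's collect() returns (they are not the library's real classes), and `labelnames` plays the role of metric._labelnames.

```python
from collections import namedtuple

# Stand-ins (assumptions) for the shape of prometheus_client's collect()
# output: Metric objects whose samples carry a labels dict of name -> value.
Sample = namedtuple('Sample', ['name', 'labels', 'value'])
Metric = namedtuple('Metric', ['name', 'samples'])

sampled_metrics = [Metric('flower_events', [
    Sample('flower_events_total',
           {'worker': 'worker1@host', 'type': 'task-started', 'task': 'tasks.add'}, 3.0),
    Sample('flower_events_total',
           {'worker': 'worker2@host', 'type': 'task-started', 'task': 'tasks.add'}, 1.0),
])]

offline_workers = {'worker1@host'}
labelnames = ('worker', 'type', 'task')  # what metric._labelnames would hold

label_sets_for_offline_workers = set()
for sampled_metric in sampled_metrics:
    for sample in sampled_metric.samples:
        worker = sample.labels.get('worker')
        if worker is None or worker not in offline_workers:
            continue
        # rebuild the value tuple in label-name order, ready for remove()
        label_sets_for_offline_workers.add(
            tuple(sample.labels[name] for name in labelnames)
        )
# only worker1@host's label set is collected for removal
```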

Mention that Prometheus metrics are also removed based on this setting.
@danyi1212

@mher Can you please provide details on when this will be added?

@drummerwolli

@Tomasz-Kluczkowski Can you update your branch? Then at least it will be possible for people to run (and test) your fork until this finally gets merged.

It's a real pity this isn't fixed yet. I would love to see this move to the main branch and get properly released.
