
Offline workers remain in metrics #1128

Open
xneg opened this issue Jul 20, 2021 · 14 comments

@xneg

xneg commented Jul 20, 2021

Describe the bug
We are deploying celery workers in k8s. A Flower instance is also deployed in a separate pod. After a redeploy, when one pod is killed and a new pod is created, the worker that is gone still remains in the metrics.
There is only one online worker in the UI (with purge_offline_workers enabled), but all offline workers are still listed in the metrics.

# HELP flower_worker_online Worker online status
# TYPE flower_worker_online gauge
flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
flower_worker_online{worker="celery@superset-worker-787cd495dc-kjvt5"} 1.0
flower_worker_online{worker="celery@superset-worker-7d579647b9-hhf7c"} 1.0
flower_worker_online{worker="celery@superset-worker-75556c6c58-l46cm"} 1.0
flower_worker_online{worker="celery@superset-worker-7d579647b9-lpcfd"} 1.0
flower_worker_online{worker="celery@superset-worker-6f76c5ddf6-tl8bk"} 1.0
flower_worker_online{worker="celery@superset-worker-6c979f7b6f-4m65t"} 1.0

It looks like Flower does not receive the worker-offline event.

To Reproduce
Steps to reproduce the behavior:

  1. Run the celery worker with the command celery worker --app=superset.tasks.celery_app:app -Ofair -l INFO
  2. The celery config is:
     class CeleryConfig(object):
         BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
  3. Run flower with the command celery --broker=redis://${REDIS_HOST}:${REDIS_PORT}/${REDIS_RESULTS_DB} flower --purge_offline_workers=10
  4. See in metrics:
     flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
  5. Restart the worker pod.
  6. See in metrics:
     flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
     flower_worker_online{worker="celery@superset-worker-787cd495dc-kjvt5"} 1.0

Expected behavior
The online workers in the metrics should reflect the workers' actual state.

Screenshots
Screenshot 2021-07-20 at 14 53 14

@xneg xneg added the bug label Jul 20, 2021
@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Jul 25, 2021

Hi @xneg,

This is on my list, as I have a similar issue in our k8s deployment. I am aiming at re-using the purge-offline-workers CLI arg for this, or some smarter way of achieving it (suggestions are welcome). Currently purge-offline-workers is not used for anything Prometheus related.

For us, in our prod cluster, when a worker goes offline it stays offline in the Prometheus metric (as that is the last metric produced by Flower for that particular worker), when in fact a new pod has already replaced it.

Flower simply does not understand pod replacement (and never will), so we have to figure out how to clear the offline workers that are left hanging.

We do not have the problem of the offline signal being missing, but our workers are started with the -E option, which makes sure all events are sent - could you check if that changes the behaviour for you?

I will also look at what happens when I run the worker the way you do and see if the way Flower works can be improved.

I think that if you enable the -E option for your workers, the workers from old deployments will show as offline forever (https://docs.celeryproject.org/en/stable/reference/cli.html#cmdoption-celery-worker-E), because Flower does not understand that they were replaced by a restarted/redeployed pod.

One temporary work-around suggested at my place was to force a restart of Flower on each deployment (for example by having an always-changing label), but then we may lose continuity of measurements/tasks data, which is not something I like, and it does not help when a monitored worker simply dies due to OOM or other causes.

Anyway - thanks for reporting. I want to get this sorted asap as I need it working better at my place, but you know - it's holidays, it will take some time...

Suggestions are welcome.

@mher Please assign this one to me.

@xneg
Author

xneg commented Jul 26, 2021

Hello @Tomasz-Kluczkowski, thank you for your response!

Enabling option -E does not help.
So, again our steps:

  1. Run the worker in a pod with the command celery --app=superset.tasks.celery_app:app worker -E
  2. Kill the pod with kubectl delete pods <pod_name>
  3. After that we have one extra worker in the metrics.

I still think Flower does not receive the worker-offline event.
But I noticed another issue: we run our workers with celery==4.4.7, while flower requires celery >= 5.0.5.
Maybe this is the cause?

@xneg
Author

xneg commented Jul 26, 2021

I can't figure out how to debug it right now, so what I write next is just an assumption.
It looks like the workers do not send online/offline events, and all the knowledge Flower has about workers comes only from heartbeats. With purge_offline_workers you can filter out workers without a recent heartbeat, but this currently does not work for the metrics.
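If that assumption holds, the heartbeat-based filtering that purge_offline_workers already does for the UI can be sketched roughly like this (a minimal illustration only; the `alive_workers` helper and the `workers` dict are hypothetical, not Flower's actual code):

```python
import time

PURGE_OFFLINE_WORKERS = 10  # seconds, as in --purge_offline_workers=10

def alive_workers(workers, now=None, timeout=PURGE_OFFLINE_WORKERS):
    """Keep only the workers whose last heartbeat is fresh enough."""
    now = time.time() if now is None else now
    return {name: beat for name, beat in workers.items() if now - beat <= timeout}

# workers maps worker name -> timestamp of its last heartbeat (assumption:
# Flower keeps comparable per-worker state internally)
workers = {
    "celery@worker-old": time.time() - 60,  # stale: pod was killed
    "celery@worker-new": time.time(),       # fresh heartbeat
}
print(sorted(alive_workers(workers)))  # ['celery@worker-new']
```

The UI applies this kind of filter when rendering; the Prometheus endpoint does not, which would explain why stale workers linger in the metrics only.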

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Jul 26, 2021

This is a wider problem as all metrics are affected by it in docker/kubernetes.

The final metric data point registered remains for that label forever.

This applies to the worker online metric as well as, for example, the number of tasks prefetched, where the last value will be shown in Grafana's graphs forever.

My thoughts so far:

The only way I see to fix it is, once a worker is considered offline for long enough (re-purposing the purge-offline-workers CLI arg), to remove all data associated with that worker from ALL Prometheus metrics, not only from the worker-online one.

This means some data loss (but only for a worker that no longer exists); atm I do not see a nicer way of doing it for k8s.

The counter argument is that we may want to know what was happening to that worker before it went offline for good...

I will do some reading on the Python Prometheus client and see what the best approach is, but if anyone has better ideas - give us a shout.

@xneg
Author

xneg commented Jul 26, 2021

I think we should distinguish between gauge and counter metrics.
Counter metrics tend to increase; e.g. I want to see the overall number of tasks done per day, including those from offline workers.
But gauge metrics represent the current state, and they should reflect the real situation.

The only way I see to fix it is, once a worker is considered offline for long enough (re-purposing the purge-offline-workers CLI arg), to remove all data associated with that worker from ALL Prometheus metrics, not only from the worker-online one.

I see an implementation problem here. purge-offline-workers works when you get data through the HTTP API: the setting just filters out stale workers. You can't decrease a metric, because you never receive an event saying the worker is offline. To implement the same behavior for Prometheus metrics, you would have to inject this logic into the Prometheus endpoint. That is not what comes out of the box, but it is possible.
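One way such logic could be injected at the endpoint is a wrapping collector that drops samples of stale workers at scrape time. This is a hedged sketch, not Flower's implementation; `last_heartbeat`, `PURGE_TIMEOUT`, and `PurgingCollector` are illustrative names:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, generate_latest

PURGE_TIMEOUT = 10   # seconds a worker may be silent before its samples are dropped
last_heartbeat = {}  # worker name -> timestamp of its last heartbeat (illustrative)

class PurgingCollector:
    """Wraps a registry and filters out samples of stale workers at scrape time."""

    def __init__(self, registry):
        self.registry = registry

    def collect(self):
        now = time.time()
        for metric in self.registry.collect():
            # keep samples without a "worker" label, and samples whose
            # worker heartbeat is fresh enough
            metric.samples = [
                s for s in metric.samples
                if "worker" not in s.labels
                or now - last_heartbeat.get(s.labels["worker"], 0.0) <= PURGE_TIMEOUT
            ]
            yield metric

inner = CollectorRegistry()
online = Gauge("flower_worker_online", "Worker online status",
               ["worker"], registry=inner)
online.labels(worker="celery@new").set(1)
online.labels(worker="celery@old").set(1)
last_heartbeat["celery@new"] = time.time()       # fresh heartbeat
last_heartbeat["celery@old"] = time.time() - 60  # stale: pod was killed

scrape = CollectorRegistry()
scrape.register(PurgingCollector(inner))
output = generate_latest(scrape).decode()
print("celery@old" in output)  # the stale worker's sample is gone
```

The event handlers keep updating the inner registry as before; only what the scrape endpoint exposes is filtered.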

@beje2k15

I'm struggling with the same issue, but delayed looking into it until I had upgraded to celery 5. Now that this is done, I looked into the issue and wanted to share my thoughts:

I think the main reason for this issue is that flower/celery isn't quite fit for k8s. In particular, Flower still handles workers as if they were manually set up on nodes, in the sense that it expects a worker to come back online later. With ReplicaSets in k8s this isn't the case, as a new pod will have a new random name. A way to work around this would be to deploy all workers with a StatefulSet, which causes all pods to have predictable names, and a restarted pod inherits the name of the old pod it replaces (e.g. worker-3 crashes, so a new pod named worker-3 replaces the old one).

While this is a solution that can be implemented right now without any changes in Flower, I don't think it is the best one: we don't really need any state aside from the predictable hostname, and therefore a worker is better set up using a ReplicaSet. So Flower should support, in some manner, workers that never reappear.

In flower/events.py there is code which handles worker-offline events, but the metrics are only updated to mark the worker as offline. However, it is also possible to delete a metric:

>>> from prometheus_client import Gauge
>>> # Add a gauge metric
>>> g = Gauge("test", "Test", ["instance"])
>>> # set some values, like worker online
>>> g.labels(instance="1").set(1)
>>> g.labels(instance="2").set(1)
>>> # test it
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None), Sample(name='test', labels={'instance': '2'}, value=1.0, timestamp=None, exemplar=None)])]
>>> # current implementation of worker-offline
>>> g.labels(instance="2").set(0)
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None), Sample(name='test', labels={'instance': '2'}, value=0.0, timestamp=None, exemplar=None)])]
>>> # additionally the metric can be deleted completly:
>>> g.remove("2")
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None)])]
>>> 

Just adding a self.metrics.worker_online.remove(worker_name) in the event handling probably doesn't cut it: the metric would vanish before a value of 0 could be scraped, and any alerting set up to match offline workers would fail to notice the worker going offline. Probably the best solution is to add the remove call to the cleanup code controlled by purge_offline_workers. This way a worker would go offline, and after the timeout set with purge_offline_workers the metric would vanish and stop cluttering dashboards.
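The two-phase behaviour described above (set the gauge to 0 on worker-offline, remove the label set after the purge timeout) could look roughly like this. The state dict and helper names are illustrative, not Flower's internals:

```python
import time

from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
worker_online = Gauge("flower_worker_online", "Worker online status",
                      ["worker"], registry=registry)
offline_since = {}  # worker name -> time the worker-offline event was seen (illustrative)

def on_worker_offline(name):
    # phase 1: publish 0 so alerting can observe the offline transition
    worker_online.labels(worker=name).set(0)
    offline_since[name] = time.time()

def purge_stale_workers(timeout):
    # phase 2: after the purge timeout, drop the label set entirely
    now = time.time()
    for name, since in list(offline_since.items()):
        if now - since > timeout:
            worker_online.remove(name)
            del offline_since[name]

worker_online.labels(worker="celery@w1").set(1)
on_worker_offline("celery@w1")
offline_since["celery@w1"] -= 60  # pretend the worker has been offline for a minute
purge_stale_workers(timeout=10)
samples = list(worker_online.collect())[0].samples
print(len(samples))  # 0: the label set was removed
```

Between the offline event and the purge, scrapes still see the 0 value, so alerts on offline workers keep working; only afterwards does the series disappear from dashboards.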

At the moment I'm not familiar enough with the codebase to propose a pull request. Perhaps I could do this later, but I would like to hear your opinion(s) in this matter first.

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Aug 23, 2021 via email

@beje2k15

I'll try to dig into it after my vacation.

The issue for prometheus-client would be handy for flower_events_total. In my first post I totally forgot about this metric and thought only about the worker-online metric.

Regarding purge_offline_workers, I found the code in views/dashboard.py, which is probably not suitable for these changes (I guess metric deletion would then only happen when the dashboard page is loaded/refreshed). There should probably be an extra check in the event handling logic so this happens independently of page loads. Maybe the worker deletion there can just be moved completely into the event handling.

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Aug 23, 2021 via email

@karolmartyna

@Tomasz-Kluczkowski thanks for resolving this problem - really appreciate it.
Will your PR Prometheus: fix offline worker metrics #1135 be included in the next release?

@Tomasz-Kluczkowski
Contributor

Hi,

I am not sure as I am not a maintainer and @mher did not express any opinions so far.

I will probably come back to this later.

The prometheus-client people suggest generating the metrics only on scrape, which would avoid this issue.

The problem I see with that is that all our metrics are event driven. If I want to compute them on scrape, I need some way of knowing which data is new to base them on, accessible in a thread-safe manner.

To me removing labels for offline workers seems more natural.
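For comparison, the compute-on-scrape approach could look roughly like this: a custom collector that rebuilds the gauge from the current worker state on every scrape. This is a sketch under assumptions; `worker_state`, its `heartbeat` field, and the 10-second threshold are all illustrative, not Flower's actual state:

```python
import time

from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily

HEARTBEAT_TIMEOUT = 10  # illustrative staleness threshold, in seconds
worker_state = {  # assumed shape of Flower's per-worker state
    "celery@w1": {"heartbeat": time.time()},       # fresh
    "celery@w2": {"heartbeat": time.time() - 60},  # stale
}

class WorkerOnlineCollector:
    """Builds the worker-online gauge fresh on every scrape."""

    def collect(self):
        gauge = GaugeMetricFamily("flower_worker_online", "Worker online status",
                                  labels=["worker"])
        now = time.time()
        for name, state in worker_state.items():
            online = 1.0 if now - state["heartbeat"] <= HEARTBEAT_TIMEOUT else 0.0
            gauge.add_metric([name], online)
        yield gauge

registry = CollectorRegistry()
registry.register(WorkerOnlineCollector())
text = generate_latest(registry).decode()
print(text)
```

Nothing is mutated from event handlers here, so there is no stale series to purge; the trade-off, as noted above, is that the worker state must be shared with the scrape path in a thread-safe way.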

@karolmartyna

Thanks! I will apply your suggestion and follow future releases.
I also thought about switching from a Deployment to a StatefulSet [k8s], which should keep the pod name.

@HungPhann

HungPhann commented Jul 21, 2022

Hi,

I'm using flower==1.0.0 with celery==5.2.6 and am still encountering the same issue.
May I know whether it has been fixed yet?

@jkranner

Hi,
I am using flower 2.0.1 with celery 5.3.4 and am also encountering the problems mentioned above.
The worker shows as offline in the Flower UI, but is reported with flower_worker_online=1 (online) in the metrics.
Is there any update on this issue?
