
Offline workers remain in metrics #1128

Open
xneg opened this issue Jul 20, 2021 · 14 comments

@xneg

xneg commented Jul 20, 2021

Describe the bug
We are deploying celery workers in k8s. A Flower instance is also deployed in a separate pod. After a redeploy, when one pod is killed and a new pod is created, the worker that is gone still remains in the metrics.
There is only one online worker in the UI (with purge_offline_workers enabled), but all offline workers are still listed in the metrics.

# HELP flower_worker_online Worker online status
# TYPE flower_worker_online gauge
flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
flower_worker_online{worker="celery@superset-worker-787cd495dc-kjvt5"} 1.0
flower_worker_online{worker="celery@superset-worker-7d579647b9-hhf7c"} 1.0
flower_worker_online{worker="celery@superset-worker-75556c6c58-l46cm"} 1.0
flower_worker_online{worker="celery@superset-worker-7d579647b9-lpcfd"} 1.0
flower_worker_online{worker="celery@superset-worker-6f76c5ddf6-tl8bk"} 1.0
flower_worker_online{worker="celery@superset-worker-6c979f7b6f-4m65t"} 1.0

It looks like Flower does not receive the worker-offline event.

To Reproduce
Steps to reproduce the behavior:

  1. Run the celery worker with the command celery worker --app=superset.tasks.celery_app:app -Ofair -l INFO
  2. The celery config is:
     class CeleryConfig(object):
         BROKER_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}/{REDIS_CELERY_DB}"
  3. Run flower with the command celery --broker=redis://${REDIS_HOST}:${REDIS_PORT}/${REDIS_RESULTS_DB} flower --purge_offline_workers=10
  4. See in metrics:
     flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
  5. Restart the worker pod.
  6. See in metrics:
     flower_worker_online{worker="celery@superset-worker-788b99d877-6xm2h"} 1.0
     flower_worker_online{worker="celery@superset-worker-787cd495dc-kjvt5"} 1.0

Expected behavior
The online workers in the metrics should reflect the workers' actual state.

Screenshots
Screenshot 2021-07-20 at 14 53 14

@xneg xneg added the bug label Jul 20, 2021
@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Jul 25, 2021

Hi @xneg,

This is on my list, as I have a similar issue in our k8s deployment. I am aiming at re-using the purge-offline-workers CLI arg for this, or some smarter way of achieving it (suggestions are welcome). Currently purge-offline-workers is not used for anything Prometheus related.

For us, in our prod cluster, when a worker goes offline it stays offline in the Prometheus metric (as that is the last metric produced by Flower for that particular worker), when in fact a new pod has already replaced it.

Flower simply does not understand pod replacement (and never will), so we have to figure out how to clear the offline workers that are left hanging.

We do not have the problem of the offline signal being missing, but our workers are started with the -E option, which makes sure all events are sent - could you check if that changes the behaviour for you?

I will also look at what happens when I run the worker the way you do and see if the way Flower works can be improved.

I think that if you enable the -E option for your workers, the workers from old deployments will show as offline forever (https://docs.celeryproject.org/en/stable/reference/cli.html#cmdoption-celery-worker-E), because Flower does not understand that they were replaced by a restarted/redeployed pod.

One temporary work-around suggested at my place was to force a restart of Flower on each deployment (for example by having an always-changing label), but then we may lose continuity of measurements/tasks data, which is not something I like, and it does not help when a monitored worker simply dies due to OOM or other causes.

Anyway - thanks for reporting. I want to get this sorted asap as I need it working better at my place, but you know - it's holidays, it will take some time...

Suggestions are welcome.

@mher Please assign this one to me.

@xneg
Author

xneg commented Jul 26, 2021

Hello @Tomasz-Kluczkowski, thank you for your response!

Enabling option -E does not help.
So, again our steps:

  1. Run the worker in a pod with the command celery --app=superset.tasks.celery_app:app worker -E
  2. Kill the pod with kubectl delete pods <pod_name>
  3. After that we have one extra worker in the metrics.

I still think Flower does not receive the worker-offline event.
But I noticed another issue: we run our workers with celery==4.4.7, while flower requires celery >= 5.0.5.
Maybe this is the cause?

@xneg
Author

xneg commented Jul 26, 2021

I can't figure out how to debug it right now, so what I write next is just an assumption.
It looks like the workers do not send online/offline events, and all the knowledge Flower has about workers comes only from heartbeats. With purge_offline_workers you can filter out workers without a recent heartbeat, but this currently does not work for the metrics.
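If that assumption holds, the heartbeat-based filtering that purge_offline_workers already does for the UI can be sketched roughly like this (a minimal illustration only; the `alive_workers` helper and the `workers` dict are hypothetical, not Flower's actual code):

```python
import time

PURGE_OFFLINE_WORKERS = 10  # seconds, as in --purge_offline_workers=10

def alive_workers(workers, now=None, timeout=PURGE_OFFLINE_WORKERS):
    """Keep only the workers whose last heartbeat is fresh enough."""
    now = time.time() if now is None else now
    return {name: beat for name, beat in workers.items() if now - beat <= timeout}

# workers maps worker name -> timestamp of its last heartbeat (assumption:
# Flower keeps comparable per-worker state internally)
workers = {
    "celery@worker-old": time.time() - 60,  # stale: pod was killed
    "celery@worker-new": time.time(),       # fresh heartbeat
}
print(sorted(alive_workers(workers)))  # ['celery@worker-new']
```

The UI applies this kind of filter when rendering; the Prometheus endpoint does not, which would explain why stale workers linger in the metrics only.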

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Jul 26, 2021

This is a wider problem as all metrics are affected by it in docker/kubernetes.

The final metric data point registered remains for that label forever.

This applies to the worker online metric as well as, for example, the number of tasks prefetched, where the last value will be shown in Grafana's graphs forever.

My thoughts so far:

The only way I see to fix it is, once a worker is considered offline for long enough (re-purposing the purge-offline-workers CLI arg), to remove all data associated with that worker from ALL Prometheus metrics, not only from the worker-online one.

This means some data loss (but only for a worker that no longer exists); atm I do not see a nicer way of doing it for k8s.

The counter argument is that we may want to know what was happening to that worker before it went offline for good...

I will do some reading on the Python Prometheus client and see what the best approach is, but if anyone has better ideas - give us a shout.

@xneg
Author

xneg commented Jul 26, 2021

I think we should distinguish between gauge and counter metrics.
Counter metrics tend to increase; e.g. I want to see the overall number of tasks done per day, including those from offline workers.
But gauge metrics represent the current state, and they should reflect the real situation.

The only way I see to fix it is, once a worker is considered offline for long enough (re-purposing the purge-offline-workers CLI arg), to remove all data associated with that worker from ALL Prometheus metrics, not only from the worker-online one.

I see an implementation problem here. purge-offline-workers works when you get data through the HTTP API: the setting just filters out stale workers. You can't decrease a metric, because you never receive an event saying the worker is offline. To implement the same behavior for Prometheus metrics, you would have to inject this logic into the Prometheus endpoint. That is not what comes out of the box, but it is possible.
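One way such logic could be injected at the endpoint is a wrapping collector that drops samples of stale workers at scrape time. This is a hedged sketch, not Flower's implementation; `last_heartbeat`, `PURGE_TIMEOUT`, and `PurgingCollector` are illustrative names:

```python
import time

from prometheus_client import CollectorRegistry, Gauge, generate_latest

PURGE_TIMEOUT = 10   # seconds a worker may be silent before its samples are dropped
last_heartbeat = {}  # worker name -> timestamp of its last heartbeat (illustrative)

class PurgingCollector:
    """Wraps a registry and filters out samples of stale workers at scrape time."""

    def __init__(self, registry):
        self.registry = registry

    def collect(self):
        now = time.time()
        for metric in self.registry.collect():
            # keep samples without a "worker" label, and samples whose
            # worker heartbeat is fresh enough
            metric.samples = [
                s for s in metric.samples
                if "worker" not in s.labels
                or now - last_heartbeat.get(s.labels["worker"], 0.0) <= PURGE_TIMEOUT
            ]
            yield metric

inner = CollectorRegistry()
online = Gauge("flower_worker_online", "Worker online status",
               ["worker"], registry=inner)
online.labels(worker="celery@new").set(1)
online.labels(worker="celery@old").set(1)
last_heartbeat["celery@new"] = time.time()       # fresh heartbeat
last_heartbeat["celery@old"] = time.time() - 60  # stale: pod was killed

scrape = CollectorRegistry()
scrape.register(PurgingCollector(inner))
output = generate_latest(scrape).decode()
print("celery@old" in output)  # the stale worker's sample is gone
```

The event handlers keep updating the inner registry as before; only what the scrape endpoint exposes is filtered.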

@beje2k15

I'm struggling with the same issue, but delayed looking into it until I had upgraded to celery 5. Now that this is done, I looked into the issue and wanted to share my thoughts:

I think the main reason for this issue is that flower/celery isn't quite fit for k8s. In particular, Flower still handles workers as if they were manually set up on nodes, in the sense that it expects a worker to come back online later. With ReplicaSets in k8s this isn't the case, as a new pod will have a new random name. A way to work around this would be to deploy all workers with a StatefulSet, which causes all pods to have predictable names, and a restarted pod inherits the name of the old pod it replaces (e.g. worker-3 crashes, so a new pod named worker-3 replaces the old one).

While this is a solution that can be implemented right now without any changes in Flower, I don't think it is the best one: we don't really need any state aside from the predictable hostname, and therefore a worker is better set up using a ReplicaSet. So Flower should support, in some manner, workers that never reappear.

In flower/events.py there is code which handles worker-offline events, but the metrics are only updated to mark the worker as offline. However, it is also possible to delete a metric:

>>> from prometheus_client import Gauge
>>> # Add a gauge metric
>>> g = Gauge("test", "Test", ["instance"])
>>> # set some values, like worker online
>>> g.labels(instance="1").set(1)
>>> g.labels(instance="2").set(1)
>>> # test it
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None), Sample(name='test', labels={'instance': '2'}, value=1.0, timestamp=None, exemplar=None)])]
>>> # current implementation of worker-offline
>>> g.labels(instance="2").set(0)
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None), Sample(name='test', labels={'instance': '2'}, value=0.0, timestamp=None, exemplar=None)])]
>>> # additionally the metric can be deleted completly:
>>> g.remove("2")
>>> g.collect()
[Metric(test, Test, gauge, , [Sample(name='test', labels={'instance': '1'}, value=1.0, timestamp=None, exemplar=None)])]
>>> 

Just adding a self.metrics.worker_online.remove(worker_name) in the event handling probably doesn't cut it: the metric would vanish before a value of 0 could be scraped, and any alerting set up to match offline workers would fail to notice the worker going offline. Probably the best solution is to add the remove call to the cleanup code controlled by purge_offline_workers. This way a worker would go offline, and after the timeout set with purge_offline_workers the metric would vanish and stop cluttering dashboards.
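The two-phase behaviour described above (set the gauge to 0 on worker-offline, remove the label set after the purge timeout) could look roughly like this. The state dict and helper names are illustrative, not Flower's internals:

```python
import time

from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
worker_online = Gauge("flower_worker_online", "Worker online status",
                      ["worker"], registry=registry)
offline_since = {}  # worker name -> time the worker-offline event was seen (illustrative)

def on_worker_offline(name):
    # phase 1: publish 0 so alerting can observe the offline transition
    worker_online.labels(worker=name).set(0)
    offline_since[name] = time.time()

def purge_stale_workers(timeout):
    # phase 2: after the purge timeout, drop the label set entirely
    now = time.time()
    for name, since in list(offline_since.items()):
        if now - since > timeout:
            worker_online.remove(name)
            del offline_since[name]

worker_online.labels(worker="celery@w1").set(1)
on_worker_offline("celery@w1")
offline_since["celery@w1"] -= 60  # pretend the worker has been offline for a minute
purge_stale_workers(timeout=10)
samples = list(worker_online.collect())[0].samples
print(len(samples))  # 0: the label set was removed
```

Between the offline event and the purge, scrapes still see the 0 value, so alerts on offline workers keep working; only afterwards does the series disappear from dashboards.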

At the moment I'm not familiar enough with the codebase to propose a pull request. Perhaps I could do this later, but I would like to hear your opinion(s) in this matter first.

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Aug 23, 2021 via email

@beje2k15

I'll try to dig into it after my vacation.

The issue for prometheus-client would be handy for flower_events_total. In my first post I totally forgot about this metric and thought only about the worker-online metric.

Regarding purge_offline_workers, I found the code in views/dashboard.py, which is probably not suitable for these changes (I guess metric deletion would then only happen when the dashboard page is loaded/refreshed). There should probably be an extra check in the event handling logic so this happens independently of page loads. Maybe the worker deletion there can just be moved completely into the event handling.

@Tomasz-Kluczkowski
Contributor

Tomasz-Kluczkowski commented Aug 23, 2021 via email

@karolmartyna

@Tomasz-Kluczkowski thanks for resolving this problem - really appreciate it.
Will your PR Prometheus: fix offline worker metrics #1135 be included in the next release?

@Tomasz-Kluczkowski
Contributor

Hi,

I am not sure as I am not a maintainer and @mher did not express any opinions so far.

I will probably come back to this later.

The prometheus-client people suggest generating the metrics only on scrape, which would avoid this issue.

The problem I see with that is that all our metrics are event driven. If I want to compute them on scrape, I need some way of knowing which data is new to base them on, accessible in a thread-safe manner.

To me removing labels for offline workers seems more natural.
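For comparison, the compute-on-scrape approach could look roughly like this: a custom collector that rebuilds the gauge from the current worker state on every scrape. This is a sketch under assumptions; `worker_state`, its `heartbeat` field, and the 10-second threshold are all illustrative, not Flower's actual state:

```python
import time

from prometheus_client import CollectorRegistry, generate_latest
from prometheus_client.core import GaugeMetricFamily

HEARTBEAT_TIMEOUT = 10  # illustrative staleness threshold, in seconds
worker_state = {  # assumed shape of Flower's per-worker state
    "celery@w1": {"heartbeat": time.time()},       # fresh
    "celery@w2": {"heartbeat": time.time() - 60},  # stale
}

class WorkerOnlineCollector:
    """Builds the worker-online gauge fresh on every scrape."""

    def collect(self):
        gauge = GaugeMetricFamily("flower_worker_online", "Worker online status",
                                  labels=["worker"])
        now = time.time()
        for name, state in worker_state.items():
            online = 1.0 if now - state["heartbeat"] <= HEARTBEAT_TIMEOUT else 0.0
            gauge.add_metric([name], online)
        yield gauge

registry = CollectorRegistry()
registry.register(WorkerOnlineCollector())
text = generate_latest(registry).decode()
print(text)
```

Nothing is mutated from event handlers here, so there is no stale series to purge; the trade-off, as noted above, is that the worker state must be shared with the scrape path in a thread-safe way.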

@karolmartyna

Thanks! I will apply your suggestion and follow future releases.
I also thought about switching from a Deployment to a StatefulSet [k8s], which should keep the pod name.

@HungPhann

HungPhann commented Jul 21, 2022

Hi,

I'm using flower==1.0.0 with celery==5.2.6 and am still encountering the same issue.
May I know whether it has been fixed yet?

@jkranner

Hi,
I am using flower 2.0.1 with celery 5.3.4 and am also encountering the problems mentioned above.
The worker shows as offline in the Flower UI, but is reported with flower_worker_online=1 (online) in the metrics.
Is there any update on this issue?
