Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thanos Sidecar - Flush Endpoint #7295

Open
nicolastakashi opened this issue Apr 20, 2024 · 9 comments
Open

Thanos Sidecar - Flush Endpoint #7295

nicolastakashi opened this issue Apr 20, 2024 · 9 comments

Comments

@nicolastakashi
Copy link
Contributor

Is your proposal related to a problem?

On the Prometheus Operator project, we have implemented the ability to autoscale Prometheus shards when running Prometheus using the agent mode. This might be achieved using the native HPA or any tool like Keda.

For the Prometheus Server, we are still working to make this feature available and a lot of discussion is ongoing yet.
For users using Prometheus Server + Thanos Sidecar, I'd like to provide a graceful shutdown by flushing and uploading blocks before shutdown of the Prometheus Server.

I've added the ability to upload blocks to the blob storage using Thanos Tools, we can leverage this on Prometheus Operator by doing some hacks such as a preStop hook to flush tsdb and upload blocks. As we can see we already have feature requests from users implementing this on their own, as you can see here.

This works as expected but it's a bit hacky and I'd like to propose a new feature to the Thanos Sidecar.

Describe the solution you'd like

A new endpoint on the Thanos Sidecar e.g. /flush, /shutdown, or /snapshot (naming is hard 😅).
This brand-new endpoint should execute the following logic when invoked.

  1. Invoke Prometheus TSDB Snapshot Endpoint
  2. Upload Snapshot Blocks to the blob storage
  3. Delete the Snapshot dir after upload (I'm not sure about it)

This new endpoint is only providing an easier experience to graceful shutdown Prometheus Server when some of the following operations are required.

  1. Scaling Down Events
  2. Users Decreasing Disk Size
  3. Prometheus Migration from one place to another. (e.g namespace migration)

By having this new endpoint I envision Prometheus Operator calling it before scaling down the instance, and after this endpoint returns success we can just delete the Statefulset and its Persistent volume.

Describe alternatives you've considered

N/A

Additional context

@nicolastakashi
Copy link
Contributor Author

cc @ArthurSens @simonpasquier

@yeya24
Copy link
Contributor

yeya24 commented Apr 20, 2024

I've added the ability to upload blocks to the blob storage using Thanos Tools, we can leverage this on Prometheus Operator by doing some hacks such as a preStop hook to flush tsdb and upload blocks. As we can see we already have feature requests from users implementing this on their own, as you can see prometheus-operator/prometheus-operator#6540.

If Thanos sidecar implements this API, how would you call it from Prometheus operator? Would you still use the preStop hook to call the API or at downscaling event time?

@nicolastakashi
Copy link
Contributor Author

I've added the ability to upload blocks to the blob storage using Thanos Tools, we can leverage this on Prometheus Operator by doing some hacks such as a preStop hook to flush tsdb and upload blocks. As we can see we already have feature requests from users implementing this on their own, as you can see prometheus-operator/prometheus-operator#6540.

If Thanos sidecar implements this API, how would you call it from Prometheus operator? Would you still use the preStop hook to call the API or at downscaling event time?

The idea is invoke this on the scaling down event.

@yeya24
Copy link
Contributor

yeya24 commented Apr 22, 2024

I am not against adding this functionality into Thanos sidecar.

By having this new endpoint I envision Prometheus Operator calling it before scaling down the instance, and after this endpoint returns success we can just delete the Statefulset and its Persistent volume.

Prometheus operator has its own sidecar like the config-reloader. Why this feature cannot be implemented in another binary deployed as a sidecar as well? The code can be maintained in the Prometheus operator project.

@nicolastakashi
Copy link
Contributor Author

nicolastakashi commented Apr 22, 2024

I am not against adding this functionality into Thanos sidecar.

By having this new endpoint I envision Prometheus Operator calling it before scaling down the instance, and after this endpoint returns success we can just delete the Statefulset and its Persistent volume.

Prometheus operator has its own sidecar like the config-reloader. Why this feature cannot be implemented in another binary deployed as a sidecar as well? The code can be maintained in the Prometheus operator project.

My only concern about having this on projects like config-reloader is because this components will need to learn how to upload blocks to storage account.

Besides that I think might be a useful solution for people is not using Prometheus Operator as well, but I don't have any data about it to confirm it's just feeling 😅

@Nashluffy
Copy link

As someone having to orchestrate "flushing" outside of prometheus-operator currently, I would definitely leverage a /flush endpoint in thanos. I'd also much prefer it as an in-built command than a separate container like config-reloader.

It'd also be much easier to benefit from incremental progress on this, as I'd be able to remove a lot of my custom glue/wiring in my current setup, instead of having to wait for the changes to propagate all the way to prometheus-operator.

@yeya24
Copy link
Contributor

yeya24 commented Apr 22, 2024

As someone having to orchestrate "flushing" outside of prometheus-operator currently, I would definitely leverage a /flush endpoint in thanos. I'd also much prefer it as an in-built command than a separate container like config-reloader.

Once this functionality has been added to the Prometheus operator sidecar, there should be no difference in terms of UX compared to having the API natively?

But I think I am okay to add the new API. PR is welcomed.

@MichaHoffmann
Copy link
Contributor

There was some idea to run prometheus via libraries from within sidecar; that way we could build this properly. cc @SuperQ

@Nashluffy
Copy link

There was some idea to run prometheus via libraries from within sidecar; that way we could build this properly. cc @SuperQ

I've raised a PR that flushes via TSDB snapshotting, but happy to give this a try if you can point me in a general direction. IIUC we could change the implementation of the flush to what you proposed in the future without impacting end users, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants