receive: memory spikes during tenant WAL truncation #7255
Comments
This seems to coincide with the intervals when head compaction happens. I think this process acquires a write lock, and pending samples pile up in memory while it is held. @yeya24 do you see something similar in Cortex?
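To make the suspected mechanism concrete, here is a toy Go sketch (entirely hypothetical, not Prometheus's actual head implementation): while truncation holds the head's write lock, appenders block on the read lock, and every blocked remote-write request keeps its decoded samples alive until the lock is released.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// head is a toy stand-in for the TSDB head. Truncation takes the write lock;
// appends take the read lock, mirroring the suspected contention.
type head struct {
	mtx  sync.RWMutex
	data [][]float64 // one slot per series; goroutines write distinct slots
}

// truncate models head compaction / WAL truncation holding the write lock.
func (h *head) truncate() {
	h.mtx.Lock()
	defer h.mtx.Unlock()
	time.Sleep(2 * time.Second) // stand-in for GC'ing old chunks and truncating the WAL
}

// append models one remote-write request. It blocks while truncate holds the
// write lock, so its sample batch stays resident in memory the whole time.
func (h *head) append(ref int, samples []float64) {
	h.mtx.RLock()
	defer h.mtx.RUnlock()
	h.data[ref] = samples
}

func main() {
	const nRequests = 100
	h := &head{data: make([][]float64, nRequests)}

	go h.truncate()
	time.Sleep(100 * time.Millisecond) // let truncate grab the lock first

	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < nRequests; i++ {
		wg.Add(1)
		go func(ref int) {
			defer wg.Done()
			h.append(ref, make([]float64, 10_000)) // ~80KB held per blocked request
		}(i)
	}
	wg.Wait()
	fmt.Printf("all appends finished after %s (blocked behind truncation)\n", time.Since(start))
}
```

In the real system the held memory would be the decoded remote-write payloads, so all receivers truncating at once would plausibly show up as a fleet-wide spike.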
Just confirming that the compactions happen at the same time as the memory spikes.
Did you get context deadline exceeded (500) errors from the ingesters during the WAL compaction?
Yeah, this optimization is something that needs to be done on the Prometheus side :/ I think this is the hot path: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1543-L1554
Some improvements that could be made, IMHO:
Cortex and Mimir solve this by adding jitter between compactions for different tenants. We can disable automatic compaction in the TSDB and manage it ourselves, as in the sketch below.
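A minimal sketch of that approach, assuming one *tsdb.DB per tenant as in receive's multi-TSDB layout. DisableCompactions() and Compact() are real prometheus/tsdb methods (recent Prometheus versions pass a context.Context to Compact; the no-argument form is assumed here), while the jittered scheduling loop itself is hypothetical:

```go
// Package compactsched is a hypothetical scheduler that replaces the TSDB's
// built-in compaction loop with a jittered, per-tenant one.
package compactsched

import (
	"log"
	"math/rand"
	"time"

	"github.com/prometheus/prometheus/tsdb"
)

// Run disables automatic compaction on every tenant's TSDB, then triggers
// compaction manually with random jitter so tenants do not all truncate
// their WALs (and hold their head write locks) at the same instant.
func Run(tenants map[string]*tsdb.DB, maxJitter, interval time.Duration) {
	for _, db := range tenants {
		db.DisableCompactions() // take over from the TSDB's own goroutine
	}
	for {
		for name, db := range tenants {
			// Spread tenants out; maxJitter must be > 0.
			time.Sleep(time.Duration(rand.Int63n(int64(maxJitter))))
			if err := db.Compact(); err != nil {
				log.Printf("compacting tenant %s: %v", name, err)
			}
		}
		time.Sleep(interval) // wait before the next sweep over all tenants
	}
}
```

The jitter bounds how many tenants pay the truncation cost at once, which flattens the aggregate spike rather than eliminating the work.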
Thanos, Prometheus and Golang version used:
Object Storage Provider: GCS
What happened:
We have several Prometheus instances remote-writing to a set of 30 receivers. The receivers normally hover around 8GiB of memory, but once every 2 hours memory spikes across all receivers at the same time by roughly 20-25%. That cadence lines up with the TSDB's default 2h block range, after which the head is compacted and the WAL truncated.
The corresponding tenant WAL truncations occur across all receivers at the same time.
There are other memory spikes whose root cause I'm not certain of (e.g., at 6:30 and 9:07). But looking at receiver memory usage over the past two weeks, there are consistent spikes whenever tenant WAL truncations happen.
What you expected to happen:
No memory spikes during WAL truncation, or the ability to stagger when truncation happens.
How to reproduce it (as minimally and precisely as possible):
Unsure; I'm running a fairly standard remote-write + receiver setup. I've raised this in the CNCF Slack, and at least one other person has observed the same memory spikes.
Full logs to relevant components:
Anything else we need to know: