
receive: memory spikes during tenant WAL truncation #7255

Open
Nashluffy opened this issue Apr 2, 2024 · 5 comments

Comments

@Nashluffy

Thanos, Prometheus and Golang version used:

thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: GCS

What happened:
We have several Prometheus instances remote-writing to a set of 30 receivers. The receivers normally hover around 8 GiB of memory, but every 2 hours memory spikes by roughly 20-25% across all receivers at the same time.

[screenshot: receiver memory usage, 2024-04-02, spiking across all receivers every 2 hours]

And the corresponding WAL truncations across all receivers.

[graph: tenant WAL truncations across all receivers]

There are other memory spikes whose root cause I'm not certain of, like at 6:30 and 9:07. But looking at receiver memory usage over the past 2 weeks, there are consistent spikes whenever tenant WAL truncations happen.

What you expected to happen:
No memory spikes during WAL truncation, or the ability to stagger when truncation happens.

How to reproduce it (as minimally and precisely as possible):
Unsure; I'm running a fairly standard remote-write + receiver setup. I've raised this in the CNCF Slack, and at least one other person has observed the same memory spikes.

Full logs to relevant components:

Anything else we need to know:

@fpetkovski
Contributor

This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock and pending samples pile up in memory. @yeya24 do you see something similar in Cortex?
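As a rough mental model of that contention (a toy sketch only, not the actual Prometheus TSDB code): truncation holds the head's write lock while it runs, so concurrent remote-write appends block behind it, and the samples they carry sit in memory until the lock is released.

```go
// Toy illustration of the contention described above, not real TSDB code:
// truncation holds the write lock, so appenders queue up behind it.
package main

import (
	"fmt"
	"sync"
	"time"
)

type head struct {
	mtx sync.RWMutex
}

// append stands in for a remote-write appender; it needs the read lock.
func (h *head) append() {
	h.mtx.RLock()
	defer h.mtx.RUnlock()
	// per-series work would happen here
}

// truncate stands in for head compaction / WAL truncation; it needs the write lock.
func (h *head) truncate(d time.Duration) {
	h.mtx.Lock()
	defer h.mtx.Unlock()
	time.Sleep(d) // pretend we are dropping old chunks and truncating the WAL
}

func main() {
	h := &head{}

	go h.truncate(2 * time.Second)
	time.Sleep(100 * time.Millisecond) // let truncation grab the lock first

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.append() // blocks until truncation releases the lock
		}()
	}
	wg.Wait()
	fmt.Printf("appends waited ~%s behind truncation\n", time.Since(start).Round(time.Second))
}
```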

@Nashluffy
Author

> This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock and pending samples pile up in memory. @yeya24 do you see something similar in Cortex?

Just confirming that the compactions happen at the same time as the memory spikes:

[graph: head compactions coinciding with the memory spikes]

@jnyi
Contributor

jnyi commented Apr 10, 2024

Did you get context deadline exceeded (500) errors from ingestors during the WAL compaction?

@GiedriusS
Member

Yeah, this optimization is something that needs to be done on the Prometheus side :/ I think this is the hot path: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1543-L1554

Some improvements that could be made IMHO:
prometheus/prometheus#13642
prometheus/prometheus#13632

@fpetkovski
Contributor

Cortex and Mimir solve this by adding jitter between compactions for different tenants. We can disable automatic compaction in the TSDB and manage it ourselves.
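A minimal sketch of how that scheduling could look, purely illustrative and not the actual receive code: the `tenantDB` interface below mirrors the relevant `*tsdb.DB` methods (`DisableCompactions`, `Compact`), whose exact signatures vary across Prometheus versions, and `scheduleCompactions` is a hypothetical helper.

```go
// Sketch: turn off the TSDB's automatic head compaction and trigger it per
// tenant on a jittered schedule, so tenants do not all truncate their WALs
// (and spike memory) at the same moment.
package main

import (
	"log"
	"math/rand"
	"time"
)

// tenantDB captures the two calls this sketch relies on. *tsdb.DB exposes
// similarly named methods, but their exact signatures differ across
// Prometheus versions, so a local interface is used here.
type tenantDB interface {
	DisableCompactions()
	Compact() error
}

// scheduleCompactions is a hypothetical helper: every interval it compacts
// each tenant's head, offset by a random delay within jitterWindow.
func scheduleCompactions(tenants map[string]tenantDB, interval, jitterWindow time.Duration) {
	for name, db := range tenants {
		db.DisableCompactions() // we own the compaction schedule from here on

		go func(name string, db tenantDB) {
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for range ticker.C {
				// Per-tenant jitter spreads the write-lock windows apart.
				time.Sleep(time.Duration(rand.Int63n(int64(jitterWindow))))
				if err := db.Compact(); err != nil {
					log.Printf("tenant %s: head compaction failed: %v", name, err)
				}
			}
		}(name, db)
	}
}

// fakeDB is a stand-in tenant so the sketch runs without a real TSDB.
type fakeDB struct{ name string }

func (f *fakeDB) DisableCompactions() {}
func (f *fakeDB) Compact() error {
	log.Printf("tenant %s: compacting head", f.name)
	return nil
}

func main() {
	tenants := map[string]tenantDB{
		"tenant-a": &fakeDB{name: "tenant-a"},
		"tenant-b": &fakeDB{name: "tenant-b"},
	}
	// Real head compaction runs roughly every 2h; shortened here for the demo.
	scheduleCompactions(tenants, 5*time.Second, 2*time.Second)
	time.Sleep(12 * time.Second)
}
```

The jitter window just needs to be wide enough that the per-tenant write-lock periods no longer line up across tenants and receivers.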
