
receive: memory spikes during tenant WAL truncation #7255

Open
Nashluffy opened this issue Apr 2, 2024 · 5 comments

Comments

@Nashluffy

Thanos, Prometheus and Golang version used:

thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: GCS

What happened:
We have several Prometheus instances remote-writing to a set of 30 receivers. The receivers normally hover around 8 GiB of memory, but every 2 hours memory spikes by roughly 20-25% across all receivers at the same time.

[screenshot: receiver memory usage, 2024-04-02, spiking across all receivers every 2 hours]

And the corresponding WAL truncations across all receivers.

[graph: tenant WAL truncations across all receivers]

There are other memory spikes whose root cause I'm not certain of, like at 6:30 and 9:07. But looking at receiver memory usage over the past 2 weeks, there are consistent spikes whenever tenant WAL truncations happen.

What you expected to happen:
No memory spikes during WAL truncation, or the ability to stagger when truncation happens.

How to reproduce it (as minimally and precisely as possible):
Unsure; I'm running a fairly standard remote-write + receiver setup. I've raised this in the CNCF Slack, and at least one other person has observed the same memory spikes.

Full logs to relevant components:

Anything else we need to know:

@fpetkovski
Contributor

This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock and pending samples pile up in memory. @yeya24 do you see something similar in Cortex?
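As a rough mental model of that contention (a toy sketch only, not the actual Prometheus TSDB code): truncation holds the head's write lock while it runs, so concurrent remote-write appends block behind it, and the samples they carry sit in memory until the lock is released.

```go
// Toy illustration of the contention described above, not real TSDB code:
// truncation holds the write lock, so appenders queue up behind it.
package main

import (
	"fmt"
	"sync"
	"time"
)

type head struct {
	mtx sync.RWMutex
}

// append stands in for a remote-write appender; it needs the read lock.
func (h *head) append() {
	h.mtx.RLock()
	defer h.mtx.RUnlock()
	// per-series work would happen here
}

// truncate stands in for head compaction / WAL truncation; it needs the write lock.
func (h *head) truncate(d time.Duration) {
	h.mtx.Lock()
	defer h.mtx.Unlock()
	time.Sleep(d) // pretend we are dropping old chunks and truncating the WAL
}

func main() {
	h := &head{}

	go h.truncate(2 * time.Second)
	time.Sleep(100 * time.Millisecond) // let truncation grab the lock first

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			h.append() // blocks until truncation releases the lock
		}()
	}
	wg.Wait()
	fmt.Printf("appends waited ~%s behind truncation\n", time.Since(start).Round(time.Second))
}
```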

@Nashluffy
Author

> This seems to coincide with intervals when head compaction happens. I think this process acquires a write lock and pending samples pile up in memory. @yeya24 do you see something similar in Cortex?

Just confirming that the compactions happen at the same time as the memory spikes:

[graph: head compactions coinciding with the memory spikes]

@jnyi
Contributor

jnyi commented Apr 10, 2024

Did you get context deadline exceeded (500) errors from ingestors during the WAL compaction?

@GiedriusS
Member

Yeah, this optimization is something that needs to be done on the Prometheus side :/ I think this is the hot path: https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L1543-L1554

Some improvements that could be made IMHO:
prometheus/prometheus#13642
prometheus/prometheus#13632

@fpetkovski
Contributor

Cortex and Mimir solve this by adding jitter between compactions for different tenants. We can disable automatic compaction in the TSDB and manage it ourselves.
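A minimal sketch of how that scheduling could look, purely illustrative and not the actual receive code: the `tenantDB` interface below mirrors the relevant `*tsdb.DB` methods (`DisableCompactions`, `Compact`), whose exact signatures vary across Prometheus versions, and `scheduleCompactions` is a hypothetical helper.

```go
// Sketch: turn off the TSDB's automatic head compaction and trigger it per
// tenant on a jittered schedule, so tenants do not all truncate their WALs
// (and spike memory) at the same moment.
package main

import (
	"log"
	"math/rand"
	"time"
)

// tenantDB captures the two calls this sketch relies on. *tsdb.DB exposes
// similarly named methods, but their exact signatures differ across
// Prometheus versions, so a local interface is used here.
type tenantDB interface {
	DisableCompactions()
	Compact() error
}

// scheduleCompactions is a hypothetical helper: every interval it compacts
// each tenant's head, offset by a random delay within jitterWindow.
func scheduleCompactions(tenants map[string]tenantDB, interval, jitterWindow time.Duration) {
	for name, db := range tenants {
		db.DisableCompactions() // we own the compaction schedule from here on

		go func(name string, db tenantDB) {
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			for range ticker.C {
				// Per-tenant jitter spreads the write-lock windows apart.
				time.Sleep(time.Duration(rand.Int63n(int64(jitterWindow))))
				if err := db.Compact(); err != nil {
					log.Printf("tenant %s: head compaction failed: %v", name, err)
				}
			}
		}(name, db)
	}
}

// fakeDB is a stand-in tenant so the sketch runs without a real TSDB.
type fakeDB struct{ name string }

func (f *fakeDB) DisableCompactions() {}
func (f *fakeDB) Compact() error {
	log.Printf("tenant %s: compacting head", f.name)
	return nil
}

func main() {
	tenants := map[string]tenantDB{
		"tenant-a": &fakeDB{name: "tenant-a"},
		"tenant-b": &fakeDB{name: "tenant-b"},
	}
	// Real head compaction runs roughly every 2h; shortened here for the demo.
	scheduleCompactions(tenants, 5*time.Second, 2*time.Second)
	time.Sleep(12 * time.Second)
}
```

The jitter window just needs to be wide enough that the per-tenant write-lock periods no longer line up across tenants and receivers.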
