Backpressure: overly susceptible to temporary issues #70034

mwarkentin · 2024-05-01T14:51:23Z

Environment

Steps to Reproduce

Over the last 30 days, we've experienced ~265 instances where backpressure has been marked as unhealthy due to a connection timeout when checking the health of a Redis or RabbitMQ cluster: https://cloudlogging.app.goo.gl/KNZDAduqrHWQn5At7

Each of these comes with a corresponding pause and delay in ingestion.

A single timeout seems to trigger about 15s of ingestion latency.

There are also instances where multiple timeouts occur in succession, which seems to be enough to build up a backlog large enough that it may page SRE while the backlog burns down.

Expected Result

Some possible improvements we can make:

Add some retry functionality to avoid flakes
Require multiple events in a row to trigger the unhealthy state
Have backpressure fail open instead of closed (this could have a negative impact if the failures are caused by a real outage of a cluster).
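
To make the second option concrete, here is a minimal sketch of requiring several failed checks in a row before flipping to unhealthy. This is not the existing backpressure code; `ConsecutiveFailureGate`, `threshold`, and `record` are hypothetical names used only for illustration, and the threshold of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureGate:
    """Hypothetical helper: report unhealthy only after N failed checks in a row."""

    threshold: int = 3  # assumed value, would need tuning
    failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one check result; return True while the service counts as healthy."""
        if check_ok:
            # Any successful check resets the streak.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures < self.threshold


gate = ConsecutiveFailureGate(threshold=3)
# A single timeout no longer pauses ingestion...
print(gate.record(False))
# ...and a later success clears the streak.
print(gate.record(True))
```

The trade-off is detection latency: a real outage is only noticed after `threshold` check intervals.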

I would probably start with adding retries on failure as it seems like the simplest thing that can work.
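
A rough sketch of what retrying could look like, assuming the health check is a callable that raises on a timeout or connection error. The names `check_with_retries`, `attempts`, `delay_seconds`, and `retry_on` are made up for this example, and real Redis or RabbitMQ clients raise their own exception types, which would have to be passed in via `retry_on`.

```python
import time
from typing import Callable, Tuple, Type


def check_with_retries(
    check: Callable[[], None],
    attempts: int = 3,  # assumed value
    delay_seconds: float = 0.5,  # assumed value
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if `check` succeeds within `attempts` tries, else False."""
    for attempt in range(1, attempts + 1):
        try:
            check()
            return True
        except retry_on:
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts:
                sleep(delay_seconds)
    return False


def ping_cluster() -> None:
    """Stand-in for a real health check; always times out here."""
    raise TimeoutError("connection timed out")


# Only mark the cluster unhealthy once every attempt has failed.
healthy = check_with_retries(ping_cluster, attempts=3, delay_seconds=0.01)
```

With this shape, a one-off flake costs one short delay instead of a pause in ingestion, while a real outage still fails closed once all attempts are used up.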

Actual Result

Backpressure pauses ingestion from a single failure.

Product Area

Ingestion and Filtering

Link

No response

DSN

No response

Version

No response

loewenheim · 2024-05-02T09:47:38Z

I agree with adding retries. On the other hand, making a failed check not count as unhealthy sounds dicey to me, for the reason you mention.
