Steps to Reproduce
Over the last 30 days, we've experienced ~265 instances where backpressure has been marked as unhealthy due to a connection timeout when checking the health of a Redis or RabbitMQ cluster: https://cloudlogging.app.goo.gl/KNZDAduqrHWQn5At7
Each of these comes with a corresponding pause and delay in ingestion:
A single timeout seems to trigger about 15s of ingestion latency.
Several timeouts can also fire in succession, which seems to be enough to build a backlog large enough to page SRE while it burns down.
Expected Result
Some possible improvements we can make:
Add some retry functionality to avoid flakes
Require multiple events in a row to trigger the unhealthy state
Have backpressure fail open instead of closed (could have negative impact if the failures are caused by a real outage of a cluster).
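As a rough illustration of the "multiple events in a row" idea, here is a minimal sketch of a gate that only reports unhealthy after several consecutive failed checks. The class name `ConsecutiveFailureGate` and the threshold of 3 are hypothetical and are not taken from the backpressure code.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureGate:
    """Reports unhealthy only after `threshold` failed checks in a row."""

    threshold: int = 3  # assumed value, not taken from the issue
    failures: int = 0

    def record(self, check_ok: bool) -> bool:
        """Record one health-check result and return the effective health."""
        if check_ok:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
        return self.failures < self.threshold


gate = ConsecutiveFailureGate(threshold=3)
# An isolated timeout is absorbed; only three failures in a row flip the state.
results = [gate.record(ok) for ok in (True, False, True, False, False, False, True)]
```

The trade-off is slower detection of a real outage: with a threshold of N, ingestion keeps running for N check intervals before it is paused.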
I would probably start with adding retries on failure as it seems like the simplest thing that can work.
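A retry wrapper along these lines might look like the sketch below. It assumes the health check raises on failure; the function name `check_with_retries`, the attempt count, and the delay are hypothetical, and the built-in `TimeoutError` and `ConnectionError` stand in for whatever exception types the Redis or RabbitMQ client actually raises.

```python
import time
from typing import Callable


def check_with_retries(
    check: Callable[[], None],
    attempts: int = 3,
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if `check` succeeds within `attempts` tries.

    `check` is assumed to raise TimeoutError or ConnectionError on failure.
    """
    for attempt in range(1, attempts + 1):
        try:
            check()
            return True
        except (TimeoutError, ConnectionError):
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # linear backoff between tries
    return False


calls = {"n": 0}


def flaky_check() -> None:
    """Simulated check that times out once, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("simulated flake")


# The sleep is stubbed out so the example runs instantly.
healthy = check_with_retries(flaky_check, sleep=lambda _: None)
```

The total retry time should stay well below the health-check interval, otherwise the retries themselves add latency.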
Actual Result
Backpressure pauses ingestion after a single failed health check.
Product Area
Ingestion and Filtering
Link
No response
DSN
No response
Version
No response
Environment
SaaS (https://sentry.io/)