Retry backoff on scrape failure to reduce log spam when queries fail? #43
I think retry is a great idea. I've noticed that some key metric-collection queries fail sporadically in large-scale production environments, causing monitoring graphs to show short blips. The rough idea is to add a retry field to the Collector, allowing 1-3 retries. For system-level critical metrics we could default to 1-2 retries with a fixed interval (say 100ms?) to avoid the blips caused by those sporadic errors. Of course, retries should only happen when the query fails with specific retryable errors.

As for exponential backoff: given that our current metric scraping runs linearly over a single connection, it might not be particularly beneficial. Maybe we can use a fixed 100ms as a starting point?

These are just rough thoughts. I'm currently adapting the monitoring metrics for PostgreSQL 17, so if you're interested in adding any implementation, I'd be more than happy to collaborate.
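For illustration, here is a minimal Go sketch of what that per-collector retry could look like. The `Retries` and `RetryDelay` fields and the `retryable()` check are hypothetical placeholders, not an existing schema; a real patch would classify errors via the driver's error codes rather than retrying everything:

```go
package exporter

import (
	"context"
	"time"
)

// Collector sketches per-query retry settings; the field names
// (Retries, RetryDelay) are hypothetical, not an existing schema.
type Collector struct {
	Name       string
	SQL        string
	Retries    int           // e.g. 1-2 for critical system metrics
	RetryDelay time.Duration // fixed interval, e.g. 100 * time.Millisecond
}

// retryable decides whether a failure is worth retrying within one scrape.
// A real patch would inspect the driver's error codes (serialization
// failure, connection reset, ...) instead of this blanket placeholder.
func retryable(err error) bool {
	return err != nil
}

// ScrapeWithRetry runs the query once, then retries transient failures
// up to c.Retries times with a fixed delay between attempts.
func (c *Collector) ScrapeWithRetry(ctx context.Context, run func(context.Context) error) error {
	var err error
	for attempt := 0; attempt <= c.Retries; attempt++ {
		if err = run(ctx); err == nil || !retryable(err) {
			return err
		}
		select {
		case <-time.After(c.RetryDelay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```

With 1-2 retries at a fixed 100ms this adds at most a few hundred milliseconds to a scrape, so it shouldn't noticeably affect scrape duration.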
Interesting - you've described a different problem, where a query sporadically fails and you want to retry it within one scrape. I'm not concerned with metric gaps here; rather, I'm concerned about a failing query spamming the log with verbose errors, especially if a configuration is deployed with a query that happens to break on some specific combination of Postgres version, configuration, etc.

That's why I propose an exponential backoff where the query is temporarily disabled (skipped) in future scrapes for some time period. If it's a transient failure, the query will resume scraping later. If it's a persistent failure such as a query bug, the query will eventually run only very occasionally and log an error, so it's still possible to see why it isn't being collected, but it won't spam the log.

To do this I'm thinking of adding another non-persistent …
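To make the back-off idea concrete, here's a rough sketch of the kind of non-persistent, in-memory state this could track per query. All names and the 30s/30min bounds are made-up illustrations, not a proposal for specific values:

```go
package exporter

import (
	"sync"
	"time"
)

const (
	baseDelay = 30 * time.Second // first skip window after a failure
	maxDelay  = 30 * time.Minute // ceiling so a broken query still runs occasionally
)

// backoffState is non-persistent, in-memory state kept per query;
// it resets whenever the exporter restarts. All names here are illustrative.
type backoffState struct {
	mu       sync.Mutex
	failures int
	nextTry  time.Time
}

// ShouldSkip reports whether the query is still inside its backoff window,
// so the scrape loop can skip it without logging another error.
func (b *backoffState) ShouldSkip(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return now.Before(b.nextTry)
}

// RecordFailure doubles the skip window (up to maxDelay) after each
// consecutive failure, so a persistently broken query only runs, and
// logs, very occasionally.
func (b *backoffState) RecordFailure(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures++
	shift := b.failures - 1
	if shift > 6 { // 30s << 6 = 32min, already past the cap
		shift = 6
	}
	delay := baseDelay << shift
	if delay > maxDelay {
		delay = maxDelay
	}
	b.nextTry = now.Add(delay)
}

// RecordSuccess clears the backoff so the query resumes normal scraping.
func (b *backoffState) RecordSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures = 0
	b.nextTry = time.Time{}
}
```

The scrape loop would check `ShouldSkip` before running a query and call `RecordFailure`/`RecordSuccess` afterwards, so a transient failure resumes quickly while a persistently broken query ends up running (and logging) only every half hour or so.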
Do you have any opinion on the idea of adding exponential back-off when retrying failed metric scrapes, to reduce log-spam in case of problems?

If it's an idea you're open to, I can look at cooking up a patch to support it, assuming my initial PoC of this scraper works out.

CloudNative-PG's built-in scraper, which I currently use, doesn't do this either, but log-spam is a real problem with it if there's a mistake in a query. So it's something I'd like to see if I can implement here if I adopt this scraper.