
Retry backoff on scrape failure to reduce log spam when queries fail? #43

Open
ringerc opened this issue Jun 11, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

Collaborator

ringerc commented Jun 11, 2024

Do you have any opinion on the idea of exponential back-off when retrying failed metric scrapes, to reduce log spam in case of problems?

If it's an idea you're open to, I can look at cooking up a patch to support it if my initial PoC of this scraper works out.

CloudNative-PG's built-in scraper, which I currently use, doesn't do this either. But log spam is a real problem with it if there's a mistake in a query, so it's something I'd like to implement here if I adopt this scraper.

Collaborator Author

ringerc commented Jun 11, 2024

See also #41, #42

Owner

Vonng commented Jun 15, 2024

I think retry is a great idea. I've noticed that some key metric collection queries fail sporadically in large-scale production environments, causing short blips in monitoring graphs.

The rough idea is to add a retry field to the Collector, allowing 1-3 retries. For system-level critical metrics, we could set a default of 1-2 retries with a fixed interval (let's say 100ms?) to avoid the blips caused by sporadic errors. Of course, retries should only happen if the query fails with a specific retryable error.

As for exponential backoff, given that our current scraping method executes queries sequentially over a single connection, it might not be particularly beneficial. Maybe we can use a fixed value of 100ms as a starting point?
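As a rough illustration of the fixed-interval retry described above, here is a minimal Go sketch. It is not the exporter's actual code: `runWithRetry`, `errTransient`, and `errSyntax` are hypothetical names, and classifying retryable errors by wrapping a sentinel error is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errTransient = errors.New("transient failure") // worth retrying
	errSyntax    = errors.New("syntax error")      // retrying cannot help
)

// runWithRetry runs query once and then retries it up to `retries` more
// times, sleeping `interval` between attempts. It retries only when the
// error wraps errTransient; any other error is returned immediately.
// It returns the number of attempts made and the last error.
func runWithRetry(query func() error, retries int, interval time.Duration) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		err = query()
		if err == nil || !errors.Is(err, errTransient) || attempts > retries {
			return attempts, err
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// A query that fails once with a retryable error, then succeeds.
	flaky := func() error {
		calls++
		if calls < 2 {
			return fmt.Errorf("connection reset: %w", errTransient)
		}
		return nil
	}
	attempts, err := runWithRetry(flaky, 2, 100*time.Millisecond)
	fmt.Println("flaky query: attempts =", attempts, "err =", err)

	// A query with a non-retryable error is attempted only once.
	attempts, err = runWithRetry(func() error { return errSyntax }, 2, 100*time.Millisecond)
	fmt.Println("broken query: attempts =", attempts, "err =", err)
}
```

Because the retries happen inside a single scrape, the interval and retry count need to stay small relative to the scrape timeout.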

These are just my rough thoughts. I'm currently adapting the monitoring metrics for PostgreSQL 17, so if you're interested in adding any implementation, I'd be more than happy to collaborate.

Vonng added the enhancement label Jun 15, 2024
Collaborator Author

ringerc commented Jun 17, 2024

Interesting - you've described a different problem, where a query sporadically fails but you want to retry it within one scrape.

I'm not concerned with metric gaps here. Rather, I'm concerned about a failing query spamming the log with verbose errors, especially if a configuration is deployed with a query that happens to break on some specific combination of postgres version, configuration, etc.

That's why I propose an exponential backoff where the query is temporarily disabled (skipped) in future scrapes for some time period. If it's a transient failure, the query will resume scraping later. If it's a persistent failure such as a query bug, the query will eventually run only very occasionally and log its error, so it's still possible to see why it's not being collected, but it won't spam the log.

To do this I'm thinking of adding another non-persistent Query field holding a back-off duration. On the first scrape error, the field is set to an initial back-off value taken from the query config, falling back to a global value supplied by env-var or CLI argument if unset. On each subsequent scrape error, the back-off is multiplied by (2 ± a small random value). On every scrape, the back-off is checked against the last-executed timestamp: if we're still within the back-off period, the query is skipped, and a new per-query metric indicates the skip. Once the back-off period has passed, the query is re-run. On failure the back-off is increased as described; on success it is reset to 0.
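The back-off bookkeeping described above could look roughly like the following Go sketch. It only illustrates the proposal and is not code from the exporter: the `queryBackoff` type, its fields, and `errQuery` are hypothetical names, and the `max` cap is an added assumption that the proposal does not mention.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errQuery is a hypothetical stand-in for any scrape error.
var errQuery = errors.New("query failed")

// queryBackoff is hypothetical in-memory (non-persistent) per-query state.
type queryBackoff struct {
	initial time.Duration // back-off applied after the first failure
	max     time.Duration // optional cap; an assumption, 0 disables it
	jitter  float64       // multiplier is drawn from [2-jitter, 2+jitter]
	current time.Duration // current back-off; 0 means not backing off
	lastRun time.Time     // when the query was last executed
}

// shouldSkip reports whether the query is still inside its back-off window.
func (b *queryBackoff) shouldSkip(now time.Time) bool {
	return b.current > 0 && now.Sub(b.lastRun) < b.current
}

// record updates the state after the query has actually been executed.
func (b *queryBackoff) record(now time.Time, err error) {
	b.lastRun = now
	if err == nil {
		b.current = 0 // success resets the back-off
		return
	}
	if b.current == 0 {
		b.current = b.initial // first failure
		return
	}
	// Subsequent failure: multiply by 2 plus or minus a small random value.
	factor := 2 + b.jitter*(2*rand.Float64()-1)
	b.current = time.Duration(float64(b.current) * factor)
	if b.max > 0 && b.current > b.max {
		b.current = b.max
	}
}

func main() {
	b := &queryBackoff{initial: 30 * time.Second, max: 10 * time.Minute, jitter: 0.1}
	start := time.Now()
	// Simulate 20 scrapes, 15 seconds apart, of a query that always fails.
	for i := 0; i < 20; i++ {
		now := start.Add(time.Duration(i) * 15 * time.Second)
		if b.shouldSkip(now) {
			continue // a real exporter would bump a "skipped" metric here
		}
		b.record(now, errQuery)
		fmt.Printf("scrape %2d: query ran and failed, next back-off %v\n", i, b.current)
	}
}
```

In this sketch a persistently failing query logs at geometrically growing intervals rather than on every scrape, and a single success returns it to normal scheduling.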

ringerc changed the title from "Retry backoff on scrape failure?" to "Retry backoff on scrape failure to reduce log spam when queries fail?" Jun 17, 2024