Retry backoff on scrape failure to reduce log spam when queries fail? #43
I think retry is a great idea. I've noticed that some key metric-collection queries fail sporadically in large-scale production environments, causing monitoring graphs to show short blips. The rough idea is to add a retry field to the Collector, allowing 1-3 retries. For system-level critical metrics we could default to 1-2 retries with a fixed interval (say 100ms?) to avoid the blips caused by those sporadic errors. Of course, retries should only happen when the query fails with specific retryable errors.

As for exponential backoff: given that our current metric scraping runs linearly over a single connection, it might not be particularly beneficial. Maybe we can use a fixed 100ms as a starting point?

These are just rough thoughts. I'm currently adapting the monitoring metrics for PostgreSQL 17, so if you're interested in adding any implementation, I'd be more than happy to collaborate.
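For illustration, here is a minimal Go sketch of what that per-collector retry could look like. The `Retries` and `RetryDelay` fields and the `retryable()` check are hypothetical placeholders, not an existing schema; a real patch would classify errors via the driver's error codes rather than retrying everything:

```go
package exporter

import (
	"context"
	"time"
)

// Collector sketches per-query retry settings; the field names
// (Retries, RetryDelay) are hypothetical, not an existing schema.
type Collector struct {
	Name       string
	SQL        string
	Retries    int           // e.g. 1-2 for critical system metrics
	RetryDelay time.Duration // fixed interval, e.g. 100 * time.Millisecond
}

// retryable decides whether a failure is worth retrying within one scrape.
// A real patch would inspect the driver's error codes (serialization
// failure, connection reset, ...) instead of this blanket placeholder.
func retryable(err error) bool {
	return err != nil
}

// ScrapeWithRetry runs the query once, then retries transient failures
// up to c.Retries times with a fixed delay between attempts.
func (c *Collector) ScrapeWithRetry(ctx context.Context, run func(context.Context) error) error {
	var err error
	for attempt := 0; attempt <= c.Retries; attempt++ {
		if err = run(ctx); err == nil || !retryable(err) {
			return err
		}
		select {
		case <-time.After(c.RetryDelay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```

With 1-2 retries at a fixed 100ms this adds at most a few hundred milliseconds to a scrape, so it shouldn't noticeably affect scrape duration.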
Interesting - you've described a different problem, where a query sporadically fails and you want to retry it within one scrape. I'm not concerned with metric gaps here; rather, I'm concerned about a failing query spamming the log with verbose errors, especially if a configuration is deployed with a query that happens to break on some specific combination of Postgres version, configuration, etc.

That's why I propose an exponential backoff where the query is temporarily disabled (skipped) in future scrapes for some time period. If it's a transient failure, the query will resume scraping later. If it's a persistent failure such as a query bug, the query will eventually run only very occasionally and log an error, so it's still possible to see why it isn't being collected, but it won't spam the log.

To do this I'm thinking of adding another non-persistent …
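To make the back-off idea concrete, here's a rough sketch of the kind of non-persistent, in-memory state this could track per query. All names and the 30s/30min bounds are made-up illustrations, not a proposal for specific values:

```go
package exporter

import (
	"sync"
	"time"
)

const (
	baseDelay = 30 * time.Second // first skip window after a failure
	maxDelay  = 30 * time.Minute // ceiling so a broken query still runs occasionally
)

// backoffState is non-persistent, in-memory state kept per query;
// it resets whenever the exporter restarts. All names here are illustrative.
type backoffState struct {
	mu       sync.Mutex
	failures int
	nextTry  time.Time
}

// ShouldSkip reports whether the query is still inside its backoff window,
// so the scrape loop can skip it without logging another error.
func (b *backoffState) ShouldSkip(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return now.Before(b.nextTry)
}

// RecordFailure doubles the skip window (up to maxDelay) after each
// consecutive failure, so a persistently broken query only runs, and
// logs, very occasionally.
func (b *backoffState) RecordFailure(now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures++
	shift := b.failures - 1
	if shift > 6 { // 30s << 6 = 32min, already past the cap
		shift = 6
	}
	delay := baseDelay << shift
	if delay > maxDelay {
		delay = maxDelay
	}
	b.nextTry = now.Add(delay)
}

// RecordSuccess clears the backoff so the query resumes normal scraping.
func (b *backoffState) RecordSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures = 0
	b.nextTry = time.Time{}
}
```

The scrape loop would check `ShouldSkip` before running a query and call `RecordFailure`/`RecordSuccess` afterwards, so a transient failure resumes quickly while a persistently broken query ends up running (and logging) only every half hour or so.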
Do you have any opinion on the idea of adding exponential back-off when retrying failed metric scrapes, to reduce log-spam in case of problems?

If it's an idea you're open to, I can look at cooking up a patch to support it, assuming my initial PoC of this scraper works out.

CloudNative-PG's built-in scraper, which I currently use, doesn't do this either, but log-spam is a real problem with it if there's a mistake in a query. So it's something I'd like to see if I can implement here if I adopt this scraper.