influxdb_v2_listener rate limit #15340
Comments
Hi, happy to see a PR for this. However, it absolutely must be opt-in, to avoid changing the behavior for users who do not expect it.
@powersj will this actually work the way I think it will? Is it possible to send a 204 unless the max undelivered messages limit is hit? I am looking through the tracking code, but I am not 100% sure whether I understand it correctly.
I wrote my message to you and then I stopped and wondered, while trying to wake up, why we had not done this before. With the other plugins that use tracking metrics, the sender has some sort of QoS or other mechanism that keeps track of messages. This allows Telegraf to call back once the metric is successfully sent to the output. In the case of the influxdb listener, you are essentially an HTTP client sending data. We can't easily wait until the flush interval and send the 204 once the data has been flushed. In that sense this really isn't going to work. We can keep track of how many metrics we have and return a 429, but we aren't really guaranteeing that the metrics are arriving. I am more inclined to point at Dane's WAL PR as a possible solution for you, to avoid buffer overflows and get greater durability. Or consider an aggregator to reduce the total number of metrics, if possible.
Yeah, I think this is not really what tracking metrics are for. Some kind of back pressure mechanism would be better suited. I think the issue should stay open, but it should not be implemented using tracking metrics. Are there any plans for this? It seems like it would be a lot of work to add.
I would like to have a set of next steps then. If we did add some sort of generic rate limit to the input, why, in general, would a user artificially limit the number of metrics Telegraf could handle? For your very specific use case, I do think the real answer is using some other buffer mechanism (i.e. #802 via a WAL) rather than trying to force the issue onto the input.
I think backpressure is different from a WAL buffer in that it can also be used when Telegraf processes a batch of metrics. Aside from my use case, another example might be using inputs.file or inputs.multifile to read a large file (on the order of gigabytes). Reading the entire file into memory and then storing the metrics again in a buffer file uses a lot of disk space and memory that is not needed if you process the file part by part, rate limited by the backpressure.
How is that different from what we do today? The key difference is that your use case is memory bound, right? In that case, our proposed solution is to store the metrics somewhere other than memory, namely the WAL.
Telegraf can do this today, again assuming you have memory available. My personal preference in these situations, where you are also limited by buffer size, is that Telegraf is not the right answer. Instead, you should be using something else to read and batch up the file or do the parsing for you. For example, in the case of InfluxDB, using the client libraries tends to be a much better and more durable solution.
That would solve the problem with regard to memory, but it still roughly doubles the disk I/O and space usage until all the metrics have been flushed. In the case of parsing a large file, it would be more efficient to parse the original file part by part. By using backpressure, you can control how fast the data is read.
Interesting. I actually had a strong preference to use Telegraf. We already use it as an ETL tool so we can transform data coming from all kinds of systems into a standard format.

Another possible application
Another plugin that might benefit from backpressure is the inputs.socket_listener plugin, which listens on a TCP socket. TCP has flow control: once backpressure is available in Telegraf, the inputs.socket_listener plugin could slow down the TCP ACK packets, causing the connection to slow down.

Another possible implementation
Looking at the inputs.tail plugin, it looks like back pressure is already implemented there using the tracking metrics. The only reason why it wouldn't work for …

Another reasonable implementation might be to send a response immediately. If the …
Exactly. To get around this, what we have used with customers before is to put Kafka between the source and Telegraf. Then they get the use of tracking metrics and a way to slow things down for Telegraf. I'm not sure that helps in your situation if you are resource constrained, as it just adds one more place for metrics to pile up. (edit: finished the thought in the last sentence.)
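For illustration only, a minimal sketch of the Kafka-in-the-middle idea described above, assuming a local broker and placeholder topic and group names (none of these values come from this issue). The kafka_consumer input uses tracking metrics, and its max_undelivered_messages setting pauses consumption while that many metrics are still waiting to be delivered to the outputs, which is what provides the back pressure:

```toml
# Hypothetical example: Telegraf consuming from Kafka instead of an HTTP listener.
[[inputs.kafka_consumer]]
  ## Placeholder broker and topic; adjust to your environment.
  brokers = ["localhost:9092"]
  topics = ["metrics"]
  consumer_group = "telegraf_metrics_consumers"
  ## Stop reading from Kafka once this many metrics are still awaiting
  ## delivery to the outputs.
  max_undelivered_messages = 1000
  ## Messages are expected in InfluxDB line protocol.
  data_format = "influx"
```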
I have created this implementation in #15361.
Use Case
We have InfluxDB OSS edge replication running over a satellite connection. The connection is not great, and the way the replication is implemented exacerbates the problem: edge replication sets up a new connection for every batch that is written to the replicated bucket, so every batch pays the connection setup cost again over the high-latency link.
When we write a lot of small batches, the replication slows down tremendously and we struggle to keep up.
I was planning to work around this by pointing the edge replication at a local Telegraf instance: inputs.influxdb_v2_listener receives the replicated writes, and outputs.influxdb_v2 forwards them over the satellite link to the remote InfluxDB.
This way, Telegraf takes care of increasing the batch sizes. In addition, outputs.influxdb_v2 reuses the TCP connection, so the setup time will become much less of a problem.
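A minimal sketch of what such a configuration could look like, assuming placeholder addresses, credentials, and buffer sizes (none of these values come from the issue itself):

```toml
# Hypothetical workaround configuration: edge replication writes to the local
# listener, Telegraf batches the metrics and forwards them over the satellite link.
[agent]
  ## Larger batches and a generous in-memory buffer; tune to the link.
  metric_batch_size   = 5000
  metric_buffer_limit = 100000
  flush_interval      = "10s"

[[inputs.influxdb_v2_listener]]
  ## Edge replication is pointed at this local address.
  service_address = ":8086"

[[outputs.influxdb_v2]]
  ## Remote InfluxDB on the far side of the satellite link (placeholder URL).
  urls         = ["https://remote.example.com:8086"]
  token        = "$INFLUX_TOKEN"
  organization = "example-org"
  bucket       = "replicated-bucket"
```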
Expected behavior
What is missing for this to work 100% of the time is rate limiting on the inputs.influxdb_v2_listener side. If the satellite connection is completely gone, which happens every so often, I don't want the InfluxDB replication to keep pushing metrics into Telegraf while Telegraf cannot write them to the destination. This results in a buffer overflow and dropped metrics.
I think some kind of back pressure mechanism is needed for this to work. The plugin should return a 429 status to slow the writer down when the buffer reaches its limit.
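As a sketch only, the opt-in setting requested here might look something like the following. The option name is hypothetical and used purely for illustration; it is not an existing influxdb_v2_listener setting, and it does not necessarily match what #15361 implements:

```toml
[[inputs.influxdb_v2_listener]]
  service_address = ":8086"
  ## Hypothetical, illustrative option (not an existing setting): once this
  ## many received metrics are still waiting to be delivered to the outputs,
  ## respond to writes with HTTP 429 instead of 204 so the writer backs off.
  # max_undelivered_metrics = 10000
```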
Actual behavior
Telegraf drops metrics when the buffer overflows.
Additional info
No response