(best way to) estimate event size #19781

srstrickland · 2024-02-03T16:36:18Z


Recently we had a rogue process start spamming megabyte-sized log lines, and needless to say, bad things started happening. We installed a quick fix at the agent level to drop anything whose .message field exceeds some threshold (via strlen), and it got us out of the woods. In practice, checking the message field may be sufficient, since all log messages start there no matter the source. But we have a distributed topology and don't control all the agents (or their configs) in our ecosystem, so I need something similar enforced at the "central" layer, which receives data from all the various agents. By that point, some upstream agent may already have parsed a gigantic message into other fields, so checking only the message field may be insufficient.

The simplest way I've found to estimate payload size is to encode the event as JSON and take a strlen. Obviously this is an overestimate, but since a lot of data is serialized as JSON on the way out, it's not unreasonable. Still, adding a full serialization just to count the bytes feels like overkill, and I'm wondering if there's a better way. I couldn't find anything in the VRL docs, so at the moment this is all I have:

if (strlen(encode_json(.)) > some_threshold)  # could use length() too

Is there another option? Since Vector already does a lot of monitoring around bytes moving through the system, this might be an easy opportunity to expose that information via a function like estimate_size(<any>). It feels like this would be far more efficient than allocating memory for a string just so it can be counted.

Thanks in advance for any pointers!

jszwedko · 2024-02-05T21:11:17Z

Maintainer

Hey!

That would definitely be one way to do it, though, as you note, it would be expensive.

There is actually a trait on events in Vector that exposes a function to estimate the in-memory size of the event:

vector/lib/vector-common/src/byte_size_of.rs

Lines 12 to 32 in 0dce776

    pub trait ByteSizeOf {
        /// Returns the in-memory size of this type
        ///
        /// This function returns the total number of bytes that
        /// [`std::mem::size_of`] does in addition to any interior allocated
        /// bytes. It default implementation is `std::mem::size_of` +
        /// `ByteSizeOf::allocated_bytes`.
        fn size_of(&self) -> usize {
            mem::size_of_val(self) + self.allocated_bytes()
        }

        /// Returns the allocated bytes of this type
        ///
        /// This function returns the total number of bytes that have been allocated
        /// interior to this type instance. It does not include any bytes that are
        /// captured by [`std::mem::size_of`] except for any bytes that are interior
        /// to this type. For instance, `BTreeMap<String, Vec<u8>>` would count all
        /// bytes for `String` and `Vec<u8>` instances but not the exterior bytes
        /// for `BTreeMap`.
        fn allocated_bytes(&self) -> usize;
    }
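To make the `size_of` / `allocated_bytes` split concrete, here is a minimal standalone sketch. It redeclares the trait locally rather than importing it from Vector, and the impls for `String` and `BTreeMap` as well as the `exceeds_limit` helper are illustrative assumptions, not Vector's actual implementations.

```rust
use std::collections::BTreeMap;
use std::mem;

/// Local stand-in that mirrors the shape of the trait quoted above.
/// It is redeclared here so the sketch compiles without Vector's crates.
pub trait ByteSizeOf {
    fn size_of(&self) -> usize {
        mem::size_of_val(self) + self.allocated_bytes()
    }

    fn allocated_bytes(&self) -> usize;
}

// Assumed impl: a String's interior allocation is counted as its byte length.
impl ByteSizeOf for String {
    fn allocated_bytes(&self) -> usize {
        self.len()
    }
}

// Assumed impl: count every key and value in full, but not the map's own
// bookkeeping (the "exterior bytes" mentioned in the doc comment).
impl<K: ByteSizeOf, V: ByteSizeOf> ByteSizeOf for BTreeMap<K, V> {
    fn allocated_bytes(&self) -> usize {
        self.iter().map(|(k, v)| k.size_of() + v.size_of()).sum()
    }
}

/// Hypothetical helper: true when the estimated in-memory size of `event`
/// is larger than `max_bytes`.
pub fn exceeds_limit<T: ByteSizeOf>(event: &T, max_bytes: usize) -> bool {
    event.size_of() > max_bytes
}
```

With a map holding a single 1 KiB `message` string, `exceeds_limit(&event, 1024)` would return true, because the key, the value, and the struct sizes are all counted on top of the string's bytes. No serialization or extra allocation is needed to get the number.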

As well as another trait to estimate the encoded JSON size of the event:

vector/lib/vector-core/src/event/estimated_json_encoded_size_of.rs

Lines 43 to 45 in 0dce776

    pub trait EstimatedJsonEncodedSizeOf {
        fn estimated_json_encoded_size_of(&self) -> JsonSize;
    }
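The idea behind estimating the encoded size without serializing can be sketched as a recursive walk that adds up the length each value would take in JSON. Everything below is hypothetical: the `Value` enum is a simplified stand-in rather than Vector's event type, the function returns a plain `usize` instead of `JsonSize`, and string escaping is ignored, so strings containing quotes or control characters are undercounted.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for an event value (an assumption for this sketch).
pub enum Value {
    Null,
    Boolean(bool),
    Integer(i64),
    Bytes(String),
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Length of a JSON string: the bytes plus two quotes (escapes ignored).
fn string_len(s: &str) -> usize {
    s.len() + 2
}

/// Number of characters needed to print `n` in decimal, without allocating.
fn integer_len(n: i64) -> usize {
    let mut digits = 1;
    let mut rest = n.unsigned_abs();
    while rest >= 10 {
        rest /= 10;
        digits += 1;
    }
    if n < 0 { digits + 1 } else { digits }
}

/// Estimates the compact JSON length of `value` without building a string.
pub fn estimated_json_size(value: &Value) -> usize {
    match value {
        Value::Null => 4,
        Value::Boolean(true) => 4,
        Value::Boolean(false) => 5,
        Value::Integer(n) => integer_len(*n),
        Value::Bytes(s) => string_len(s),
        Value::Array(items) => {
            let body: usize = items.iter().map(estimated_json_size).sum();
            // Two brackets plus one comma between each pair of items.
            2 + body + items.len().saturating_sub(1)
        }
        Value::Object(fields) => {
            let body: usize = fields
                .iter()
                .map(|(k, v)| string_len(k) + 1 + estimated_json_size(v))
                .sum();
            // Two braces plus one comma between each pair of fields.
            2 + body + fields.len().saturating_sub(1)
        }
    }
}
```

For an object equivalent to `{"count":42,"message":"hello"}` this yields 30, which matches the length of that compact encoding. The walk touches every field, so it is still linear in event size, but it avoids allocating the encoded string.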

I think these would just need to be exposed in VRL by adding corresponding methods to the vrl::Target trait and then adding functions to the VRL stdlib that call them. I think we'd be happy to see a PR for this if you are interested.
