Issue with OOM Killer occurring when creating a massive stream #1002
Replies: 5 comments
-
Two areas are mentioned in your post:

1. To provide better durability and performance, OpenObserve stores incoming data in a WAL with minimal processing. This is very fast: it allows large amounts of data to be ingested quickly, much more than would be possible with real-time processing, and it absorbs short bursts of incoming data well. The WAL is currently stored as JSON files (we are working on changing the WAL to a more efficient format, which should be available in a subsequent release). The WAL also allows data to be batched before being pushed to object storage. Batching, conversion to parquet, compression, and moving data to the object store are comparatively more compute intensive and happen asynchronously. What you experienced was that you pushed a lot more data into OpenObserve than it could handle with the available hardware resources. OpenObserve tried to keep up using the WAL for quite a while but crumbled at some point. Disk speed can also hamper WAL creation and movement and can become a bottleneck. If performance is your priority, you can enable the memory-based WAL in OpenObserve, though that is a tradeoff of performance vs. durability. Search is done by loading data into memory for faster retrieval. As of v0.4.7 we try to use 50% of the machine's available memory for this (configurable by an environment variable). 50% of memory for search plus 50%+ for batching, converting, compressing, and moving data could have caused the OOM. We have made improvements in this area that should be available in the coming release.

2. Not really. You can have petabytes of data in a single stream. What matters is how much data you are processing at any given point in time.
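As a sketch of the knobs mentioned above, the memory-based WAL and the search-cache limit are controlled via environment variables. Treat the exact variable names and units below as assumptions to verify against the environment-variable reference for your OpenObserve version:

```shell
# Illustrative settings; confirm the names against the docs for your release.
ZO_WAL_MEMORY_MODE_ENABLED=true   # memory-based WAL: faster ingest, less durable on crash
ZO_MEMORY_CACHE_MAX_SIZE=2048     # cap the search cache (MB) below the 50%-of-RAM default
```

On a 4 GB machine, capping the cache leaves more headroom for the asynchronous batching/compression work and makes an OOM kill less likely.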
-
Thanks for the answer. The behavior was generally as expected. However, I'm curious what search speed depends on once a large amount of data has accumulated in a single stream. With 80 GB of stored data (3.5 GB compressed), searching across all records takes about 20-30 seconds. Can this be improved only by improving disk IOPS and CPU performance, or should the stream be divided? There seems to be a feature called a partitioning key, but since there is no documentation for it yet, I don't fully understand how it works. Judging from the structure of the data directory when partitioning is enabled, I assume it is a feature for speeding up searches, like the primary key in ClickHouse? Naturally, adding search conditions such as a time range improves speed.
-
This is correct. Search performance depends on the amount of data being searched. If you look at the way files are stored physically, you will notice that your search performance depends on how much data a given query scans. If you add a filter to search only a specific day, as opposed to the whole year, search will be faster because you are scanning less data. In organizations where a large amount of data is flowing in, you can create additional partitions, e.g. on host_name. This lets a condition such as host_name=host1 reduce the amount of data being scanned, which improves search performance. You can have multiple partitions. When creating partitions, make sure each data file does not become too small. Also, if you are using S3, you want to avoid too many small files, since you pay for each read and write, and reading a 1 KB file costs the same as reading a 100 MB file. As a general rule of thumb, try to keep each file between 5 and 15 MB.
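To make the effect concrete, here is a rough back-of-the-envelope sketch of how a partition filter shrinks the data a query has to scan. The layout and file sizes are made up for illustration; this is not OpenObserve internals:

```python
# Hypothetical physical layout: partition key host_name -> parquet file sizes (MB).
partitions = {
    "host1": [10, 12, 8],   # ~30 MB total
    "host2": [11, 9, 14],   # ~34 MB total
    "host3": [13, 10, 7],   # ~30 MB total
}

def scanned_mb(layout, host=None):
    """MB scanned by a full scan vs. a host_name-filtered query."""
    if host is None:                       # no filter: every partition is scanned
        return sum(sum(files) for files in layout.values())
    return sum(layout.get(host, []))       # filter: only one partition is scanned

full_scan = scanned_mb(partitions)          # scans all files: 94 MB
filtered = scanned_mb(partitions, "host1")  # host_name=host1 prunes to 30 MB
print(f"full scan: {full_scan} MB, host_name=host1: {filtered} MB")
```

The same pruning logic is why a time filter helps: the narrower the condition, the fewer files fall inside it, and the less data each query reads.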
-
Thank you, I understand the inner workings now. Sorry for all the questions, but one last thing.
-
Data is stored in memory as compressed bytes (the size of the actual files). E.g., you ingested 1 TB of logs, which is 30 GB compressed; the data stored in RAM will be 30 GB (depending on the amount of RAM you have; we use 50% of RAM as cache...). When you run a query like
-
I am currently testing how much log processing my low-spec machine can handle. The machine has 4 cores and 4 GB of memory, and I am using OpenObserve v0.4.7.
I sent approximately 10,000 messages per second and kept accumulating logs in a single stream. At around 80 GB of actual data (approximately 3.5 GB after compression), an OOM kill occurred when I attempted a search.
At that time the WAL had grown huge, and the process of flushing it to the stream did not seem to be working properly. Even after I stopped sending logs, the situation did not change.
Restarting the process improved things, but when I performed a search while data was being flushed from the WAL to the stream, the OOM kill occurred again. There have been no issues once the WAL is empty.
Is this the intended behavior? Also, is there a specific limit to the size of a single stream in relation to machine memory?
Note: The number of messages is 2.5 billion.