High memory consumption during long running jobs #3790
Comments
We are facing the same issue. When the number of steps in a job increases, it leads to an OOM, killing the manager JVM.
@PauliusPeciura Thank you for reporting this issue and for opening a PR! I would like to be able to reproduce the issue first in order to validate any fix. From your usage of

@ssanghavi-appdirect Yes. If we can reproduce the issue in a reliable manner, we will plan a fix for one of the upcoming releases.
@benas I am able to reproduce this. Scenario to reproduce: create a job with more than 900 remote partitions and wait for it to complete. Observe that the manager JVM fails with an OOM if -Xmx is set; otherwise, memory consumption keeps increasing. What is the most convenient way to share code that reproduces the issue?
Attaching a Spring Boot project that can reproduce the issue, together with steps to execute the program.
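For anyone who wants a quick way to reproduce this without the attached project, a manager step along the following lines is enough to generate a large number of partitions. This is only a sketch: the bean names, the 900 partition count, and the "workerStep" name are illustrative, and the remote-partitioning wiring over Spring Integration (channels, worker configuration, MessageChannelPartitionHandler) is assumed to be configured elsewhere.

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ReproPartitionConfig {

    // Manager step that fans out 900+ partitions to remote workers.
    @Bean
    public Step managerStep(StepBuilderFactory steps,
                            MessageChannelPartitionHandler partitionHandler) {
        return steps.get("managerStep")
                .partitioner("workerStep", gridSize -> {
                    Map<String, ExecutionContext> partitions = new HashMap<>();
                    for (int i = 0; i < 900; i++) { // enough partitions to trigger the issue
                        ExecutionContext context = new ExecutionContext();
                        context.putInt("partition.number", i);
                        partitions.put("partition" + i, context);
                    }
                    return partitions;
                })
                .partitionHandler(partitionHandler)
                .build();
    }
}
```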
Thank you all for your feedback here! This is a valid performance issue. There is definitely no need to load the entire object graph of step executions when polling the status of workers. Ideally, polling for running workers could be done with a single query, and once they are all done, we should grab shallow copies of step executions with the minimum required to do the aggregation. I will plan the fix for the upcoming 5.0.1 / 4.3.8.
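As a rough sketch of that idea (not the actual change that went into Spring Batch), a single count query against the standard BATCH_STEP_EXECUTION table is enough to know whether any worker is still running; the step-name pattern below only assumes the default "stepName:partitionName" naming of partitioned step executions.

```java
import org.springframework.jdbc.core.JdbcTemplate;

// Illustrative only: answer "are any workers still running?" with one query,
// instead of materializing every StepExecution on each polling tick.
public class WorkerPollingSketch {

    private final JdbcTemplate jdbcTemplate;

    public WorkerPollingSketch(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public boolean workersStillRunning(long jobExecutionId, String workerStepName) {
        Integer running = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION"
                        + " WHERE JOB_EXECUTION_ID = ?"
                        + " AND STEP_NAME LIKE ?"
                        + " AND STATUS NOT IN ('COMPLETED', 'FAILED', 'STOPPED', 'ABANDONED')",
                Integer.class, jobExecutionId, workerStepName + ":%");
        return running != null && running > 0;
    }
}
```

Once that check reports no running workers, the step executions can be loaded once for the final aggregation instead of on every poll.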
@fmbenhassine I'm afraid the issue is still present. I've checked the commit you made, but since it's still working with entities, the associations are still there. Here's a snapshot from a heap dump I've taken:
And here's the relevant stack trace showing where the objects are coming from:
Note: this specific job can run for hours and processes a lot of data (millions of records). When the number of partitions exceeds 500 (though that may not be the exact threshold), the manager slowly accumulates more and more memory.
@galovics Thank you for reporting this.
We will always work with entities according to the domain model. What we can do is reduce the number of entities loaded in memory to the minimum required. Before 93800c6, the code was loading job executions in a loop for every partitioned step execution, which is obviously not necessary. In your screenshot, I see you have several job instances in memory, which is not what I would expect. To correctly address any performance issue, we need to analyse the performance of a single job execution first. So I am expecting to see a single job execution in memory with a partitioned step. Once we ensure that a single partitioned execution is optimized, we can discuss whether the packaging/deployment pattern is suitable to run several job executions in the same JVM or not. Please open a separate issue and provide a minimal example so we can be sure we are addressing your specific issue, and we will dig deeper. Thank you upfront.
That's strange to me too. I re-read the Spring Batch docs on job instances to use the same terminology and understanding, and I can confirm there's a single job instance being run. In fact, it's the book example from the Spring Batch docs. I can even show you the code because the project is open-source.
Thank you for your feedback.
In that case, there should really be a single job execution in memory. As mentioned previously, since this issue has been closed and assigned to a release, please open a separate one with all these details and I will take a look. Thank you upfront.
We have the same problem. I modified PR #3791 so it can be merged to the main branch.
@hpoettker our problem is that we have thousands of steps, and all of them are loaded in memory every time the step results are checked. This leads to an OutOfMemoryError. Your fix doesn't change this behavior and can't resolve the problem. The fix from @galovics doesn't load all the steps, but gets the count of incomplete steps from the database. It works much faster and consumes much less memory.
Bug description
We found that memory consumption is fairly high on one of the service nodes that uses Spring Batch. Even though both nodes did a similar amount of work, the memory consumption across nodes was not even: 15 GB vs 1.5 GB (see the memory usage screenshot).
We have some jobs that can run for seconds while others might run for hours, so we set the polling interval (MessageChannelPartitionHandler#setPollInterval) to 1 second rather than the default of 10 seconds. In a long-running job scenario, we ended up creating 837 step executions.
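For context, the relevant piece of our configuration looks roughly like the following; the bean names, the "workerStep" name, and the grid size are illustrative, and the rest of the remote-partitioning wiring (request channel, worker side) is omitted.

```java
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.core.MessagingTemplate;

@Bean
public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                       JobExplorer jobExplorer) {
    MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
    handler.setMessagingOperations(messagingTemplate); // sends partition requests to the workers
    handler.setJobExplorer(jobExplorer);               // polls the job repository for worker status
    handler.setStepName("workerStep");                 // worker step name (illustrative)
    handler.setGridSize(837);                          // number of partitions in the long-running job
    handler.setPollInterval(1000L);                    // 1 second instead of the 10 second default
    return handler;
}
```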
What I found is that MessageChannelPartitionHandler#pollReplies gets a full StepExecution representation for each step, and each of those contains a JobExecution, which in turn contains StepExecutions of its own. Because they are retrieved at different times and stages, we end up with a quadratic number of StepExecution objects, e.g. 837 * 837 = 700,569 StepExecutions (see the screenshot below).
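To make the arithmetic concrete, the retained object graph behaves roughly as the following helper suggests (purely illustrative, but it only uses the public StepExecution/JobExecution accessors):

```java
import java.util.Collection;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;

public class RetainedGraphSketch {

    // Each polled StepExecution carries its own JobExecution, and that JobExecution
    // holds its own copies of all sibling StepExecutions. Because they are fetched
    // at different times, nothing is shared, so the retained count grows roughly
    // quadratically with the number of partitions.
    static long retainedStepExecutions(Collection<StepExecution> polledWorkers) {
        long retained = 0;
        for (StepExecution worker : polledWorkers) {             // e.g. 837 workers
            JobExecution jobExecution = worker.getJobExecution();
            retained += jobExecution.getStepExecutions().size(); // ~837 more each
        }
        return retained; // ~837 * 837 = 700,569 for the job described above
    }
}
```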
Environment
Initially reproduced on Spring Batch 4.1.4.
Expected behavior
My proposal would be to:
Memory usage graph comparison between the two service nodes, which are doing a roughly equal amount of work:
My apologies for a messy screenshot, but it does explain the number of StepExecution objects: