Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poll workers of a partitioned step with a database query #4705

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Ichanskiy
Copy link

@Ichanskiy Ichanskiy commented Nov 12, 2024

Fix OutOfMemory issue by optimizing step result checks. Previously, thousands of StepExecutions were loaded into memory every 10000 ms, causing memory overload. There’s no point in loading all StepExecutions into memory to check the running status at short intervals.

private long pollInterval = 10000;
Poller<Set<StepExecution>> poller = new DirectPoller<>(pollInterval);
Future<Set<StepExecution>> resultsFuture = poller.poll(callback);

We've had this code in production for over a year now, and we no longer get any OutOfMemory errors — it works! However, with each Spring Boot upgrade in our services, we have to patch Spring Batch with this fix, which has been a constant inconvenience for us every time to make the .patch , deploy the artifact etc.

That's why we've created this pull request, which essentially addresses the same problem as PR 3791. Although that PR was approved, it cannot be merged due to conflicts. @fmbenhassine asked that the conflicts be resolved here , but this was not done. In my pull request, I've included the code with the latest changes

I would greatly appreciate it if you could review my pull request, so we can merge it and permanently resolve the OutOfMemory issue.

Closes #3790

@hpoettker
Copy link
Contributor

hpoettker commented Nov 13, 2024

Have you tried Spring Batch 5.1.2 or Spring Batch 5.0.6? It contains a fix for #4598, which looks like the same as #3790 to me.

It would be very helpful to see how the fix for #4598 affects your application. Does it solve your problem? Does it improve the situation a little? Does it not help at all?

@Ichanskiy
Copy link
Author

Have you tried Spring Batch 5.1.2 or Spring Batch 5.0.6? It contains a fix for #4598, which looks like the same as #3790 to me.

It would be very helpful to see how the fix for #4598 affects your application. Does it solve your problem? Does it improve the situation a little? Does it not help at all?

The main problem is that all steps are loaded into memory to check the step result. If there are many steps (we have thousands), this leads to an OutOfMemory error.
The fix you mentioned doesn't solve our problem because all steps are still loaded into memory. My fix (or 3791) doesn’t load all the steps; instead, it just retrieves the count of incomplete steps from the database. This works much faster and doesn't consume a lot of memory.

@hpoettker
Copy link
Contributor

hpoettker commented Nov 13, 2024

I feel that your response is evading a reply to my question whether you have tried the most recent fix. Please give the fix in Spring Batch 5.1.2 or 5.0.6 an honest try, if you haven't done so. I'd really appreciate the feedback.

You are right that the implementation calls JobExplorer::getJobExecution, which loads all step executions. But in the current implementation the result from that call will be completely garbage collected after each non-final poll. So this does not lead to the memory requirements that previous implementations required. Only enough heap space for loading all step executions once is required.

The reason I'm persisting is this: Your proposed implementation of MessageChannelPartitionHandler still loads basically all step executions into memory at the end through your proposed new method getStepExecutions. This means you should already have sufficient heap space to load all step executions in your application.

@Ichanskiy
Copy link
Author

Perhaps I didn’t express myself entirely clearly. The main problem is that all steps are loaded into memory to check the step result every 10,000 ms.

private long pollInterval = 10000;
Poller<Set<StepExecution>> poller = new DirectPoller<>(pollInterval);
Future<Set<StepExecution>> resultsFuture = poller.poll(callback);

In other words, this is done regularly at short intervals. With large long-running jobs, we encounter an OOM error because the memory keeps filling up with these steps every 10,000 ms.

The reason I'm persisting is this: Your proposed implementation of MessageChannelPartitionHandler still loads basically all step executions into memory at the end through your proposed new method getStepExecutions. This means you should already have sufficient heap space to load all step executions in your application.

You’re right, we have enough memory to load all steps from the database if it’s done as a one-time action (as in my fix). But in the current implementation, we do this with a pollInterval = 10,000 ms, and garbage collection can't keep up. This has been verified experimentally. As soon as we stop constantly loading everything into memory while the job is running, we no longer have memory issues. This code has been working in production for over a year.

@hpoettker
Copy link
Contributor

At the end of the day, I'm not the final arbiter of this discussion as I'm just a contributor and not a maintainer of the project. So feel free to discard what I'm saying.

The reproducing example from #4598 reduced the poll interval from the default 10,000 ms to just 2 ms. And the fix that is included in the latest Spring Batch 5.x releases has been successfully tested against this reproducer.

I understand that you are very much invested in your proposed solution. But it's a breaking change in public APIs, which would potentially lead to migration efforts for other community members. I'd still be happy to support you, if you publish a minimal complete reproducible example against the latest 5.x releases.

@fmbenhassine
Copy link
Contributor

@Ichanskiy @hpoettker Thank you both for your constructive feedback towards each other comments! It is really a pleasure to see such interesting discussions in a constructive way.

I understand the issue and my opinion is that the final arbiter is pragmatism and common sense. As mentioned in my previous comment #3790 (comment), any computation that can be done at the database level should be done at the database level, unless there is a very good reason not to do so.

I think it is important to consider the past context in order to explain the reason behind previous fixes and also to make the best decision now. When the polling performance issue was first reported, we did our best to fix the memory leak in a backward compatible way in 93800c6 (Many thanks to @hpoettker for that!). At that time, we had to do it that way because we were required to backport the patch to 5.0.x. A change set like in this PR is clearly breaking and cannot be considered as such in a patch version, not even in a minor version (exceptions are accepted if we manage to have default implementations of new methods added in public interfaces).

That said and based on facts and numbers, I believe that the performance issue was resolved for a single partitioned step. Now if someone runs a thousand partitioned steps in the same JVM with a short polling interval, I think everyone would agree that this becomes out of the scope of Spring Batch. So it seems like the OOM discussed here is due to the deployment anti-pattern of running several batch jobs in the same JVM.

Again, Ideally the count should be done with a single query on the db side, and we are not there yet as of 5.1.2. While this PR goes in the right direction, it still not the optimal way (in my opinion) to achieve that. Why is there a new method to get step executions by a given list of statuses and not something like JobExplorer#findRunningStepExecutions (similar to the current JobExplorer#findRunningJobExecutions) or JobExplorer#countRunningStepExecutions? Adding methods to the JobExplorer API should be carefully done. If a method is used only for a given feature, it should not go into JobExplorer.

Anyway, I did not work on this PR in details until now, and even though I wanted to include it in 5.2, unfortunately this won't be possible given the breaking changes and API design considerations that need to be discussed further.

Now in the meantime and if it were up to me given the context of a temporary solution, I would create a custom extension of MessageChannelPartitionHandler that overrides the polling method with a jdbc template that calls the query that count the remaining running workers. As simple as that, I would keep it simple as I know it is a temporary solution (I know we need to make the method pollReplies protected as well as the datasource field, but this is not an issue, we can do that even in a patch release).

Now that said, and I see there are currently three open PRs for this request and we need to wrap things up:

@fmbenhassine fmbenhassine added this to the 6.0.0 milestone Nov 21, 2024
@fmbenhassine fmbenhassine changed the title GH-3790: Poll the count of running step executions to fix OOM Poll workers of a partitioned step with a database query Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High memory consumption during long running jobs
4 participants