Poll workers of a partitioned step with a database query #4705

Ichanskiy · 2024-11-12T18:13:36Z

Fix OutOfMemory issue by optimizing step result checks. Previously, thousands of StepExecutions were loaded into memory every 10000 ms, causing memory overload. There’s no point in loading all StepExecutions into memory to check the running status at short intervals.

private long pollInterval = 10000;
Poller<Set<StepExecution>> poller = new DirectPoller<>(pollInterval);
Future<Set<StepExecution>> resultsFuture = poller.poll(callback);
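For readers unfamiliar with the polling pattern in the snippet above, here is a minimal sketch of the idea: a callback is invoked once per interval until it returns a non-null result. `SimplePoller` is a hypothetical stand-in written for illustration, not Spring Batch's `DirectPoller`, and the use of `Supplier` is an assumption made to keep the example free of checked exceptions.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for illustration only; not Spring Batch's DirectPoller.
final class SimplePoller<T> {

    private final long intervalMillis;

    SimplePoller(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Calls the callback until it returns a non-null value, sleeping between attempts.
    // Whatever the callback allocates (for example a full set of step executions)
    // is allocated again on every attempt.
    T poll(Supplier<T> callback) {
        while (true) {
            T result = callback.get();
            if (result != null) {
                return result;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
    }
}
```

With a 10000 ms interval and a job that runs for hours, the callback runs hundreds of times, which is why the cost of a single invocation matters in this discussion.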

We've had this code in production for over a year now, and we no longer get any OutOfMemory errors. It works! However, with each Spring Boot upgrade in our services we have to patch Spring Batch with this fix again, which means recreating the .patch file, deploying the artifact, and so on every time.

That's why we've created this pull request, which essentially addresses the same problem as PR 3791. Although that PR was approved, it cannot be merged due to conflicts. @fmbenhassine asked for the conflicts to be resolved, but this was not done. In my pull request, I've included the code with the latest changes.

I would greatly appreciate it if you could review my pull request, so we can merge it and permanently resolve the OutOfMemory issue.

Closes #3790


hpoettker · 2024-11-13T08:16:51Z

Have you tried Spring Batch 5.1.2 or Spring Batch 5.0.6? Both contain a fix for #4598, which looks like the same issue as #3790 to me.

It would be very helpful to see how the fix for #4598 affects your application. Does it solve your problem? Does it improve the situation a little? Does it not help at all?

Ichanskiy · 2024-11-13T09:18:06Z

> Have you tried Spring Batch 5.1.2 or Spring Batch 5.0.6? Both contain a fix for #4598, which looks like the same issue as #3790 to me.
>
> It would be very helpful to see how the fix for #4598 affects your application. Does it solve your problem? Does it improve the situation a little? Does it not help at all?

The main problem is that all steps are loaded into memory to check the step result. If there are many steps (we have thousands), this leads to an OutOfMemory error.
The fix you mentioned doesn't solve our problem because all steps are still loaded into memory. My fix (or PR 3791) doesn't load all the steps; instead, it just retrieves the count of incomplete steps from the database. This is much faster and doesn't consume a lot of memory.
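To make the "count instead of load" idea concrete, here is a minimal sketch using plain JDBC. It is not the code from this PR. It assumes the default `BATCH_` table prefix of the Spring Batch meta-data schema, and `RunningWorkerCounter`, `buildCountSql`, and `countByStatus` are hypothetical names. The step-name pattern is an assumption too: it is there so that only worker step executions are counted and not the manager step itself, and the caller has to supply a pattern that matches its own naming.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
final class RunningWorkerCounter {

    // Builds a COUNT query with one placeholder per status value.
    static String buildCountSql(int statusCount) {
        if (statusCount < 1) {
            throw new IllegalArgumentException("at least one status is required");
        }
        String placeholders = String.join(", ", Collections.nCopies(statusCount, "?"));
        return "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION"
                + " WHERE JOB_EXECUTION_ID = ? AND STEP_NAME LIKE ?"
                + " AND STATUS IN (" + placeholders + ")";
    }

    // Returns how many matching step executions are in one of the given statuses.
    // Only a single number travels from the database to the JVM.
    static long countByStatus(Connection connection, long jobExecutionId,
                              String stepNamePattern, List<String> statuses) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(buildCountSql(statuses.size()))) {
            ps.setLong(1, jobExecutionId);
            ps.setString(2, stepNamePattern);
            for (int i = 0; i < statuses.size(); i++) {
                ps.setString(i + 3, statuses.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                rs.next(); // COUNT(*) always yields exactly one row
                return rs.getLong(1);
            }
        }
    }
}
```

A caller would pass the status names it considers "still running" (for example `STARTING` and `STARTED`) and treat a result of zero as "all workers finished".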

hpoettker · 2024-11-13T09:55:06Z

I feel that your response evades my question of whether you have tried the most recent fix. Please give the fix in Spring Batch 5.1.2 or 5.0.6 an honest try, if you haven't done so. I'd really appreciate the feedback.

You are right that the implementation calls JobExplorer::getJobExecution, which loads all step executions. But in the current implementation the result of that call can be completely garbage collected after each non-final poll. So this does not lead to the memory requirements that previous implementations had. Only enough heap space for loading all step executions once is required.

The reason I'm persisting is this: Your proposed implementation of MessageChannelPartitionHandler still loads basically all step executions into memory at the end through your proposed new method getStepExecutions. This means you should already have sufficient heap space to load all step executions in your application.

Ichanskiy · 2024-11-13T12:15:18Z

Perhaps I didn’t express myself entirely clearly. The main problem is that all steps are loaded into memory to check the step result every 10,000 ms.

private long pollInterval = 10000;
Poller<Set<StepExecution>> poller = new DirectPoller<>(pollInterval);
Future<Set<StepExecution>> resultsFuture = poller.poll(callback);

In other words, this is done regularly at short intervals. With large long-running jobs, we encounter an OOM error because the memory keeps filling up with these steps every 10,000 ms.

> The reason I'm persisting is this: Your proposed implementation of MessageChannelPartitionHandler still loads basically all step executions into memory at the end through your proposed new method getStepExecutions. This means you should already have sufficient heap space to load all step executions in your application.

You’re right, we have enough memory to load all steps from the database if it’s done as a one-time action (as in my fix). But in the current implementation, we do this with a pollInterval = 10,000 ms, and garbage collection can't keep up. This has been verified experimentally. As soon as we stop constantly loading everything into memory while the job is running, we no longer have memory issues. This code has been working in production for over a year.

hpoettker · 2024-11-13T22:11:47Z

At the end of the day, I'm not the final arbiter of this discussion as I'm just a contributor and not a maintainer of the project. So feel free to discard what I'm saying.

The reproducing example from #4598 reduced the poll interval from the default 10,000 ms to just 2 ms. And the fix that is included in the latest Spring Batch 5.x releases has been successfully tested against this reproducer.

I understand that you are very much invested in your proposed solution. But it's a breaking change in public APIs, which would potentially lead to migration efforts for other community members. I'd still be happy to support you, if you publish a minimal complete reproducible example against the latest 5.x releases.

fmbenhassine · 2024-11-20T23:59:49Z

@Ichanskiy @hpoettker Thank you both for your constructive feedback on each other's comments! It is really a pleasure to see such an interesting discussion conducted in a constructive way.

I understand the issue and my opinion is that the final arbiter is pragmatism and common sense. As mentioned in my previous comment #3790 (comment), any computation that can be done at the database level should be done at the database level, unless there is a very good reason not to do so.

I think it is important to consider the past context in order to explain the reason behind previous fixes and also to make the best decision now. When the polling performance issue was first reported, we did our best to fix the memory leak in a backward compatible way in 93800c6 (Many thanks to @hpoettker for that!). At that time, we had to do it that way because we were required to backport the patch to 5.0.x. A change set like in this PR is clearly breaking and cannot be considered as such in a patch version, not even in a minor version (exceptions are accepted if we manage to have default implementations of new methods added in public interfaces).

That said and based on facts and numbers, I believe that the performance issue was resolved for a single partitioned step. Now if someone runs a thousand partitioned steps in the same JVM with a short polling interval, I think everyone would agree that this becomes out of the scope of Spring Batch. So it seems like the OOM discussed here is due to the deployment anti-pattern of running several batch jobs in the same JVM.

Again, ideally the count should be done with a single query on the db side, and we are not there yet as of 5.1.2. While this PR goes in the right direction, it is still not the optimal way (in my opinion) to achieve that. Why is there a new method to get step executions by a given list of statuses and not something like JobExplorer#findRunningStepExecutions (similar to the current JobExplorer#findRunningJobExecutions) or JobExplorer#countRunningStepExecutions? Adding methods to the JobExplorer API should be done carefully. If a method is used only for a given feature, it should not go into JobExplorer.
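As a rough illustration of the backward-compatibility point made earlier (new methods on public interfaces are acceptable when they ship with default implementations), here is a minimal sketch. `WorkerExplorer`, `WorkerStatus`, and `LegacyExplorer` are hypothetical types invented for this example; they are not Spring Batch's `JobExplorer` or `BatchStatus`, and the set of statuses treated as "running" is an assumption.

```java
import java.util.List;

// Hypothetical status type for illustration; not Spring Batch's BatchStatus.
enum WorkerStatus {
    STARTING, STARTED, COMPLETED, FAILED;

    boolean isRunning() {
        return this == STARTING || this == STARTED;
    }
}

// Hypothetical explorer interface for illustration; not Spring Batch's JobExplorer.
interface WorkerExplorer {

    // Pre-existing method: loads the status of every worker.
    List<WorkerStatus> findWorkerStatuses(long jobExecutionId);

    // Newly added method. The default body falls back to the existing method,
    // so implementations written before this method existed still compile.
    // A database-backed implementation can override it with a single COUNT query.
    default long countRunningWorkers(long jobExecutionId) {
        return findWorkerStatuses(jobExecutionId).stream()
                .filter(WorkerStatus::isRunning)
                .count();
    }
}

// An "old" implementation that knows nothing about the new method.
final class LegacyExplorer implements WorkerExplorer {

    private final List<WorkerStatus> statuses;

    LegacyExplorer(List<WorkerStatus> statuses) {
        this.statuses = statuses;
    }

    @Override
    public List<WorkerStatus> findWorkerStatuses(long jobExecutionId) {
        return statuses;
    }
}
```

The default body is no cheaper than before; it only preserves compatibility. The memory benefit appears once an implementation overrides the method with a query on the database side.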

Anyway, I did not work on this PR in detail until now, and even though I wanted to include it in 5.2, unfortunately this won't be possible given the breaking changes and API design considerations that need to be discussed further.

In the meantime, if it were up to me and given that this would be a temporary solution, I would create a custom extension of MessageChannelPartitionHandler that overrides the polling method and uses a JDBC template to run a query that counts the remaining running workers. I would keep it that simple, since it is only a temporary solution. (I know we need to make the method pollReplies protected, as well as the datasource field, but this is not an issue; we can do that even in a patch release.)
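The shape of that workaround can be sketched without any Spring dependencies. Everything below is hypothetical and simplified: `LoadingHandler` stands in for a handler whose polling hook is protected, `CountingHandler` is the custom extension, worker results are reduced to status strings, and the count is supplied through a `LongSupplier` that a real extension might back with a JDBC count query.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical base class: every poll loads all worker results.
class LoadingHandler {

    protected final Supplier<List<String>> loadWorkerStatuses;

    LoadingHandler(Supplier<List<String>> loadWorkerStatuses) {
        this.loadWorkerStatuses = loadWorkerStatuses;
    }

    // Polling hook; returns null while at least one worker is still running.
    protected List<String> pollOnce() {
        List<String> statuses = loadWorkerStatuses.get();
        boolean running = statuses.stream()
                .anyMatch(s -> s.equals("STARTING") || s.equals("STARTED"));
        return running ? null : statuses;
    }

    // Repeats pollOnce() until it yields a result, sleeping between attempts.
    final List<String> awaitReplies(long intervalMillis) {
        while (true) {
            List<String> result = pollOnce();
            if (result != null) {
                return result;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while polling", e);
            }
        }
    }
}

// Hypothetical extension: polls a cheap count and loads the results only once.
class CountingHandler extends LoadingHandler {

    private final LongSupplier runningWorkerCount;

    CountingHandler(Supplier<List<String>> loadWorkerStatuses, LongSupplier runningWorkerCount) {
        super(loadWorkerStatuses);
        this.runningWorkerCount = runningWorkerCount;
    }

    @Override
    protected List<String> pollOnce() {
        if (runningWorkerCount.getAsLong() > 0) {
            return null; // still running; nothing was loaded into memory
        }
        return loadWorkerStatuses.get(); // single full load at the end
    }
}
```

The design choice is that only the polling hook changes: the surrounding wait loop and the final aggregation stay as they are, which is what makes this workable as a temporary extension.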

That said, I see there are currently three open PRs for this request, and we need to wrap things up:

- Poll the count of running step executions #3791: goes in the right direction, but is not optimal. I asked for a rebase, but got no reply from the contributor. It will be closed.
- Poll the count of running step executions (updated) #4530: seems to be an updated version of #3791, meaning it updates something that goes in the right direction but is still not optimal. It will be closed.
- Poll workers of a partitioned step with a database query #4705: this PR. It goes in the right direction and is still not optimal, but it is the closest one to the best way of doing things in my opinion. I will keep it open and add a comment on how to update it.

Commit 22f98ef: GH-3790: Poll the count of running step executions to fix OOM

This was referenced Nov 21, 2024:

- Poll the count of running step executions (updated) #4530 (Closed)
- Poll the count of running step executions #3791 (Closed)

fmbenhassine added pr-for: enhancement in: integration in: core labels Nov 21, 2024

fmbenhassine added this to the 6.0.0 milestone Nov 21, 2024

fmbenhassine changed the title ~~GH-3790: Poll the count of running step executions to fix OOM~~ Poll workers of a partitioned step with a database query Nov 21, 2024
