Increase S3 socket limit to 250 #2093
Conversation
Branch force-pushed from 7ba2c03 to f383ebd
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##             main    #2093    +/-   ##
========================================
  Coverage   74.45%   74.46%
========================================
  Files          41       41
  Lines        3144     3145     +1
  Branches      688      688
========================================
+ Hits         2341     2342     +1
  Misses        683      683
  Partials      120      120

☔ View full report in Codecov by Sentry.
For me this solution is just an escape hatch which does not really solve the underlying problem, which seems to me to be "why do we open so many parallel connections to S3?". Increasing the number of parallel connections can only lead to performance issues at one point or another, and in no way makes the scraper run faster. Is it so hard to limit the number of parallel connections to S3?
@benoit74 I'm not sure I understand what a better architecture would be. Do you suggest some kind of pooling there, with S3 client workers?
I think all parallel operations in scrapers should be based on a fixed-size worker pool, yes. What we have in most (all?) Python scrapers (but I might be completely off-topic) is that whenever we do things in parallel, we have a limited set of workers (typically 4, 10, ...) working off a queue of tasks. E.g. for storing pictures in a ZIM, we have one task per picture to either retrieve from S3 or download and optimize, and 4 to 10 workers, each working on one task at a time. While this is not very adaptive (depending on whether the work is IO- or CPU-bound, the optimal number of workers might change), it avoids losing time on context switching.

I know that Node.js paradigms are different and favour being asynchronous "by design", but that was meant for web servers, where requests arrive in parallel and you cannot control how many: there it is generally better to make slow progress on all connections (to avoid timeouts) than to queue them and answer after the client has timed out, having wasted the effort. Here, in a scraper, we can totally choose our level of parallelism and there is no timeout on the tasks to perform.

What we've seen in Python scrapers is that past a certain level of parallelism, the bigger the worker pool, the slower the application, because it induces too much context switching at all levels.
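(Editorial illustration, not code from this PR: a minimal TypeScript sketch of the fixed-size worker-pool pattern described above. The `ImageTask` type, `processImage` function, and the pool size of 4 are hypothetical placeholders, not mwoffliner APIs.)

```ts
// Minimal sketch of a fixed-size worker pool draining a shared task queue.
interface ImageTask {
  url: string;
}

// Hypothetical per-task work: e.g. fetch from the S3 cache, or download,
// optimize and upload the image.
async function processImage(task: ImageTask): Promise<void> {
  // ... actual work would go here
}

// Run at most `poolSize` tasks concurrently: each worker repeatedly pulls the
// next task off the queue until the queue is empty.
async function runPool(tasks: ImageTask[], poolSize = 4): Promise<void> {
  const queue = [...tasks];
  const worker = async (): Promise<void> => {
    for (let task = queue.shift(); task !== undefined; task = queue.shift()) {
      await processImage(task);
    }
  };
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
}
```

With this shape, the level of parallelism is bounded by `poolSize` regardless of how many tasks are dispatched, which is the property being argued for above.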
Unfortunately, yes, it is. mwoffliner is designed to process articles as soon as they are available. Part of this processing includes saving the images to S3, which is done simply by dispatching an async task. There is no concept of an mwoffliner "worker". Trying to manage the rate at which mwoffliner uploads images would require rewriting major parts of the system, using architecture and abstractions that we don't even have. I recommend merging this to fix the immediate problem, and revisiting whether we want to rearchitect/rewrite mwoffliner.
I agree with this conclusion. If we have proof that we face a bottleneck here, we can reduce the level of parallelism in general.
Branch force-pushed from f383ebd to e824520
Fixes #2092
This should increase the number of underlying sockets that the AWS S3 library uses. Given that the default is 50, and the error mentioned 113 requests waiting, 250 should be sufficient.
Tested on my local machine; I was able to create a bm ZIM without problems.
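(Editorial note: a hedged sketch, not necessarily the exact change in this PR, of how such a socket cap can be raised when the S3 client is built on the AWS SDK for JavaScript v3. The `NodeHttpHandler`/`https.Agent` wiring below is an assumption for illustration.)

```ts
import { Agent } from 'node:https';
import { S3Client } from '@aws-sdk/client-s3';
import { NodeHttpHandler } from '@smithy/node-http-handler';

// The SDK's default HTTP handler caps the connection pool at 50 sockets,
// so pending image uploads queue behind a free socket. Raising maxSockets
// to 250 allows more requests to proceed in parallel.
const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ maxSockets: 250, keepAlive: true }),
  }),
});
```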