
Increase S3 socket limit to 250 #2093

Merged — kelson42 merged 1 commit into main from concurrent-sockets on Nov 2, 2024
Conversation

audiodude
Member

Fixes #2092

This should increase the number of underlying sockets that the AWS S3 library uses. Given that the default is 50, and the error mentioned 113 requests waiting, 250 should be sufficient.

Tested on my local machine: I was able to create a bm ZIM without problems.
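For context, a minimal sketch of the kind of change involved, assuming the AWS SDK for JavaScript v3 and its default Node.js request handler; the actual client setup in mwoffliner may differ:

```ts
import { Agent } from 'node:https'
import { S3Client } from '@aws-sdk/client-s3'
import { NodeHttpHandler } from '@smithy/node-http-handler'

// The SDK's Node.js handler caps concurrent sockets per host at 50 by
// default; requests beyond the cap queue inside the agent. Raising
// maxSockets to 250 leaves headroom for the ~113 queued requests
// reported in #2092.
const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ keepAlive: true, maxSockets: 250 }),
  }),
})
```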


codecov bot commented Oct 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.46%. Comparing base (a9251cd) to head (e824520).
Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2093   +/-   ##
=======================================
  Coverage   74.45%   74.46%           
=======================================
  Files          41       41           
  Lines        3144     3145    +1     
  Branches      688      688           
=======================================
+ Hits         2341     2342    +1     
  Misses        683      683           
  Partials      120      120           


@benoit74
Contributor

For me this solution is just an escape hatch; it does not really solve the underlying problem, which seems to me to be rather: why do we start so many parallel connections to S3 in the first place? Increasing the number of parallel connections can only lead to performance issues at some point, and in no way makes the scraper run faster. Is it so hard to limit the number of parallel connections to S3?

@kelson42
Collaborator

kelson42 commented Oct 31, 2024

@benoit74 I'm not sure I understand what a better architecture would look like. Do you suggest some kind of pooling, with S3 client workers?

@benoit74
Contributor

I think all parallel operations in scrapers should be based on a fixed-size worker pool, yes.

What we have in most (all?) Python scrapers (but I might be completely off-topic) is that whenever we do things in parallel, we have a limited set of workers (typically 4, 10, ...) working off a queue of tasks. E.g. for storing pictures in a ZIM, we have one task per picture, to either retrieve it from S3 or download and optimize it, and 4 to 10 workers, each working on one task at a time. While this is not very adaptive (the optimal number of workers might change depending on whether the work is IO- or CPU-bound), it avoids losing time on context switching. A minimal sketch of the pattern follows this comment.

I know that Node.js paradigms are different and favor asynchronous "by design", but that design was meant for web servers, where requests arrive in parallel and you cannot control how many. There it is generally better to progress very slowly on all connections (to avoid timeouts) than to queue them and answer after the client's timeout, wasting the effort. Here in a scraper, we can totally choose our level of parallelism, and we have no timeout on the tasks to perform.

What we've seen in Python scrapers is that past a certain level of parallelism, the bigger the worker pool, the slower the application, because it induces too much context switching at all levels.
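A minimal sketch of the worker-pool pattern described above; the task type, `uploadToS3`, and `images` are hypothetical names for illustration, not an existing mwoffliner abstraction:

```ts
// Fixed-size worker pool: `size` workers drain a shared queue, so at
// most `size` tasks (e.g. S3 uploads) are in flight at any time.
async function runPool<T>(tasks: T[], work: (task: T) => Promise<void>, size = 10): Promise<void> {
  const queue = [...tasks]
  const workers = Array.from({ length: size }, async () => {
    // Each worker pulls the next task; shift() is safe here because the
    // event loop runs this code single-threaded between awaits.
    for (let task = queue.shift(); task !== undefined; task = queue.shift()) {
      await work(task)
    }
  })
  await Promise.all(workers)
}

// Hypothetical usage: at most 10 parallel uploads, however many images exist.
// await runPool(images, (img) => uploadToS3(img), 10)
```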

@audiodude
Member Author

> Is it so hard to limit the number of parallel connections to S3?

Unfortunately, yes, it is.

mwoffliner is designed to process articles as soon as they are available. Part of this processing includes saving the images to S3, which is done simply by dispatching an async task. There is no concept of an mwoffliner "worker".

Trying to manage the rate at which mwoffliner uploads images would require rewriting major parts of the system, using architecture and abstractions that we don't even have.
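To illustrate, a hedged sketch of the fire-and-forget shape described above; the types and function names are hypothetical, not mwoffliner's actual API:

```ts
interface Image { url: string }
interface Article { images: Image[] }

declare function uploadImageToS3(image: Image): Promise<void> // hypothetical
declare function saveArticle(article: Article): Promise<void> // hypothetical

// Articles are processed as soon as they are available, and each one
// dispatches its S3 uploads without awaiting them. Concurrency is
// therefore bounded only by the HTTP agent's socket limit, not by any
// worker abstraction, so throttling would mean restructuring this flow.
async function processArticle(article: Article): Promise<void> {
  for (const image of article.images) {
    void uploadImageToS3(image) // fire-and-forget: never awaited here
  }
  await saveArticle(article)
}
```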

I recommend merging this to fix the immediate problem, and revisiting whether we want to rearchitect/rewrite mwoffliner.

@kelson42
Collaborator

kelson42 commented Nov 2, 2024

> I recommend merging this to fix the immediate problem, and revisiting whether we want to rearchitect/rewrite mwoffliner.

I agree with this conclusion. If we have proof that we face a bottleneck here, we can reduce the degree of parallelisation in general.

@kelson42 kelson42 merged commit 53dbc8c into main Nov 2, 2024
3 checks passed
@kelson42 kelson42 deleted the concurrent-sockets branch November 2, 2024 06:48
Development

Successfully merging this pull request may close these issues.

Too many requests are started in parallel (#2092)