Increase S3 socket limit to 250 #2093
Conversation
Branch force-pushed from 7ba2c03 to f383ebd
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff            @@
##             main    #2093    +/-   ##
========================================
  Coverage   74.45%   74.46%
========================================
  Files          41       41
  Lines        3144     3145     +1
  Branches      688      688
========================================
+ Hits         2341     2342     +1
  Misses        683      683
  Partials      120      120

☔ View full report in Codecov by Sentry.
For me this solution is just an escape hatch which does not really solve the underlying problem, which seems to me to be "why do we open so many parallel connections to S3?". Increasing the number of parallel connections can only lead to performance issues at one point or another, and in no way makes the scraper run faster. Is it so hard to limit the number of parallel connections to S3?
@benoit74 I'm not sure I understand what a better architecture would be. Do you suggest some kind of pooling there, with S3 client workers?
I think all parallel operations in scrapers should be based on a fixed-size worker pool, yes. What we have in most (all?) Python scrapers (but I might be completely off-topic) is that whenever we do things in parallel, we have a limited set of workers (typically 4, 10, ...) working off a queue of tasks. E.g. for storing pictures in a ZIM, we have one task per picture to either retrieve from S3 or download and optimize, and 4 to 10 workers, each working on one task at a time. While this is not very adaptive (depending on whether the work is IO- or CPU-bound, the optimal number of workers might change), it avoids losing time on context switching.

I know that Node.js paradigms are different and favour being asynchronous "by design", but that was meant for web servers, where requests arrive in parallel and you cannot control how many: there it is generally better to make slow progress on all connections (to avoid timeouts) than to queue them and answer after the client has timed out, having wasted the effort. Here, in a scraper, we can totally choose our level of parallelism and there is no timeout on the tasks to perform.

What we've seen in Python scrapers is that past a certain level of parallelism, the bigger the worker pool, the slower the application, because it induces too much context switching at all levels.
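(Editorial illustration, not code from this PR: a minimal TypeScript sketch of the fixed-size worker-pool pattern described above. The `ImageTask` type, `processImage` function, and the pool size of 4 are hypothetical placeholders, not mwoffliner APIs.)

```ts
// Minimal sketch of a fixed-size worker pool draining a shared task queue.
interface ImageTask {
  url: string;
}

// Hypothetical per-task work: e.g. fetch from the S3 cache, or download,
// optimize and upload the image.
async function processImage(task: ImageTask): Promise<void> {
  // ... actual work would go here
}

// Run at most `poolSize` tasks concurrently: each worker repeatedly pulls the
// next task off the queue until the queue is empty.
async function runPool(tasks: ImageTask[], poolSize = 4): Promise<void> {
  const queue = [...tasks];
  const worker = async (): Promise<void> => {
    for (let task = queue.shift(); task !== undefined; task = queue.shift()) {
      await processImage(task);
    }
  };
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
}
```

With this shape, the level of parallelism is bounded by `poolSize` regardless of how many tasks are dispatched, which is the property being argued for above.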
Unfortunately, yes, it is. mwoffliner is designed to process articles as soon as they are available. Part of this processing includes saving the images to S3, which is done simply by dispatching an async task. There is no concept of an mwoffliner "worker". Trying to manage the rate at which mwoffliner uploads images would require rewriting major parts of the system, using architecture and abstractions that we don't even have. I recommend merging this to fix the immediate problem, and revisiting whether we want to rearchitect/rewrite mwoffliner.
I agree with this conclusion. If we have proof that we face a bottleneck here, we can reduce the level of parallelism in general.
Branch force-pushed from f383ebd to e824520
Fixes #2092
This should increase the number of underlying sockets that the AWS S3 library uses. Given that the default is 50, and the error mentioned 113 requests waiting, 250 should be sufficient.
Tested on my local machine; I was able to create a bm ZIM without problems.
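(Editorial note: a hedged sketch, not necessarily the exact change in this PR, of how such a socket cap can be raised when the S3 client is built on the AWS SDK for JavaScript v3. The `NodeHttpHandler`/`https.Agent` wiring below is an assumption for illustration.)

```ts
import { Agent } from 'node:https';
import { S3Client } from '@aws-sdk/client-s3';
import { NodeHttpHandler } from '@smithy/node-http-handler';

// The SDK's default HTTP handler caps the connection pool at 50 sockets,
// so pending image uploads queue behind a free socket. Raising maxSockets
// to 250 allows more requests to proceed in parallel.
const s3 = new S3Client({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new Agent({ maxSockets: 250, keepAlive: true }),
  }),
});
```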