getpapers FATAL ERROR when downloading 30k articles or more #64
It seems that you have hit Node.js's (V8) default heap memory limit. You can try to run the app with the following args to increase the limit:
node --max-old-space-size=8192 ./bin/getpapers.js OTHER_ARGS
Alternatively, there is either a memory leak or a design flaw: the app does not stream data, which would prevent such errors. |
A messy workaround is to search by date slices:
search after (say) 2016,
search before 2016,
and combine the results.
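A minimal sketch of what the date slices could look like as getpapers queries, assuming Europe PMC's FIRST_PDATE field works inside the --query string (the topic term, date ranges, output directories, and the -o/-x flags are placeholders; check getpapers --help and the EPMC search docs before relying on them):
getpapers --query 'coronavirus AND FIRST_PDATE:[1900-01-01 TO 2015-12-31]' -o slice_pre2016 -x
getpapers --query 'coronavirus AND FIRST_PDATE:[2016-01-01 TO 2020-12-31]' -o slice_2016on -x
# then merge the two output directories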
|
@ziflex Thanks for the tip! I tried the terminal command you posted, but it seemed my getpapers.js was hiding. Here's what I got:
Last login: Sun Jun 21 18:37:21 on console
Error: Cannot find module '/Users/emanuelfaria/bin/getpapers.js'
So I searched my drive, found it, and updated the snippet like so:
node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js OTHER_ARGS
This is the response I received... any ideas?
error: No query given. You must provide the --query argument. |
By OTHER_ARGS I implied any valid getpapers arguments :) |
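For illustration, a full command with concrete arguments might look like the line below; the query term, output directory, and the -o/-x flags are only assumed examples (check getpapers --help for the exact options), while -k 29000 matches the limit mentioned elsewhere in this thread:
node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js --query 'coronavirus' -o coronavirus_papers -x -k 29000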
It may be that we should use `curl` or `ferret`. `ami download` has a curl wrapper. I haven't used it on EPMC, but it should be fairly straightforward to manage the cursor and do 1000 papers at a time. I don't see any obvious memory leaks with `curl`.
The point is that when you have got to 30,000 papers you know your query well and can afford to take the time to use another tool.
BTW, make sure you know what you are going to do with 30,000 articles. You don't want to download huge amounts if the processing has problems.
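A rough sketch of managing the cursor with plain curl against the Europe PMC REST search API; the endpoint and parameter names follow the public EPMC documentation but are assumptions here and should be verified before use:
# first page of up to 1000 results; the JSON response contains a nextCursorMark
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=coronavirus&format=json&pageSize=1000&cursorMark=*' > page1.json
# feed the returned nextCursorMark back in to fetch the next 1000, and repeat until no new cursor is returned
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=coronavirus&format=json&pageSize=1000&cursorMark=CURSOR_FROM_PAGE1' > page2.json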
|
Ooooohhhhhhh LOL. Thanks! |
@peter I don't know what curl or ferret means, but I think downloading 1000 at a time is a good idea, rather than (what looks to me like) processing them 1000 at a time and then waiting until the end to download them all. |
`curl` and `ferret` and `scrapy` are possible alternatives to `getpapers`. They all download papers automatically (the technical details may differ), but you can say "please download this many from EPMC".
The papers are only useful if they are read or analyzed. I'm suggesting that you make sure you can do this for the first 30,000 OK before downloading more.
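As a concrete example of the curl route, once a search has produced a list of IDs, each open-access full text can be fetched one at a time; the URL pattern below is the commonly documented EPMC full-text endpoint, used here as an assumption with a placeholder PMCID:
# download the open-access full-text XML for a single article
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3258128/fullTextXML' -o PMC3258128.xml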
|
ok. Thanks Peter. |
Hi @petermr @deadlyvices
I got the error below last night trying to download more than 30k at a pop. When I limit my download to 29k or below (-k 29000), the problem goes away.
I found a discussion online about a fix to a similar-looking error here ... but even if this is useful, I don't know how to implement it. :-/
HERE'S WHAT HAPPENS:
It seems to stall in the "Retrieving results" phase. I get these incremental updates as below:
Retrieving results [==----------------------------] 97%
Retrieving results [==----------------------------] 98%
and then this error message:
<--- Last few GCs --->
[39301:0x110000000]   739506 ms: Mark-sweep 2056.4 (2064.7) -> 2055.5 (2064.7) MB, 835.1 / 0.0 ms (+ 0.0 ms in 4 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 837 ms) (average mu = 0.082, current mu = 0.002) allocatio
[39301:0x110000000]   740780 ms: Mark-sweep 2056.5 (2064.7) -> 2055.7 (2064.9) MB, 1134.1 / 0.0 ms (+ 0.0 ms in 15 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1273 ms) (average mu = 0.098, current mu = 0.109) alloca
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x1e91dd3c08d1
2: write [0x1e91f4b412d9] [/usr/local/lib/node_modules/getpapers/node_modules/xml2js/node_modules/sax/lib/sax.js:~965] [pc=0x33fd2f286764](this=0x1e91b044eec1 ,0x1e9221e00119 <Very long string[8244830]>)
3: /* anonymous */ [0x1e91c4bc7169] [/usr/local/lib/node_modules/getpapers/node_modu...
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Writing Node.js report to file: report.20200415.123321.39301.0.001.json
Node.js report completed
1: 0x100080c68 node::Abort() [/usr/local/bin/node]
2: 0x100080dec node::errors::TryCatchScope::~TryCatchScope() [/usr/local/bin/node]
3: 0x100185167 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
4: 0x100185103 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
5: 0x10030b2f5 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
6: 0x10030c9c4 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/usr/local/bin/node]
7: 0x100309837 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/local/bin/node]
8: 0x1003077fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
9: 0x100312fba v8::internal::Heap::AllocateRawWithLightRetry(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
10: 0x100313041 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
11: 0x1002e035b v8::internal::factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/local/bin/node]
12: 0x100618718 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
13: 0x100950919 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/usr/local/bin/node]
14: 0x1009519b3 Builtins_StringAdd_CheckNone [/usr/local/bin/node]
15: 0x33fd2f286764
Abort trap: 6