getpapers FATAL ERROR when downloading 30k articles or more #64
It seems that you have hit Node.js's (V8) default heap memory limit. You can try to run the app with the following args to increase the limit:
node --max-old-space-size=8192 ./bin/getpapers.js OTHER_ARGS
Alternatively, there is either a memory leak or a design flaw: the app does not stream data, which would prevent such errors. |
A messy workaround is to search by date slices:
search after (say) 2016,
search before 2016,
and combine the results.
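A minimal sketch of what the date slices could look like as getpapers queries, assuming Europe PMC's FIRST_PDATE field works inside the --query string (the topic term, date ranges, output directories, and the -o/-x flags are placeholders; check getpapers --help and the EPMC search docs before relying on them):
getpapers --query 'coronavirus AND FIRST_PDATE:[1900-01-01 TO 2015-12-31]' -o slice_pre2016 -x
getpapers --query 'coronavirus AND FIRST_PDATE:[2016-01-01 TO 2020-12-31]' -o slice_2016on -x
# then merge the two output directories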
|
@ziflex Thanks for the tip! I tried the terminal command you posted, but it seemed my getpapers.js was hiding. Here's what I got:
Last login: Sun Jun 21 18:37:21 on console
Error: Cannot find module '/Users/emanuelfaria/bin/getpapers.js'
So I searched my drive, found it, and updated the snippet like so:
node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js OTHER_ARGS
This is the response I received... any ideas?
error: No query given. You must provide the --query argument. |
By OTHER_ARGS I implied any valid getpapers arguments :) |
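For illustration, a full command with concrete arguments might look like the line below; the query term, output directory, and the -o/-x flags are only assumed examples (check getpapers --help for the exact options), while -k 29000 matches the limit mentioned elsewhere in this thread:
node --max-old-space-size=8192 /usr/local/lib/node_modules/getpapers/bin/getpapers.js --query 'coronavirus' -o coronavirus_papers -x -k 29000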
It may be that we should use `curl` or `ferret`. `ami download` has a curl wrapper. I haven't used it on EPMC, but it should be fairly straightforward to manage the cursor and do 1000 papers at a time. I don't see any obvious memory leaks with `curl`.
The point is that when you have got to 30,000 papers you know your query well and can afford to take the time to use another tool.
BTW, make sure you know what you are going to do with 30,000 articles. You don't want to download huge amounts if the processing has problems.
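A rough sketch of managing the cursor with plain curl against the Europe PMC REST search API; the endpoint and parameter names follow the public EPMC documentation but are assumptions here and should be verified before use:
# first page of up to 1000 results; the JSON response contains a nextCursorMark
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=coronavirus&format=json&pageSize=1000&cursorMark=*' > page1.json
# feed the returned nextCursorMark back in to fetch the next 1000, and repeat until no new cursor is returned
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=coronavirus&format=json&pageSize=1000&cursorMark=CURSOR_FROM_PAGE1' > page2.json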
|
Ooooohhhhhhh LOL. Thanks! |
@peter I don't know what curl or ferret means, but I think downloading 1000 at a time is a good idea, rather than (what looks to me like) processing them 1000 at a time and then waiting until the end to download them all. |
`curl` and `ferret` and `scrapy` are possible alternatives to `getpapers`. They all download papers automatically (the technical details may differ), but you can say "please download this many from EPMC".
The papers are only useful if they are read or analyzed. I'm suggesting that you make sure you can do this for the first 30,000 OK before downloading more.
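As a concrete example of the curl route, once a search has produced a list of IDs, each open-access full text can be fetched one at a time; the URL pattern below is the commonly documented EPMC full-text endpoint, used here as an assumption with a placeholder PMCID:
# download the open-access full-text XML for a single article
curl 'https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3258128/fullTextXML' -o PMC3258128.xml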
|
ok. Thanks Peter. |
Hi @petermr @deadlyvices
I got the error below last night trying to download more than 30k at a pop. When I limit my download to 29k or below (-k 29000), the problem goes away.
I found a discussion online about a fix to a similar-looking error here ... but even if this is useful, I don't know how to implement it. :-/
HERE'S WHAT HAPPENS:
It seems to stall in the "Retrieving results" phase. I get these incremental updates as below:
Retrieving results [==----------------------------] 97%
Retrieving results [==----------------------------] 98%
and then this error message:
<--- Last few GCs --->
[39301:0x110000000]   739506 ms: Mark-sweep 2056.4 (2064.7) -> 2055.5 (2064.7) MB, 835.1 / 0.0 ms (+ 0.0 ms in 4 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 837 ms) (average mu = 0.082, current mu = 0.002) allocatio
[39301:0x110000000]   740780 ms: Mark-sweep 2056.5 (2064.7) -> 2055.7 (2064.9) MB, 1134.1 / 0.0 ms (+ 0.0 ms in 15 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 1273 ms) (average mu = 0.098, current mu = 0.109) alloca
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0x1e91dd3c08d1
2: write [0x1e91f4b412d9] [/usr/local/lib/node_modules/getpapers/node_modules/xml2js/node_modules/sax/lib/sax.js:~965] [pc=0x33fd2f286764](this=0x1e91b044eec1 ,0x1e9221e00119 <Very long string[8244830]>)
3: /* anonymous */ [0x1e91c4bc7169] [/usr/local/lib/node_modules/getpapers/node_modu...
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Writing Node.js report to file: report.20200415.123321.39301.0.001.json
Node.js report completed
1: 0x100080c68 node::Abort() [/usr/local/bin/node]
2: 0x100080dec node::errors::TryCatchScope::~TryCatchScope() [/usr/local/bin/node]
3: 0x100185167 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
4: 0x100185103 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/usr/local/bin/node]
5: 0x10030b2f5 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/usr/local/bin/node]
6: 0x10030c9c4 v8::internal::Heap::RecomputeLimits(v8::internal::GarbageCollector) [/usr/local/bin/node]
7: 0x100309837 v8::internal::Heap::PerformGarbageCollection(v8::internal::GarbageCollector, v8::GCCallbackFlags) [/usr/local/bin/node]
8: 0x1003077fd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/usr/local/bin/node]
9: 0x100312fba v8::internal::Heap::AllocateRawWithLightRetry(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
10: 0x100313041 v8::internal::Heap::AllocateRawWithRetryOrFail(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/usr/local/bin/node]
11: 0x1002e035b v8::internal::factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [/usr/local/bin/node]
12: 0x100618718 v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [/usr/local/bin/node]
13: 0x100950919 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/usr/local/bin/node]
14: 0x1009519b3 Builtins_StringAdd_CheckNone [/usr/local/bin/node]
15: 0x33fd2f286764
Abort trap: 6