
Scrape X-rxiv via API #33

Open
jannisborn opened this issue Sep 25, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@jannisborn
Owner

Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.

Ideally, these dumps should be stored on a server and updated regularly (cron job). Users would just send requests to the server API. That would be the new default behaviour, but local download should still be supported too.

AstroWaffleRobot commented Jan 11, 2024

Hi there. Thanks for your work on this project. As a temporary solution, I've saved the DBs in a requester-pays S3 bucket. To download the JSONL files, use these commands:

aws s3 cp s3://astrowafflerp/biorxiv.jsonl biorxiv.jsonl --request-payer
aws s3 cp s3://astrowafflerp/chemrxiv.jsonl chemrxiv.jsonl --request-payer
aws s3 cp s3://astrowafflerp/medrxiv.jsonl medrxiv.jsonl --request-payer

https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html
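
If you prefer doing this from Python instead of the CLI, a boto3 equivalent would look roughly like the sketch below (same bucket and keys as above; RequestPayer="requester" is what confirms you accept the transfer charges on a requester-pays bucket):

import boto3

# Download the three dumps from the requester-pays bucket.
s3 = boto3.client("s3")
for name in ("biorxiv", "chemrxiv", "medrxiv"):
    s3.download_file(
        Bucket="astrowafflerp",
        Key=f"{name}.jsonl",
        Filename=f"{name}.jsonl",
        ExtraArgs={"RequestPayer": "requester"},
    )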

I've got a cron job that runs daily, so they should be current, but let me know if you have any trouble.

Here's the maintainer script: https://github.com/AstroWaffleRobot/getlit

@jannisborn
Owner Author

Hi @AstroWaffleRobot,
Thanks, this is a nice initiative, and it's great that the script is also available.
I'd like to have an internal solution inside paperscraper; the easy way would be to adapt your code into an update_dumps() function that updates all local dumps since the tool was last used. It would also be easy to trigger it automatically whenever a search is performed, to make sure the data is up to date.
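
Something along these lines could work as a first sketch, using the public bioRxiv/medRxiv details API directly; the dump location and the "update since last modification" logic are assumptions rather than the package's current layout, and chemRxiv would need its own branch since it sits behind a different API (Cambridge Open Engage):

import json
from datetime import date
from pathlib import Path

import requests

# Assumed location of the local dumps; the real paperscraper paths may differ.
DUMP_DIR = Path.home() / ".paperscraper" / "server_dumps"

def update_dump(server: str, dump_path: Path) -> None:
    # Append records posted since the dump file was last modified, using the
    # public bioRxiv/medRxiv details API (pages of up to 100 records).
    start = date.fromtimestamp(dump_path.stat().st_mtime).isoformat()
    end = date.today().isoformat()
    cursor = 0
    with dump_path.open("a") as fh:
        while True:
            url = f"https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}"
            records = requests.get(url, timeout=30).json().get("collection", [])
            for record in records:
                fh.write(json.dumps(record) + "\n")
            if len(records) < 100:
                break
            cursor += 100

def update_dumps() -> None:
    # Only the Cold Spring Harbor servers are covered here.
    for server in ("biorxiv", "medrxiv"):
        update_dump(server, DUMP_DIR / f"{server}.jsonl")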

@jannisborn
Owner Author

Long-term, I want to create a lightweight API that I can deploy on my own VM to serve the requests. On the VM, a daily cron job would update the data, and the API would run the package itself in its current mode, where data is assumed to be locally available. That way there's dual usage: users could either use the package out of the box without the slow download of the dumps, or do it the old (current) way by downloading the dumps first.
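
The server piece could be as small as the sketch below (FastAPI shown only as an example framework; the dump directory, record fields, and the naive keyword filter are assumptions, and the real endpoint would reuse the package's existing query code on dumps kept fresh by the cron job):

import json
from pathlib import Path

from fastapi import FastAPI

# Assumed dump location on the VM; a daily cron job keeps these files current.
DUMP_DIR = Path("/data/paperscraper/server_dumps")

app = FastAPI()

@app.get("/search/{server}")
def search(server: str, keywords: str, max_results: int = 100):
    # Return dump records whose title or abstract contain all comma-separated keywords.
    terms = [k.strip().lower() for k in keywords.split(",")]
    hits = []
    with (DUMP_DIR / f"{server}.jsonl").open() as fh:
        for line in fh:
            record = json.loads(line)
            text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
            if all(t in text for t in terms):
                hits.append(record)
                if len(hits) >= max_results:
                    break
    return hits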

@yarikoptic
Contributor

I guess no more of that bucket?

dandi@drogon:~$ aws s3 ls s3://astrowafflerp/

An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

FWIW -- I wanted to check the sizes; I could probably have picked up serving those from https://datasets.datalad.org/ or some other S3 bucket.
