
Scrape X-rxiv via API #33

Open
jannisborn opened this issue Sep 25, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

@jannisborn
Owner

Currently, bio/med/chemrxiv scraping requires the user to first download the entire DB and store it locally.

Ideally, these dumps should be stored on a server and updated regularly (cron job). Users would just send requests to the server API. That would be the new default behaviour, but local download should still be supported too.

AstroWaffleRobot commented Jan 11, 2024

Hi there. Thanks for your work on this project. As a temporary solution, I've saved the DBs in a requester-pays S3 bucket. To download the JSONL files, use these commands:

aws s3 cp s3://astrowafflerp/biorxiv.jsonl biorxiv.jsonl --request-payer
aws s3 cp s3://astrowafflerp/chemrxiv.jsonl chemrxiv.jsonl --request-payer
aws s3 cp s3://astrowafflerp/medrxiv.jsonl medrxiv.jsonl --request-payer

https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html
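
If you prefer doing this from Python instead of the CLI, a boto3 equivalent would look roughly like the sketch below (same bucket and keys as above; RequestPayer="requester" is what confirms you accept the transfer charges on a requester-pays bucket):

import boto3

# Download the three dumps from the requester-pays bucket.
s3 = boto3.client("s3")
for name in ("biorxiv", "chemrxiv", "medrxiv"):
    s3.download_file(
        Bucket="astrowafflerp",
        Key=f"{name}.jsonl",
        Filename=f"{name}.jsonl",
        ExtraArgs={"RequestPayer": "requester"},
    )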

I've got a cron job that runs daily, so they should be current, but let me know if you have any trouble.

Here's the maintainer script: https://github.com/AstroWaffleRobot/getlit

@jannisborn
Owner Author

Hi @AstroWaffleRobot,
Thanks, this is a nice initiative, and it's great that the script is also available.
I'd like to have an internal solution inside paperscraper; the easy way would be to adapt your code into an update_dumps() function that updates all local dumps since the tool was last used. It would also be easy to trigger it automatically whenever a search is performed, to make sure the data is up to date.
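
Something along these lines could work as a first sketch, using the public bioRxiv/medRxiv details API directly; the dump location and the "update since last modification" logic are assumptions rather than the package's current layout, and chemRxiv would need its own branch since it sits behind a different API (Cambridge Open Engage):

import json
from datetime import date
from pathlib import Path

import requests

# Assumed location of the local dumps; the real paperscraper paths may differ.
DUMP_DIR = Path.home() / ".paperscraper" / "server_dumps"

def update_dump(server: str, dump_path: Path) -> None:
    # Append records posted since the dump file was last modified, using the
    # public bioRxiv/medRxiv details API (pages of up to 100 records).
    start = date.fromtimestamp(dump_path.stat().st_mtime).isoformat()
    end = date.today().isoformat()
    cursor = 0
    with dump_path.open("a") as fh:
        while True:
            url = f"https://api.biorxiv.org/details/{server}/{start}/{end}/{cursor}"
            records = requests.get(url, timeout=30).json().get("collection", [])
            for record in records:
                fh.write(json.dumps(record) + "\n")
            if len(records) < 100:
                break
            cursor += 100

def update_dumps() -> None:
    # Only the Cold Spring Harbor servers are covered here.
    for server in ("biorxiv", "medrxiv"):
        update_dump(server, DUMP_DIR / f"{server}.jsonl")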

@jannisborn
Owner Author

Long-term, I want to create a lightweight API that I can deploy on my own VM to serve the requests. On the VM, a daily cron job would update the data, and the API would run the package itself in its current mode, where data is assumed to be locally available. That way there's dual usage: users could either use the package out of the box without the slow download of the dumps, or do it the old (current) way by downloading the dumps first.
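
The server piece could be as small as the sketch below (FastAPI shown only as an example framework; the dump directory, record fields, and the naive keyword filter are assumptions, and the real endpoint would reuse the package's existing query code on dumps kept fresh by the cron job):

import json
from pathlib import Path

from fastapi import FastAPI

# Assumed dump location on the VM; a daily cron job keeps these files current.
DUMP_DIR = Path("/data/paperscraper/server_dumps")

app = FastAPI()

@app.get("/search/{server}")
def search(server: str, keywords: str, max_results: int = 100):
    # Return dump records whose title or abstract contain all comma-separated keywords.
    terms = [k.strip().lower() for k in keywords.split(",")]
    hits = []
    with (DUMP_DIR / f"{server}.jsonl").open() as fh:
        for line in fh:
            record = json.loads(line)
            text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
            if all(t in text for t in terms):
                hits.append(record)
                if len(hits) >= max_results:
                    break
    return hits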

@yarikoptic
Contributor

I guess no more of that bucket?

dandi@drogon:~$ aws s3 ls s3://astrowafflerp/

An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

FWIW -- I wanted to check the sizes; I could probably have picked up serving those from https://datasets.datalad.org/ or some other S3 bucket.
