R Package: Planning and task tracking #6
Comments
Hi!
Yes totally sounds like a cool idea!!
Hey everyone involved in this R package!
I have been a bit on and off during the day, but in the end I have tried doing some "brainstorming". Here are some of my thoughts:
That is what my brain managed to spew out today. I don't know where to go from here, but if people think any of this could make a good package, I can try to flesh it out a bit more before our next working session.
@AndersAskeland Amazing ideas! I am also totally keen on Idea 2!!
Excellent ideas! @AndersAskeland I set up a package repository so we have a place to store our thoughts besides this thread, for when our big-brain ideas grow too big for it :)
I'm commuting to Copenhagen (SDCC) and was hoping to join you guys on Discord today, but my internet connectivity here is horrible. I've been wanting to look at ways to enable a full end-to-end reproducible scientific workflow on "secure" offline servers, such as Statistics Denmark's. What I found most frustrating when writing my paper on Statistics Denmark's servers was having to type in the bibliography BY HAND! (You can't even copy-paste.) So, I've been looking at ways of creating an offline database of PubMed citations in a .bib-friendly format. For starters, I envision a minimalistic package that simply contains an attached object with citation strings for every paper ever published in a PubMed-indexed journal. Or maybe split by decade or year, if size becomes a problem. I imagine being able to search the citations with regex patterns. There's already the easyPubMed package that queries PubMed citations (https://cran.r-project.org/web/packages/easyPubMed/vignettes/getting_started_with_easyPubMed.html), but that won't work in an offline environment. I hope to create something similar, but with less functionality, as I don't intend the tool to be used to search the literature, just to make it easier to cite the papers you've already found. The first step would be to download the MEDLINE/PubMed citation library via their FTP, then process it to strip away unnecessary content, e.g. abstracts, and then save the remaining content in a file/format/object that can be embedded in an R package and read by R. Again, splitting the package into separate packages for each decade/year/field is an option to reduce the size of each package and increase user-level performance (less data to open/decompress = faster). What do you think? Sound like a
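A rough sketch of the regex search idea described above, using only base R. The `citations` object, its two made-up entries, and the `search_citations()` helper are all hypothetical; nothing here comes from an existing package.

```r
# Hypothetical stand-in for the attached citation object: a named character
# vector with one BibTeX-style string per paper (entries are made up).
citations <- c(
  doe2019 = "@article{doe2019, author = {Doe, Jane}, title = {Registry data and diabetes outcomes}, journal = {Example Journal}, year = {2019}}",
  smith2021 = "@article{smith2021, author = {Smith, John}, title = {Reproducible workflows on offline servers}, journal = {Example Journal}, year = {2021}}"
)

# Return every citation whose text matches a regular expression.
# Names (the citation keys) are preserved by the subsetting.
search_citations <- function(pattern, data = citations, ignore_case = TRUE) {
  data[grepl(pattern, data, ignore.case = ignore_case, perl = TRUE)]
}

# Find entries mentioning "offline" and list their keys.
hits <- search_citations("offline")
names(hits)
```

With a full PubMed-sized vector the same call would work unchanged, though lookup speed and memory use would need to be checked on real data.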
A great idea! One could perhaps use the PubMed API (https://pubmed.ncbi.nlm.nih.gov/download/) on install to download the data. That way you would not have to include data in any package files. An added benefit would be that the package could update the article database without the package itself having to be updated. However, there might be a bottleneck related to database lookup, as I think R might struggle to search large files directly (i.e. a large .bib file). I am unsure if R provides good tooling for databases, but I imagine one would need to store the data in some sort of relational database (SQL-ish) and generate smaller .bib files based on lookup. I think dbplyr could work, but I am unsure.
Yea, really cool idea. One difficulty would be that CRAN doesn't allow data files to be larger than 5 MB, so that puts a major limitation on uploading to CRAN and getting access via the server. I like the idea of having it as a GitHub-only package. What are DST's policies regarding that? Alternatively (and this isn't about making a package), you could write out the bib citation key in the Markdown file and when you download it from DST, knit it outside DST. 🤷
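A minimal sketch of the "look up, then generate a smaller .bib file" step, in base R only. The `records` data frame is a made-up stand-in for whatever database table would hold the citations, and `to_bibtex()` and `write_bib()` are hypothetical helper names; a real implementation might query a relational database instead.

```r
# Hypothetical lookup table standing in for a database of citations.
records <- data.frame(
  key     = c("doe2019", "smith2021"),
  author  = c("Doe, Jane", "Smith, John"),
  title   = c("Registry data and diabetes outcomes",
              "Reproducible workflows on offline servers"),
  journal = c("Example Journal", "Example Journal"),
  year    = c(2019L, 2021L),
  stringsAsFactors = FALSE
)

# Format each row as a BibTeX @article entry (vectorised over rows).
to_bibtex <- function(rec) {
  sprintf(
    "@article{%s,\n  author = {%s},\n  title = {%s},\n  journal = {%s},\n  year = {%d}\n}",
    rec$key, rec$author, rec$title, rec$journal, rec$year
  )
}

# Look up the requested keys and write only those entries to a .bib file.
write_bib <- function(keys, file, data = records) {
  selected <- data[data$key %in% keys, , drop = FALSE]
  writeLines(to_bibtex(selected), file)
  invisible(file)
}

bib_file <- tempfile(fileext = ".bib")
write_bib("smith2021", bib_file)
readLines(bib_file)
```

The point of the design is that only the small, project-specific .bib file is ever parsed by the citation tooling, while the large table is only filtered by key.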
Yeah, I just realized CRAN is even more restrictive on package size than I thought (5MB limit), so embedding the citations is off the table there. This leaves two options:
I don't know if DST can be asked to download the whole PubMed/MEDLINE library and put it on their network drive. In that case the package could simply be pointed at the local folder. I don't know if the purpose is too niche to justify such a solution. I'll try to look into it, at least.
I spent half a day reading up on XML files and starting a repo: I think it's doable, but the PubMed XML files are tricky. I'll see if I can put a few more hours into it; then it should be possible to at least create an R object with citations.
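As a hedged illustration of what turning PubMed XML into an R object might look like, the sketch below assumes the xml2 package is installed and uses a tiny made-up record that follows the general PubMed layout (`PubmedArticle`, `PMID`, `ArticleTitle`, `Abstract`). It is not taken from the repo mentioned above, and real files contain many more fields and edge cases.

```r
library(xml2)  # assumption: xml2 is available on the machine

# Tiny made-up record in the general shape of a PubMed XML file.
raw_xml <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract><AbstractText>Long abstract text to discard.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

doc <- read_xml(raw_xml)

# Drop the abstracts, which are not needed for citing a paper.
xml_remove(xml_find_all(doc, ".//Abstract"))

# Collect one row per article with the fields a citation needs.
articles <- xml_find_all(doc, ".//PubmedArticle")
citations <- data.frame(
  pmid    = xml_text(xml_find_first(articles, ".//PMID")),
  title   = xml_text(xml_find_first(articles, ".//ArticleTitle")),
  journal = xml_text(xml_find_first(articles, ".//Journal/Title")),
  stringsAsFactors = FALSE
)
citations
```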
General aim: Build one or more R packages that automate or streamline some basic setup, open science, reproducibility, general workflow, and organizational tasks.
Tasks to do (2022-01-17 session)
Before session
During session
Assign yourself to one of these tasks that you want to do/work on.
use_r('FILENAME'), create a function inside and add Roxygen documentation). Refer to the R Packages chapter on R code for more help.
vignettes/articles/reflections/YOURNAME.md file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.
Tasks to do (2022-02-21 session)
Assign yourself to one of these tasks that you want to do/work on.
vignettes/reflections/YOURNAME.md file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.
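For the 2022-01-17 task of creating a file with `use_r('FILENAME')` and adding a function with Roxygen documentation, a file might look roughly like the sketch below. The file name `R/greet.R` and the `greet()` function are hypothetical placeholders, not functions planned for the package.

```r
# Contents of a hypothetical R/greet.R, e.g. created with usethis::use_r("greet").

#' Greet a collaborator
#'
#' @param name A character vector of names.
#'
#' @return A character vector of greetings, one per name.
#' @export
#'
#' @examples
#' greet("Anders")
greet <- function(name) {
  stopifnot(is.character(name))
  paste0("Hello, ", name, "!")
}
```

Running `devtools::document()` afterwards should turn the `#'` comments into a help page and update the NAMESPACE file.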