
R Package: Planning and task tracking #6

Open
1 of 2 tasks
lwjohnst86 opened this issue Jan 10, 2022 · 11 comments

@lwjohnst86

lwjohnst86 commented Jan 10, 2022

General aim: Build an R package(s) that automates or streamlines some basic setup, open science, reproducibility, general workflow, and organizational tasks.

Tasks to do (2022-01-17 session)

Before session

  • Decide on whether to create a whole new package, contribute to other existing ones, or work on one I've been slowly adding to over the years (rostools).

During session

Assign yourself to one of these tasks that you want to do/work on.

  • (If new to making R Packages) Read up on the basics of developing R packages in the Whole Game chapter of the R Packages book.
  • Write up more detailed contributing guidelines for others working on the project (e.g. how and why to push and make pull requests, resources to use to learn more, etc.). Keep in mind what a beginner might need in order to help out. This document will be used as a template for later work.
  • Brainstorm and sketch out (paper and pen) a diagram of the research workflow and identify areas that this package could make easier (things that could be automated or simplified)
  • (Multiple people) Brainstorm some functions/tasks that the package will do.
  • Once some functions have been brainstormed, select one that interests you and add it to the package repository (with use_r('FILENAME'), create a function inside and add Roxygen documentation). Refer to the R Packages chapter on R code for more help.
  • (Everyone) Keep a log/journal of your experiences working on and collaborating in this project in the vignettes/articles/reflections/YOURNAME.md file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.
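As a pointer for the `use_r()` task above, the basic flow looks roughly like this (the file name, function name, and folder names are just hypothetical placeholders, not a decision about what the package will do):

```r
# Run inside the package project to create R/make_dirs.R:
usethis::use_r("make_dirs")

# Then, inside R/make_dirs.R, a minimal function with Roxygen documentation:

#' Create a standard project folder structure
#'
#' @param path Path to the project root.
#' @return The path, invisibly.
#' @export
make_dirs <- function(path = ".") {
  dirs <- file.path(path, c("data", "doc", "R"))
  lapply(dirs, dir.create, showWarnings = FALSE, recursive = TRUE)
  invisible(path)
}
```

Running `devtools::document()` afterwards generates the help files from the Roxygen comments.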

Tasks to do (2022-02-21 session)

Assign yourself to one of these tasks that you want to do/work on.

  • Write up more detailed contributing guidelines for the team working on this project (e.g. how and why to push and make pull requests, resources to use to learn more, etc), so we know how to collaborate well together. This document will be used as a template for later work.
  • Brainstorm a name for this new package: package#1
  • Brainstorm the functions of this package: package#2
  • (Everyone) Keep a log/journal of your experiences working on and collaborating in this project in the vignettes/reflections/YOURNAME.md file, so we can use these thoughts to add to and refine how we work together, to see what works and what could be improved.
@MaleneRevsbech

Hi!
In 'Brainstorm some functions/tasks that the package will do', would an English-American language converter be useful (a suggestion from a participant at the R course), or is that out of scope?
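For what it's worth, a minimal sketch of what such a converter could look like (the word list is a toy example; a real version would need a proper spelling dictionary and care with capitalisation):

```r
# Toy British -> American spelling converter.
to_american <- function(text) {
  lookup <- c(
    "colour"  = "color",
    "analyse" = "analyze",
    "centre"  = "center"
  )
  # \\b keeps replacements anchored to whole words.
  for (british in names(lookup)) {
    text <- gsub(paste0("\\b", british, "\\b"), lookup[[british]], text)
  }
  text
}

to_american("We analyse the colour data")
#> "We analyze the color data"
```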

@lwjohnst86

Yes, that totally sounds like a cool idea!!

@MaleneRevsbech

Hey everyone involved in this R package!
I just wanted to let you know I'm looking into making a function that can translate text from English to American and vice versa. If you have any input or suggestions, just comment here :)

@AndersAskeland

I have been a bit on and off during the day, but in the end I tried doing some brainstorming. Here are some of my thoughts:

  • Metapackage
    • Collections of functions related to reproducible research. Can both be already existing functions (devtools, usethis, progenitr) and our own functions.
    • Could provide streamlined documentation and a lower learning curve.
    • I am not sure if it is actually something people would use or prefer to use compared to original packages.
  • Package that provides guided data analysis setup (I like this idea the most)
    • Package to make it easier for beginners to perform reproducible data analysis
    • Something similar to swirl (interactive R lessons), but instead of providing lessons it guides the user through project setup and analysis.
    • Can either be done via an R terminal prompt (like swirl), via RStudio dialogs (possible via the rstudioapi package), or both.
    • Ask the user what they want to do (i.e. setup git, setup folder structure, generate report).
    • In addition to giving a bunch of options, also gives easy to follow documentation.
    • Quick example of how it could work:
      1. Do you want to use git with this project?
        • Yes -> Check if git is installed
          • Yes -> Setup repository and go to part 2
          • No -> Install git or give instructions how to install git
        • No -> Continue to part 2
      2. Do you want to upload the git repository to a remote location (GitHub/GitLab)?
        • Yes -> Do you want to create a new repository or use an existing?
          • New -> Create new and continue to part 3
          • Existing -> Use existing and go to part 3
        • No -> Continue to part 3
      3. Do you want to create a default folder structure?
        • Yes -> Choose which structure
        • No -> Continue to part 4
      4. Import data
      5. And so on...
  • Guided teaching materials for R course (similar to swirl)
  • Custom simplified error checker for teaching or beginners
    • During the R courses, and for R beginners generally, a lot of similar, easy-to-debug mistakes get committed. Usually most of these mistakes are simple to fix; however, since people are very unfamiliar with the rather confusing error messages, it is difficult for them to diagnose the problem themselves.
    • This package could take non-working functions as inputs and return a detailed description of the most likely error and an explanation of the actual error code.
  • Docker container used for teaching
    • Create a Linux-based Docker container for use in courses.
    • Could mitigate technical difficulties during the course.
    • However, I do not know how easy Docker is to install on work computers, or what kind of permissions it needs.
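The decision tree in the guided-setup idea could be prototyped with base R prompts, something along these lines (function name, choices, and folder names are all made up for the sketch; a polished version might use rstudioapi dialogs instead):

```r
# Minimal sketch of a guided, swirl-style project setup using base R prompts.
setup_project <- function() {
  use_git <- utils::menu(c("Yes", "No"), title = "Use git with this project?")
  if (use_git == 1) {
    if (nzchar(Sys.which("git"))) {
      system2("git", "init")  # git is available: initialise a repository
    } else {
      message("git is not installed; see https://git-scm.com for instructions.")
    }
  }
  layout <- utils::menu(
    c("data/doc/R layout", "None"),
    title = "Create a default folder structure?"
  )
  if (layout == 1) {
    lapply(c("data", "doc", "R"), dir.create, showWarnings = FALSE)
  }
  invisible(NULL)
}
```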

That is what my brain managed to spew out today. I don't know where to go from here, but if people think anything could make a good package, I can try to flesh it out a bit more before our next working session.

@MaleneRevsbech

@AndersAskeland Amazing ideas! I am also totally keen on Idea 2!!
Perhaps idea 3 and 4 could be put together? I also find 1 quite handy :)

@Aastedet

Excellent ideas! @AndersAskeland

I set up a package repository so we have a place to store our thoughts besides this thread, for when our big-brain ideas grow too big for it :)
It's also a place to keep our raw reflection logs, which can be found here: https://github.com/science-collective/package/tree/master/doc/reflections (do add yours too). Mine is kind of a brainstorm at the moment. After some cleanup they'll probably get moved to the vignettes section at some point (I couldn't find that section in any of the other repos, feel free to move it as needed @lwjohnst86).

@Aastedet

Aastedet commented Jun 8, 2022

I'm commuting to Copenhagen (SDCC) and was hoping to join you guys on Discord today, but my internet connectivity here is horrible.

I've been wanting to look at stuff to enable a full end-to-end reproducible scientific workflow on "secure" offline servers, such as Statistics Denmark.
For the most part, it's not too bad. They've downloaded and installed all of CRAN and Git, so all packages are available, and you can use version control. And they're quick to update packages or R/RStudio versions if you need them to.

The most frustrating thing when writing my paper on Statistics Denmark's servers was having to type in my bibliography BY HAND (you can't even copy-paste).
Nobody should be forced to do that - ever 😄

So, I've been looking at ways of creating an offline database of PubMed citations in a .bib-friendly format. For starters, I envision a minimalistic package that simply contains an attached object with citation strings for every paper ever published in a PubMed-indexed journal. Or maybe one per decade or year, if size becomes a problem. I imagine being able to search the citations with regex patterns.

There's already the easyPubMed package, which queries PubMed citations (https://cran.r-project.org/web/packages/easyPubMed/vignettes/getting_started_with_easyPubMed.html), but that won't work in an offline environment. I hope to create something similar, but with less functionality, as I don't intend the tool to be used to search the literature, just as a way to facilitate citing the papers you've already found.

The first step would be to download the MEDLINE/PubMed citation library via their FTP, then process it to scrape away unnecessary stuff, e.g. abstracts, and save the remaining content in a file/format/object that can be embedded in an R package and read by R.
The main issue is that it's a huge library, so CRAN is not going to like the size of the attached file, even when it has been scraped down to just the barebones citation.
Using a short-text-compression algorithm like Brotli (https://cran.r-project.org/web/packages/brotli/vignettes/benchmarks.html) to compress/decompress the file should reduce the size to roughly 20% of the initial size, with high decompression speed at the user level. Another way might be to process the citation strings into a data frame format (e.g. variables for title, author1, author2, authorN, journal, year, volume, doi, etc.), which R should compress even more efficiently (all variables will have a limited number of levels except for the title and DOI link). Should be a fun exercise in regexes.
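As a very rough illustration of the regex and compression steps (the citation format, field names, and split pattern are invented for the example; real MEDLINE records would need much more careful parsing):

```r
# Split a simple "Author. Title. Journal. Year" citation string into fields.
parse_citation <- function(x) {
  parts <- strsplit(x, "\\.\\s+")[[1]]
  data.frame(author = parts[1], title = parts[2],
             journal = parts[3], year = parts[4])
}

parse_citation("Smith J. A study of things. J Stuff. 2020")

# Compress/decompress the raw citation text with the brotli package:
raw_text <- charToRaw(
  paste(rep("Smith J. A study of things. J Stuff. 2020", 100), collapse = " ")
)
compressed <- brotli::brotli_compress(raw_text)
length(compressed) / length(raw_text)  # compression ratio, well below 1
stopifnot(identical(brotli::brotli_decompress(compressed), raw_text))
```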

Again, splitting the package into separate packages for each decade/year/field is an option to reduce the size of each package and increase user-level performance (less data to open/decompress = faster).
Failing that, that package could just reside outside of CRAN, e.g. on GitHub, and Statistics Denmark or other server managers can be asked to download it from there.

What do you think? Sound like a painful fun project, right? 😄

@AndersAskeland

AndersAskeland commented Jun 8, 2022

A great idea!

One could perhaps use the PubMed API (https://pubmed.ncbi.nlm.nih.gov/download/) on install to download the data. That way you would not have to include data in any package files. An added benefit would be that the package could update the article database without having to update the package.

However, there might be a bottleneck related to database lookup: I think R might struggle to search large files directly (i.e. a large .bib file). I am unsure if R provides good tooling for databases, but I imagine one would need to store the data in some sort of relational database (SQL-ish) and generate smaller .bib files based on lookups. I think dbplyr could work, but I am unsure.
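For the database side, something like DBI + RSQLite could hold the citations and serve keyword lookups without scanning a huge .bib file (the table, columns, and data here are made up for the sketch):

```r
library(DBI)

# In-memory SQLite database standing in for an on-disk citation store.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
citations <- data.frame(
  pmid  = c(1L, 2L),
  title = c("A study of things", "Another study"),
  year  = c(2020L, 2021L)
)
dbWriteTable(con, "citations", citations)

# Look up by title keyword; only matching rows come back into R:
dbGetQuery(con, "SELECT * FROM citations WHERE title LIKE '%things%'")

dbDisconnect(con)
```

An on-disk SQLite file could ship with (or be downloaded by) the package, and dbplyr would let the lookups be written in dplyr syntax instead of raw SQL.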

@lwjohnst86

Yea, really cool idea. One difficulty would be that CRAN doesn't allow data files to be larger than 5 MB, so that puts a major limitation on uploading to CRAN and getting access via the server. I like the idea of having it as a GitHub-only package. What are DST's policies regarding that?

Alternatively (and this isn't about making a package), you could write out the bib citation key in the Markdown file and when you download it from DST, knit it outside DST. 🤷

@Aastedet

Aastedet commented Jun 8, 2022

> A great idea!
>
> One could perhaps use the PubMed API (https://pubmed.ncbi.nlm.nih.gov/download/) on install to download the data. That way you would not have to include data in any package files. An added benefit would be that the package could update the article database without having to update the package.
>
> However, there might be a bottleneck related to database lookup: I think R might struggle to search large files directly (i.e. a large .bib file). I am unsure if R provides good tooling for databases, but I imagine one would need to store the data in some sort of relational database (SQL-ish) and generate smaller .bib files based on lookups. I think dbplyr could work, but I am unsure.

Yeah, I just realized CRAN is even more restrictive on package size than I thought (5MB limit), so embedding the citations is off the table there. This leaves two options:

  1. Have the package look up the citations. This is similar to the easyPubMed package, which queries the PubMed API. It might still provide some added benefits/performance compared to easyPubMed if we can make a more barebones solution.
  2. Download and clean the data, put the data/package on GitHub or some other hosting service, and ask DST to download it.

I don't know if DST can be asked to download the whole PubMed/MEDLINE library and put it on their network drive. In that case the package could just be directed to the local folder.
Alternatively, with the right hosting, maybe the two options can be combined, so the package looks up the barebones/cleaned citations (makes the package fit on CRAN), and the citations can also be downloaded and hosted locally and the package can be directed to look them up there (makes it accessible to offline environments). Downside is that the contents would need to be updated/downloaded again routinely.

Don't know if the purpose is too niche to justify a solution. I'll try to look into it, at least.

@Aastedet

Aastedet commented Nov 4, 2022

I spent half a day reading up on XML files and starting a repo:
https://github.com/Aastedet/pubmedciteR

I think it's doable, but the PubMed XML files are tricky. I'll see if I can put a few more hours into it; then it should be possible to at least create an R object with citations.
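For reference, pulling fields out of a PubMed-style record with the xml2 package looks roughly like this (the XML snippet below is a simplified, hand-written stand-in, not the real MEDLINE schema, which is far messier):

```r
library(xml2)

# Simplified stand-in for a MEDLINE citation record.
doc <- read_xml('
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <ArticleTitle>A study of things</ArticleTitle>
        <Journal><Title>J Stuff</Title></Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>')

# XPath queries pick out the fields we want to keep.
titles   <- xml_text(xml_find_all(doc, ".//ArticleTitle"))
journals <- xml_text(xml_find_all(doc, ".//Journal/Title"))
data.frame(title = titles, journal = journals)
```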
