Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert sim_fixed_n() to use data.table internally #114

Merged
merged 1 commit into from
Oct 11, 2023

Conversation

jdblischak
Copy link
Collaborator

This PR begins the migration to use data.table as the table backend (#111). I started by converting the function sim_fixed_n() (and all its downstream functions). This resulted in a ~5x speedup (#111 (comment))

While working on the data.table conversion, I also made a few other changes:

  • I added the packages stats and utils to the list of Imports. While these are typically always installed alongside the base package, it's best practice to list all the packages your package uses
  • I added the trigger workflow_dispatch to the R CMD check workflow, which allows you to manually trigger a pipeline from any branch. The workflow is setup to only run on the "main" branch, and I like to be able to test the changes I make on a feature branch before opening a PR
  • To ensure the changes I made are 100% backwards compatible, I created a test setup where if the environment variable SIMTRIAL_TEST_BACKWARDS_COMPATIBILITY_REF is defined, the tests will install that previous version of simtrial, generates results, and then tests that the current version is compatible. This avoids bloating the Git repo and the R package with many .rds files
  • For the functions that I converted, I made them all return data frames. This ensures the output is compatible with whatever the end user prefers (eg data frame, tibble, data.table). A more sophisticated option would be to capture the class of the input object and then restore that class before returning the object. I didn't do this since it would clutter up the code somewhat, and if no user explicitly requests this feature, it's probably not worth it

@nanxstats
Copy link
Collaborator

Nice. I fully appreciate the clean approach to compare to the dplyr version for backward compatibility and great context - I think the team will want to see and comment on this.

It would be useful to safeguard these internet-dependent tests using skip_if_offline() and skip_on_cran() because they can cause trouble in air-gapped environments and CRAN.

@nanxstats
Copy link
Collaborator

Great discussions today. I guess to ensure the accuracy of our data.table implementation, it's essential to compare the refactored code with the original dplyr implementation. However, there seems to be some potential long-term maintenance concerns if we compare to a specific commit of the original dplyr version, such as:

  • Inability to substantially update the new data.table API without causing test failures.
  • Risks of the old commit not running properly if dplyr changes its API.

So, I brainstormed two possible solutions to provide flexibility for minor future updates to the dplyr version if needed:

  1. Store the dplyr reference implementation in a separate branch: Create a branch, for example, dplyr-reference-implementation, store the dplyr version in this branch. Clearly document the purpose of the branch to avoid confusion. Clone and compare to this branch.

  2. Dual implementation tests: Maintain the dplyr implementation in the main branch but as live, unit testing code.

What do you think?

Copy link
Collaborator

@nanxstats nanxstats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion, the team decides that the plan is to remove the backward compatibility test after finishing the conversion, so we only need to maintain one implementation but not two.

All looking good me. Let's merge this!

@nanxstats nanxstats merged commit 0f06d96 into Merck:main Oct 11, 2023
7 checks passed
@jdblischak jdblischak deleted the data.table-pr branch October 23, 2023 19:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants