
Sfitz compress readcounts #203

Merged
merged 27 commits into from
Aug 30, 2023

Conversation

@sorelfitzgibbon (Contributor) commented Jul 17, 2023

Description

  • add compression of the SomaticSniper bam-readcount output using bzip2, and keep only the compressed file in the intermediate files
  • change MAF file compression to bzip2
  • fix up the log directory and file structure

Closes #199

Testing Results

nftest run a_mini_n2-all-tools-std-input

log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/log-nftest-20230830T205109Z.log
output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/a_mini_n2-all-tools-std-input

a_mini all tools with intermediate files saved

config: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/config/a_mini-all-tools-save-intermediates.config
yaml: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/yaml/a_mini_n2-std-input.yaml
log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates.log
output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request; I am listed already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.


"""
set -euo pipefail
gzip --stdout $readcount_file > ${readcount_file}.gz
Contributor

We generally want to recommend bzip2 based on Yash's recent benchmark. https://github.com/uclahs-cds/tool-archive-data/discussions/25 We would also want to encourage lab members to use this package more. @yashpatel6
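As a sketch of what the suggested switch might look like, mirroring the gzip line under review (the variable name is a stand-in for the Nextflow process variable; the exact module invocation may differ):

```shell
set -euo pipefail

# Stand-in for the Nextflow process variable of the same name
readcount_file="sample.readcount"
printf 'chr1\t100\tA\t42\n' > "$readcount_file"

# bzip2 --stdout streams compressed data and leaves the input untouched,
# matching the gzip --stdout usage being replaced
bzip2 --stdout "$readcount_file" > "${readcount_file}.bz2"

# Round-trip check: decompress to stdout
bzcat "${readcount_file}.bz2"
```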

Contributor

Generally agree on bzip2!

The package can be used, but it doesn't have the autobuild action yet, so its Docker image is out of date at the moment (it will be fixed soon, though)

Contributor Author

Converting this to a draft to wait until the bzip2 module is working

@sorelfitzgibbon sorelfitzgibbon changed the base branch from main to sfitz-update-readme August 12, 2023 04:21
@sorelfitzgibbon sorelfitzgibbon marked this pull request as ready for review August 12, 2023 04:29
@sorelfitzgibbon sorelfitzgibbon changed the base branch from sfitz-update-readme to main August 19, 2023 20:17
@yashpatel6 (Contributor) left a comment

Looks good! The release of the archive package should happen today or tomorrow so we can update the version before release

@yashpatel6 (Contributor) left a comment

Looks good! Anything to add @maotian06 or @tyamaguchi-ucla ?

@tyamaguchi-ucla (Contributor)

Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

@sorelfitzgibbon (Contributor Author)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

My understanding is this is temporary since it's only SomaticSniper's SNVs. Should we add a readcounts step for the final set of variants after intersection?

@yashpatel6 (Contributor)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

Correct, call-SRC doesn't yet support the readcounts file

@tyamaguchi-ucla (Contributor)

It looks like we kept the original and duplicated the file? @sorelfitzgibbon

Consider

  • removing the uncompressed intermediate file
  • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

@tyamaguchi-ucla (Contributor)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

> Correct, call-SRC doesn't yet support the readcounts file

Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?

@tyamaguchi-ucla (Contributor)

> It looks like we kept the original and duplicated the file? @sorelfitzgibbon
>
> Consider
>
>   • removing the uncompressed intermediate file
>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

By the way, this is to reduce storage costs. We can run less *.bz2 (or bzless *.bz2) to read the file, and it's specifically covered in the cluster training. We should avoid duplicating files in general.
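For instance, a .bz2 file can be read or searched without ever writing an uncompressed copy to disk (illustrative file name only, not a file from this pipeline):

```shell
set -euo pipefail

# Create and compress a tiny example file; plain bzip2 replaces the
# original with example.readcount.bz2
printf 'chr1\t100\tA\t42\nchr2\t200\tC\t17\n' > example.readcount
bzip2 example.readcount

# Read directly from the archive:
bzcat example.readcount.bz2          # stream to stdout (less/bzless page it interactively)
bzgrep 'chr2' example.readcount.bz2  # grep inside the archive
```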

@sorelfitzgibbon (Contributor Author) commented Aug 24, 2023

>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

The process name for this file would actually be compress_file_blarchive. Is that what you want, or should I hard-code generate_ReadCount_bam_readcount?

> It looks like we kept the original and duplicated the file? @sorelfitzgibbon
>
> Consider
>
>   • removing the uncompressed intermediate file
>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)
>
> By the way, this is to reduce storage costs. We can run less *.bz2 (or bzless *.bz2) to read the file, and it's specifically covered in the cluster training. We should avoid duplicating files in general.

But this is just in the intermediate files, which usually wouldn't be kept? I was thinking a user could be having trouble with a faulty bzip2 run and want to see that file. But I'm happy to remove it.

@tyamaguchi-ucla (Contributor)

> I was thinking a user could be having trouble with a faulty bzip2 run and want to see that file.

How about we add an instruction to the README? We encourage bzip2 compression in the cluster based on this benchmark: https://github.com/uclahs-cds/package-archive-data/discussions/25
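A possible shape for such a README note, addressing the "faulty bzip2 run" concern: the archive can be verified in place, and decompressed with --keep for inspection without deleting the .bz2 (illustrative file name only):

```shell
set -euo pipefail

# Set up an example archive
printf 'chr1\t100\tA\t42\n' > demo.readcount
bzip2 demo.readcount                        # leaves only demo.readcount.bz2

# --test reads and checks integrity without writing any output;
# a truncated or corrupt archive makes it exit non-zero
bzip2 --test demo.readcount.bz2 && echo "archive OK"

# --decompress --keep restores the original for inspection
# while keeping the .bz2 around
bzip2 --decompress --keep demo.readcount.bz2
```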

@yashpatel6 (Contributor)

I'm generally ok with never saving the uncompressed file; it's not currently used anywhere downstream, and the QC output will contain the compressed file anyway if necessary.

Comment on lines 18 to 20

include { compress_file_blarchive } from './common' addParams(
    blarchive_publishDir : "${params.workflow_output_dir}/QC/compress_file_blarchive"
    )
Contributor

Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process

Contributor Author

@tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?
@yashpatel6 I think I should copy the initOptions from external/pipeline-Nextflow-module/modules/common/index_VCF_tabix/functions.nf to be used with the local modules/common?

Contributor

> @tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?

I need confirmation from @yashpatel6 about this. I think he's looking into it now.

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

> Correct, call-SRC doesn't yet support the readcounts file

> Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?

Contributor

@sorelfitzgibbon @yashpatel6 Also, I don't think this PR is urgent. If no objections, we can create a release candidate and start running through a few datasets and see if we can make an official release.

Contributor

Regarding the readcount file, it should be ok to keep just the compressed output. We may need a different readcount file depending on how we run different samples between call-sSNV and call-SRC (readcount file is particularly useful for filling in read counts for variants that are in some samples but not in others since many SRC algorithms only accept variants that are found in all samples being reconstructed).

I'm also ok with updating the resource allocations (other PR), making a release candidate, and then reconsidering the readcounts file if necessary. At the moment, the raw-text readcounts file isn't being used downstream.

Contributor

I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

Contributor

> Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process

@sorelfitzgibbon This makes sense to me as well.

Contributor

> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes

Contributor

> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

> For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes

Then, I think the decision is to keep the compressed file under /intermediate. Any concerns? @sorelfitzgibbon @yashpatel6

Contributor

No concerns from me, we can keep the compressed file under /intermediate and not save the uncompressed file

@yashpatel6 (Contributor) left a comment

Looks good! The compressed file is now in the intermediates and a properly compressed bz2 file. Anything to add @tyamaguchi-ucla ?

CHANGELOG.md Outdated
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Add compression of `SomaticSniper` `bam-readcount` QC output
Contributor

No longer QC output

@tyamaguchi-ucla (Contributor) left a comment

Looks good to me!

@sorelfitzgibbon sorelfitzgibbon merged commit 1ac35bd into main Aug 30, 2023
1 check passed
@sorelfitzgibbon sorelfitzgibbon deleted the sfitz-compress-readcounts branch September 15, 2023 19:16
Development

Successfully merging this pull request may close these issues.

compress readcount file
3 participants