Sfitz compress readcounts #203
Conversation
module/somaticsniper-processes.nf
Outdated

```nextflow
"""
set -euo pipefail
gzip --stdout $readcount_file > ${readcount_file}.gz
```
We generally want to recommend bzip2 based on Yash's recent benchmark: https://github.com/uclahs-cds/tool-archive-data/discussions/25. We would also want to encourage lab members to use this package more. @yashpatel6
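For illustration, a minimal sketch of the bzip2 equivalent of the gzip line quoted above (the file name and contents are placeholders, not from the pipeline):

```shell
set -euo pipefail

# Placeholder standing in for $readcount_file
readcount_file=example.readcount
printf 'chr1\t100\tA\t42\n' > "$readcount_file"

# bzip2 analogue of the gzip call; --stdout writes the archive to
# standard output and leaves the original file in place
bzip2 --stdout "$readcount_file" > "${readcount_file}.bz2"

# Verify the archive is intact
bzip2 -t "${readcount_file}.bz2"
```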
Generally agree on bzip2!
The package can be used, but it doesn't have the autobuild action in it yet, so the Docker image is out of date at the moment (will be fixed soon though).
Converting this to draft to wait until the bzip2 module is working.
…call-sSNV into sfitz-compress-readcounts
Looks good! The release of the archive package should happen today or tomorrow so we can update the version before release
Looks good! Anything to add @maotian06 or @tyamaguchi-ucla ?
Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

My understanding is this is temporary since it's only SomaticSniper's SNVs. Should we add a

Correct, call-SRC doesn't yet support the readcounts file

It looks like we kept the original and duplicated the file? @sorelfitzgibbon Consider

Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?

By the way, this is to reduce storage costs. We can do

The process name for this file would actually be

But this is just in intermediate files, which usually wouldn't be kept? I was thinking a user could be having trouble with a faulty bzip2 run and want to see that file. But happy to remove it.

How about we add an instruction to the README? We encourage bzip2 compression on the cluster based on this benchmark: https://github.com/uclahs-cds/package-archive-data/discussions/25

I'm generally ok with never saving the uncompressed file; it's not currently used anywhere downstream and the QC will contain the compressed file anyway if necessary.
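As a side note on not keeping the uncompressed copy: by default bzip2 replaces its input file, so the raw file is only retained if `--keep` (or `--stdout`) is passed. A minimal sketch with a placeholder file:

```shell
set -euo pipefail

# Placeholder data standing in for the readcount output
printf 'placeholder readcount data\n' > readcounts.tsv

# Default behaviour: the input file is removed and replaced
# by readcounts.tsv.bz2
bzip2 readcounts.tsv
```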
module/somaticsniper.nf
Outdated

```nextflow
include { compress_file_blarchive } from './common' addParams(
    blarchive_publishDir : "${params.workflow_output_dir}/QC/compress_file_blarchive"
    )
```
Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process
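One possible shape for that rename, as a sketch only (the directory name `compress_readcount_file` is illustrative, not from this PR):

```nextflow
include { compress_file_blarchive } from './common' addParams(
    blarchive_publishDir : "${params.workflow_output_dir}/QC/compress_readcount_file"
    )
```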
@tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?
@yashpatel6 I think I should copy the initOptions from external/pipeline-Nextflow-module/modules/common/index_VCF_tabix/functions.nf to be used with the local modules/common?
> @tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?

I need confirmation from @yashpatel6 about this. I think he's looking into it now.

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

> Correct, call-SRC doesn't yet support the readcounts file

> Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?
@sorelfitzgibbon @yashpatel6 Also, I don't think this PR is urgent. If no objections, we can create a release candidate and start running through a few datasets and see if we can make an official release.
Regarding the readcount file, it should be ok to keep just the compressed output. We may need a different readcount file depending on how we run different samples between call-sSNV and call-SRC (readcount file is particularly useful for filling in read counts for variants that are in some samples but not in others since many SRC algorithms only accept variants that are found in all samples being reconstructed).
I'm also ok with updating the resource allocations (other PR), making a release candidate, and then reconsidering the readcounts file if necessary. At the moment, the raw text readcounts file isn't being used downstream.
I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6
> Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process

@sorelfitzgibbon This makes sense to me as well.
> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes.
> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

> For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes.

Then, I think the decision is to keep the compressed file under `/intermediate`. Any concerns? @sorelfitzgibbon @yashpatel6
No concerns from me, we can keep the compressed file under `/intermediate` and not save the uncompressed file.
Looks good! The compressed file is now in the intermediates and is a properly compressed bz2 file. Anything to add @tyamaguchi-ucla ?
CHANGELOG.md
Outdated
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0)

## [Unreleased]

### Added
- Add compression of `SomaticSniper` `bam-readcount` QC output
No longer QC output
Looks good to me!
Description

Compress `SomaticSniper` `bam-readcount` output using `bzip2` and only keep it in `intermediate`.

Closes #199
Testing Results

nftest run `a_mini_n2-all-tools-std-input`
- log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/log-nftest-20230830T205109Z.log
- output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/a_mini_n2-all-tools-std-input

a_mini all tools with intermediate files saved
- config: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/config/a_mini-all-tools-save-intermediates.config
- yaml: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/yaml/a_mini_n2-std-input.yaml
- log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates.log
- output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates
Checklist

- I have read the code review guidelines and the code review best practice on GitHub check-list.
- I have reviewed the Nextflow pipeline standards.
- The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
- I have set up or verified the branch protection rule following the GitHub standards before opening this pull request.
- I have added my name to the contributors listings in the `manifest` block in the `nextflow.config` as part of this pull request; I am listed already, or do not wish to be listed. (This acknowledgement is optional.)
- I have added the changes included in this pull request to the `CHANGELOG.md` under the next release version or unreleased, and updated the date.
- I have updated the version number in the `metadata.yaml` and `manifest` block of the `nextflow.config` file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
- I have tested the pipeline on at least one A-mini sample.