
Sfitz compress readcounts #203

Merged
merged 27 commits into from
Aug 30, 2023

Conversation

@sorelfitzgibbon (Contributor) commented Jul 17, 2023

Description

  • add compression of the SomaticSniper bam-readcount output using bzip2, and keep only the compressed file in the intermediate files
  • change MAF file compression to bzip2
  • fix up the log directory and file structure

Closes #199

Testing Results

nftest run a_mini_n2-all-tools-std-input

log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/log-nftest-20230830T205109Z.log
output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/a_mini_n2-all-tools-std-input

a_mini all tools with intermediate files saved

config: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/config/a_mini-all-tools-save-intermediates.config
yaml: /hot/code/sfitzgibbon/gitHub/uclahs-cds/pipeline-call-sSNV/test/yaml/a_mini_n2-std-input.yaml
log: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates.log
output: /hot/software/pipeline/pipeline-call-sSNV/Nextflow/development/unreleased/sfitz-compress-readcounts/amini-all-save-intermediates

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request; I am listed already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.


"""
set -euo pipefail
gzip --stdout $readcount_file > ${readcount_file}.gz
Contributor

We generally want to recommend bzip2 based on Yash's recent benchmark. https://github.com/uclahs-cds/tool-archive-data/discussions/25 We would also want to encourage lab members to use this package more. @yashpatel6
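As a sketch of what the suggested switch might look like, mirroring the gzip line under review (the variable name is a stand-in for the Nextflow process variable; the exact module invocation may differ):

```shell
set -euo pipefail

# Stand-in for the Nextflow process variable of the same name
readcount_file="sample.readcount"
printf 'chr1\t100\tA\t42\n' > "$readcount_file"

# bzip2 --stdout streams compressed data and leaves the input untouched,
# matching the gzip --stdout usage being replaced
bzip2 --stdout "$readcount_file" > "${readcount_file}.bz2"

# Round-trip check: decompress to stdout
bzcat "${readcount_file}.bz2"
```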

Contributor

Generally agree on bzip2!

The package can be used, but it doesn't have the autobuild action yet, so its Docker image is out of date at the moment (it will be fixed soon, though)

Contributor Author

Converting this to a draft to wait until the bzip2 module is working

@sorelfitzgibbon sorelfitzgibbon changed the base branch from main to sfitz-update-readme August 12, 2023 04:21
@sorelfitzgibbon sorelfitzgibbon marked this pull request as ready for review August 12, 2023 04:29
@sorelfitzgibbon sorelfitzgibbon changed the base branch from sfitz-update-readme to main August 19, 2023 20:17
@yashpatel6 (Contributor) left a comment

Looks good! The release of the archive package should happen today or tomorrow so we can update the version before release

@yashpatel6 (Contributor) left a comment

Looks good! Anything to add @maotian06 or @tyamaguchi-ucla ?

@tyamaguchi-ucla (Contributor)

Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

@sorelfitzgibbon (Contributor Author)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

My understanding is this is temporary since it's only SomaticSniper's SNVs. Should we add a readcounts step for the final set of variants after intersection?

@yashpatel6 (Contributor)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

Correct, call-SRC doesn't yet support the readcounts file

@tyamaguchi-ucla (Contributor)

It looks like we kept the original and duplicated the file? @sorelfitzgibbon

Consider

  • removing the uncompressed intermediate file
  • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

@tyamaguchi-ucla (Contributor)

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

> Correct, call-SRC doesn't yet support the readcounts file

Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?

@tyamaguchi-ucla (Contributor)

> It looks like we kept the original and duplicated the file? @sorelfitzgibbon
>
> Consider
>
>   • removing the uncompressed intermediate file
>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

By the way, this is to reduce storage costs. We can run less *.bz2 (or bzless *.bz2) to read the file, and it's specifically covered in the cluster training. We should avoid duplicating files in general.
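For instance, a .bz2 file can be read or searched without ever writing an uncompressed copy to disk (illustrative file name only, not a file from this pipeline):

```shell
set -euo pipefail

# Create and compress a tiny example file; plain bzip2 replaces the
# original with example.readcount.bz2
printf 'chr1\t100\tA\t42\nchr2\t200\tC\t17\n' > example.readcount
bzip2 example.readcount

# Read directly from the archive:
bzcat example.readcount.bz2          # stream to stdout (less/bzless page it interactively)
bzgrep 'chr2' example.readcount.bz2  # grep inside the archive
```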

@sorelfitzgibbon (Contributor Author) commented Aug 24, 2023

>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)

The process name for this file would actually be compress_file_blarchive. Is that what you want, or should I hard-code generate_ReadCount_bam_readcount?

> It looks like we kept the original and duplicated the file? @sorelfitzgibbon
>
> Consider
>
>   • removing the uncompressed intermediate file
>   • adding ${process-name} to the QC folder (e.g. QC/generate_ReadCount_bam_readcount/SomaticSniper-1.0.5.0_TWGSAMIN_TWGSAMIN000001-T001-S01-F.readcount.bz2)
>
> By the way, this is to reduce storage costs. We can run less *.bz2 (or bzless *.bz2) to read the file, and it's specifically covered in the cluster training. We should avoid duplicating files in general.

But this is just in the intermediate files, which usually wouldn't be kept? I was thinking a user could be having trouble with a faulty bzip2 run and want to see that file. But I'm happy to remove it.

@tyamaguchi-ucla (Contributor)

> I was thinking a user could be having trouble with a faulty bzip2 run and want to see that file.

How about we add an instruction to the README? We encourage bzip2 compression in the cluster based on this benchmark: https://github.com/uclahs-cds/package-archive-data/discussions/25
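A possible shape for such a README note, addressing the "faulty bzip2 run" concern: the archive can be verified in place, and decompressed with --keep for inspection without deleting the .bz2 (illustrative file name only):

```shell
set -euo pipefail

# Set up an example archive
printf 'chr1\t100\tA\t42\n' > demo.readcount
bzip2 demo.readcount                        # leaves only demo.readcount.bz2

# --test reads and checks integrity without writing any output;
# a truncated or corrupt archive makes it exit non-zero
bzip2 --test demo.readcount.bz2 && echo "archive OK"

# --decompress --keep restores the original for inspection
# while keeping the .bz2 around
bzip2 --decompress --keep demo.readcount.bz2
```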

@yashpatel6 (Contributor)

I'm generally ok with never saving the uncompressed file; it's not currently used anywhere downstream, and the QC output will contain the compressed file anyway if necessary.

Comment on lines 18 to 20

include { compress_file_blarchive } from './common' addParams(
    blarchive_publishDir : "${params.workflow_output_dir}/QC/compress_file_blarchive"
    )
Contributor

Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process

Contributor Author

@tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?
@yashpatel6 I think I should copy the initOptions from external/pipeline-Nextflow-module/modules/common/index_VCF_tabix/functions.nf to be used with the local modules/common?

Contributor

> @tyamaguchi-ucla were you suggesting only keeping the compressed readcount file as an intermediate now, since it's not being used elsewhere?

I need confirmation from @yashpatel6 about this. I think he's looking into it now.

> Going back to the original issue - #197 but if I recall, we are not using this readcount file in the SRC pipeline, right? @yashpatel6

> Correct, call-SRC doesn't yet support the readcounts file

> Got it, but would it make sense to support this readcount file, or do we need to generate a different readcount file using another algorithm, for example?

Contributor

@sorelfitzgibbon @yashpatel6 Also, I don't think this PR is urgent. If no objections, we can create a release candidate and start running through a few datasets and see if we can make an official release.

Contributor

Regarding the readcount file, it should be ok to keep just the compressed output. We may need a different readcount file depending on how we run different samples between call-sSNV and call-SRC (readcount file is particularly useful for filling in read counts for variants that are in some samples but not in others since many SRC algorithms only accept variants that are found in all samples being reconstructed).

I'm also ok with updating the resource allocations (other PR), making a release candidate, and then reconsidering the readcounts file if necessary. At the moment, the raw-text readcounts file isn't being used downstream.

Contributor

I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

Contributor

> Minor point but if we're hard-coding the directory here anyways, it might make more sense to name it to indicate the readcount process rather than the generic compression process

@sorelfitzgibbon This makes sense to me as well.

Contributor

> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes

Contributor

> I looked into this a bit, but bam-readcount v1.0.1 is available and I'm not sure if v1.0.1 is compatible with SomaticSniper. For call-SRC, we'll probably want to use the latest version, since v1.0.0 had some changes: https://github.com/genome/bam-readcount/releases/tag/v1.0.0. @yashpatel6

> For call-SRC, we probably do want to use the latest version; it'll require a little bit of exploring first to figure out the changes

Then, I think the decision is to keep the compressed file under /intermediate. Any concerns? @sorelfitzgibbon @yashpatel6

Contributor

No concerns from me, we can keep the compressed file under /intermediate and not save the uncompressed file

@yashpatel6 (Contributor) left a comment

Looks good! The compressed file is now in the intermediates and a properly compressed bz2 file. Anything to add @tyamaguchi-ucla ?

CHANGELOG.md Outdated
@@ -7,6 +7,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- Add compression of `SomaticSniper` `bam-readcount` QC output
Contributor

No longer QC output

@tyamaguchi-ucla (Contributor) left a comment

Looks good to me!

@sorelfitzgibbon sorelfitzgibbon merged commit 1ac35bd into main Aug 30, 2023
1 check passed
@sorelfitzgibbon sorelfitzgibbon deleted the sfitz-compress-readcounts branch September 15, 2023 19:16
Development

Successfully merging this pull request may close these issues.

compress readcount file
3 participants