WXS/Targeted - intersect_regions param for call-sSNV in template.config #183

Open
Faizal-Eeman opened this issue Mar 25, 2024 · 13 comments

@Faizal-Eeman
Contributor

For non-WGS samples (WXS or targeted), it is unclear whether the call-sSNV param intersect_regions should be changed to the sample's target BED file, since the default is set to Homo_sapiens_assembly38_no-decoy.bed.gz.
A comment in this section would be good guidance, telling users to change the default path to their respective BED file.

call_sSNV {
algorithm = ['somaticsniper', 'strelka2', 'mutect2', 'muse']
reference = '/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta'
exome = false
intersect_regions = '/hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz'
germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'
dbSNP = '/hot/ref/database/dbSNP-155/original/GRCh38/GCF_000001405.39.gz'
ncbi_build = 'GRCh38'
}

@tyamaguchi-ucla
Contributor

I think it's worth adding a section about WXS/targeted-seq runs to the README as well?

@sorelfitzgibbon
Contributor

sorelfitzgibbon commented Mar 25, 2024

@Faizal-Eeman and I think it's worth considering having an intersect_regions input parameter for metapipeline, which then passes these regions appropriately to each sub-pipeline. The user would either provide their targeted regions or the set of WGS contigs minus decoy regions.

@Faizal-Eeman
Contributor Author

I second @sorelfitzgibbon: it's better to have intersect_regions as a single metapipeline parameter than to specify it under each pipeline section.

@Alfredo-Enrique
Member

Alfredo-Enrique commented Mar 25, 2024

For non-WGS samples, what is the set of all pipelines that use intersect regions?

  • recalibrate-BAM
  • call-gSNP
  • call-sSNV
  • targeted-coverage

I think that's it, right?

@yashpatel6
Collaborator

I disagree with having a single parameter to handle intervals. Intervals serve different functions across the pipelines that use them, and the absence of intervals means different things in each. Because of that, users should understand what the intervals mean for each individual pipeline and decide on adding that parameter per pipeline. Additionally, a single set of intervals doesn't work because the defaults in WGS mode differ across pipelines, so the metapipeline would need extra logic that really belongs within the individual pipelines.

I agree that the documentation needs to be updated (and it is in the process of being updated) to provide guidelines for the different metapipeline run modes (WGS, WXS, targeted, single sample, paired sample, multi sample).

@sorelfitzgibbon
Contributor

sorelfitzgibbon commented Mar 26, 2024

@yashpatel6 can you point to a specific issue with using global intervals? From what I can discern, one set of global intervals would be good in the majority of cases (certainly for WGS). There could be specific pipeline override intervals options, if needed, for occasional cases where e.g. someone wants to calculate coverage on the global interval regions but call variants on larger regions. But would global regions not be good even for most exome runs?

Here's what I see for the set of pipelines used:

  • convert_BAM2FASTQ
    • no intervals used/needed
  • align_DNA
    • no intervals used/needed
  • recalibrate_BAM
    • no intervals used/needed
  • calculate_targeted_coverage
    • target_bed = 'path/to/target/bedfile' //required
      • could be global intervals?
    • bait_bed = '' //optional, path/to/bait/bedfile
  • generate_SQC_BAM
    • intervals still need to be implemented, especially for collectWgsMetrics (coverage)
      • could be global intervals
  • call_gSNP
    • pipeline takes intervals, but metapipeline doesn't seem to be using them?
      • could be global intervals
  • call_sSNV
    • intersect_regions = '/hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz'
    • could be global intervals
  • call_mtSNV
    • no intervals used/needed
  • call_gSV
    • inappropriate for exomes?
  • call_sSV
    • inappropriate for exomes?
  • call_sCNA
    • inappropriate for exomes
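
The proposal above (one global interval set, with optional per-pipeline overrides) can be sketched roughly as follows. This is only an illustration of the precedence being suggested, not metapipeline code: `resolve_intervals`, `NO_INTERVALS`, and the `overrides` mapping are hypothetical names, and the set of pipelines that take no intervals is copied from the uncontested entries in the list.

```python
# Hypothetical sketch: pipelines listed above as "no intervals used/needed".
NO_INTERVALS = {"convert_BAM2FASTQ", "align_DNA", "call_mtSNV"}


def resolve_intervals(pipeline, global_intervals=None, overrides=None):
    """Pick the interval file for one pipeline.

    Precedence: per-pipeline override, then the global interval set.
    Returning None means "pass nothing and let the pipeline use its own default".
    """
    if pipeline in NO_INTERVALS:
        return None
    overrides = overrides or {}
    if pipeline in overrides:
        return overrides[pipeline]
    return global_intervals
```

With this shape, a user who wants coverage on the global regions but variant calls on a larger region would pass something like `overrides={"call_sSNV": "larger_regions.bed"}` and leave every other pipeline on the global set.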

@tyamaguchi-ucla
Contributor

tyamaguchi-ucla commented Mar 26, 2024

@sorelfitzgibbon @yashpatel6

convert-BAM2FASTQ - we could extract reads using an interval BED but generally not ideal (losing off target reads).

call-mtSNV - Ideally, the pipeline should look for chrM (or equivalent) in the interval BED and if True, we should run the pipeline as default. If False, I think we can skip this pipeline unless users want to look for off-target mt-reads.

call-sCNA - FACETS can call CNAs for exomes
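
The call-mtSNV idea above (look for chrM, or an equivalent name, in the interval BED) could be checked along these lines. This is a minimal sketch under stated assumptions: the function names and the list of mitochondrial contig aliases are illustrative and are not taken from any of the pipelines.

```python
import gzip
from pathlib import Path


def bed_contigs(bed_path):
    """Return the set of contig names found in the first column of a BED file."""
    path = Path(bed_path)
    # Plain and gzip-compressed BED files are both common.
    opener = gzip.open if path.suffix == ".gz" else open
    contigs = set()
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines and optional BED header lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            contigs.add(line.split()[0])
    return contigs


def targets_mitochondria(bed_path, mt_names=("chrM", "MT", "chrMT", "M")):
    """True if any interval in the BED lies on a mitochondrial contig.

    The alias list is an assumption; adjust it to the reference in use.
    """
    return not bed_contigs(bed_path).isdisjoint(mt_names)
```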

@yashpatel6
Collaborator

> @yashpatel6 can you point to a specific issue with using global intervals? From what I can discern, one set of global intervals would be good in the majority of cases (certainly for WGS). There could be specific pipeline override intervals options, if needed, for occasional cases where e.g. someone wants to calculate coverage on the global interval regions but call variants on larger regions. But would global regions not be good even for most exome runs?

One example: WGS mode for call-sSNV vs. for call-gSNP and recalibrate-BAM - call-sSNV recommends that you use non-decoy intervals for WGS, the others require the intervals to be left empty for default WGS mode. This ends up creating a need for more than one parameter to handle these if the metapipeline were to handle them.
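
To make the mismatch concrete, here is a rough sketch of the per-pipeline logic a metapipeline-level parameter would have to carry. The WGS defaults follow the example above (non-decoy BED for call-sSNV, empty for call-gSNP and recalibrate-BAM); the names `WGS_DEFAULTS` and `intervals_for`, and the assumption that a single target BED is passed to every pipeline in WXS/targeted mode, are hypothetical.

```python
# Default from the template.config excerpt at the top of this issue.
NON_DECOY_BED = (
    "/hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/"
    "GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz"
)

# WGS mode: call-sSNV expects the non-decoy BED, the others expect no intervals.
WGS_DEFAULTS = {
    "call_sSNV": NON_DECOY_BED,
    "call_gSNP": "",
    "recalibrate_BAM": "",
}


def intervals_for(pipeline, target_bed=""):
    """Return the intervals to pass to one pipeline.

    An empty target_bed is treated as WGS mode (assumption); otherwise the
    same target BED is handed to every pipeline.
    """
    if target_bed:
        return target_bed
    return WGS_DEFAULTS[pipeline]
```

The lookup table is the "different logic" in question: with one parameter and no table, WGS mode cannot produce both the non-decoy BED and the empty value.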

> * recalibrate_BAM

recalibrate-BAM does accept/use intervals

> * pipeline takes intervals, but metapipeline doesn't seem to be using them?

The pipeline does take intervals and intervals passed to the metapipeline are passed to the pipeline when given

> * could be global intervals

Not exactly, see example above for the issue in WGS mode

> @sorelfitzgibbon @yashpatel6
>
> convert-BAM2FASTQ - we could extract reads using an interval BED but generally not ideal (losing off target reads).

Agreed, the BAM2FASTQ pipeline isn't intended to work on intervals and I don't think it should either

> call-mtSNV - Ideally, the pipeline should look for chrM (or equivalent) in the interval BED and if True, we should run the pipeline as default. If False, I think we can skip this pipeline unless users want to look for off-target mt-reads.

If this type of logic were implemented, it should happen within call-mtSNV. I disagree with using something like intervals at the metapipeline level to forcefully run a pipeline that wasn't requested: that convolutes the whole pipeline selection process and adds a lot more confusion for end users about pipelines automatically running when they weren't requested.

> call-sCNA - FACETS can call CNAs for exomes

It can but intervals aren't required for FACETS unless I'm mistaken

@tyamaguchi-ucla
Contributor

> If this type of logic would be implemented, it should happen within call-mtSNV;

Yup, I totally agree on this.

@sorelfitzgibbon
Contributor

I believe removing the decoy regions would only be a plus for WGS call-gSNP and BQSR

@Faizal-Eeman
Contributor Author

> > call-sCNA - FACETS can call CNAs for exomes
>
> It can but intervals aren't required for FACETS unless I'm mistaken

FACETS in call-sCNA v3.1.0 isn't set up for exomes but the tool does have an option to input target BED intervals. Something to add at call-sCNA level.

@yashpatel6
Collaborator

> I believe removing the decoy regions would only be a plus for WGS call-gSNP and BQSR

That may be, though I don't think removing decoy regions has actually been assessed in the context of BQSR/call-gSNP. Also, on a conceptual level, even if a read is mapped onto a decoy contig, there's no reason to exclude it from base quality score recalibration, since base quality relates to the actual sequencing quality rather than to mapping.

@yashpatel6
Collaborator

After some discussion, there will be more options in the metapipeline to control exome/targeted behavior, so I'll look into adding a dedicated params section for it. We'll still want to handle a couple of things before that: matching up WGS behavior between call-sSNV and the other pipelines (likely with an interval extraction process that automatically removes decoy contigs), and checking recalibrate-BAM/call-gSNP for any potential effects of removing decoy contigs.
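
An interval extraction step that drops decoy contigs could look something like the sketch below. It is only an illustration, not the planned implementation: `drop_decoy_intervals` is a hypothetical name, and identifying decoy contigs by a `_decoy` suffix is an assumption about the reference's contig naming that should be checked against the reference actually in use.

```python
def drop_decoy_intervals(bed_lines, is_decoy=lambda contig: contig.endswith("_decoy")):
    """Return the BED lines whose contig is not a decoy.

    Header, comment, and blank lines are passed through unchanged.
    `is_decoy` is a predicate on the contig name; the default suffix test
    is an assumption, not a rule taken from any of the pipelines.
    """
    kept = []
    for line in bed_lines:
        fields = line.split()
        if fields and not line.startswith("#") and is_decoy(fields[0]):
            continue  # drop intervals on decoy contigs
        kept.append(line)
    return kept
```

Passing a different predicate (for example one that also matches other non-primary contigs) changes what gets removed without touching the filtering loop.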
