Commit
Merge pull request #225 from uclahs-cds/sfitz-update-resource-allocation
Sfitz update resource allocation
sorelfitzgibbon authored Oct 23, 2023
2 parents 51e870d + a39c887 commit ad29da1
Showing 12 changed files with 198 additions and 20 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -31,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Update plot-venn.R to work with all numbers of algorithms greater than two

### Added
- Custom resource allocation updates through configuration parameters
- Add assertions to `nftest`
- Add compression of `SomaticSniper` `bam-readcount` output and move to `intermediate` directory
- Add `ncbi_build` parameter
75 changes: 65 additions & 10 deletions README.md
@@ -182,18 +182,67 @@ input:
### input.config ([see template](config/template.config))
| Input | Required | Type | Description |
|--------|---|--------|-------------------------------------------|
| algorithm | yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
| reference | yes | string | The reference .fa file (.fai and .dict file must exist in same directory) |
| intersect_regions* | yes | string | A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended * |
| output_dir | yes | string | The location where outputs will be saved |
| dataset_id | yes | string | The name/ID of the dataset |
| exome | yes | boolean | The option will be used by `Strelka2` and `MuSE`. When `true`, it will add the `--exome` option to Manta and Strelka2, and `-E` option to MuSE |
| save_intermediate_files | yes | boolean | Whether to save intermediate files |
| work_dir | no | string | The path of working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively |
| docker_container_registry | no | string | Registry containing tool Docker images, optional. Default: `ghcr.io/uclahs-cds` |
| `algorithm` | yes | list | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
| `reference` | yes | string | The reference .fa file (.fai and .dict file must exist in same directory) |
| `intersect_regions`* | yes | string | A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended * |
| `output_dir` | yes | string | The location where outputs will be saved |
| `dataset_id` | yes | string | The name/ID of the dataset |
| `exome` | yes | boolean | The option will be used by `Strelka2` and `MuSE`. When `true`, it will add the `--exome` option to Manta and Strelka2, and `-E` option to MuSE |
| `save_intermediate_files` | yes | boolean | Whether to save intermediate files |
| `work_dir` | no | string | The path of working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively |
| `docker_container_registry` | no | string | Registry containing tool Docker images, optional. Default: `ghcr.io/uclahs-cds` |
| `base_resource_update` | no | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. |

*Providing `intersect_regions` is required and will limit the final output to just those regions. All regions of the reference genome could be provided as a `bed` file with all contigs; however, it is HIGHLY recommended to remove `decoy` contigs from the human reference genome. Including these thousands of small contigs will require the user to increase available memory for `Mutect2` and will cause a very long runtime for `Strelka2`. See the [discussion here](https://github.com/uclahs-cds/pipeline-call-sSNV/discussions/216). A GRCh38 `bed.gz` file can be found here: `/hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz`. For other genome versions, you may be able to use [UCSC Liftover](https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert.

### Base resource allocation updaters
To optionally update the base resource (`cpus` or `memory`) allocations for processes, add a block with the following structure to the [input.config](config/template.config) file. The default allocations can be found in the [node-specific config files](./config/).

```Nextflow
base_resource_update {
memory = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
cpus = [
[['process_name', 'process_name2'], <multiplier for resource>],
[['process_name3', 'process_name4'], <different multiplier for resource>]
]
}
```
> **Note** Resource updates are applied in the order they are provided, so a process that appears twice in the `memory` list is updated twice, in the order given.

Examples:

- To double memory of all processes:
```Nextflow
base_resource_update {
memory = [
[[], 2]
]
}
```
- To double memory for `call_sSNV_Mutect2` and triple memory for `run_validate_PipeVal` and `run_sump_MuSE`:
```Nextflow
base_resource_update {
memory = [
['call_sSNV_Mutect2', 2],
[['run_validate_PipeVal', 'run_sump_MuSE'], 3]
]
}
```
- To double CPUs and memory for `run_sump_MuSE` and double memory for `run_validate_PipeVal`:
```Nextflow
base_resource_update {
cpus = [
['run_sump_MuSE', 2]
]
memory = [
[['run_sump_MuSE', 'run_validate_PipeVal'], 2]
]
}
```
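
As a further sketch of the ordering note above: assuming each update multiplies whatever allocation the previous update left behind (`methods.update_base_resource_allocation` is not shown in this diff, so the compounding behaviour is an assumption), the following would leave `call_sSNV_Mutect2` with four times its base memory and every other process with twice its base memory:

```Nextflow
base_resource_update {
    memory = [
        // Applied first: an empty process list targets all processes,
        // as in the "double memory of all processes" example
        [[], 2],
        // Applied second: assumed to multiply the already-doubled value again
        ['call_sSNV_Mutect2', 2]
    ]
}
```

If only the larger multiplier is wanted for one process, listing that process once with the intended final multiplier avoids relying on the order of application.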

#### Module Specific Configuration
| Input | Required | Type | Description |
|-------------|----|--------|-------------------------------------------|
@@ -213,7 +262,13 @@ input:
#### MuSE Specific Configuration
| Input | Required | Type | Description |
|-------------|----|--------|-------------------------------------------|
| dbSNP | yes | path | The path to dbSNP database's `*.vcf.gz` |
| dbSNP | yes | path | The path to [NCBI's dbSNP database](https://www.ncbi.nlm.nih.gov/snp/) of known SNPs in VCF format, e.g. `GCF_000001405.40.gz` |

#### Variant Intersection Specific Configuration
| Input | Required | Type | Description |
|-------------|----|--------|-------------------------------------------|
| ncbi_build | yes | string | vcf2maf requires the reference genome build ID, e.g. GRCh38 |
| vcf2maf_extra_args | no | string | Additional arguments for the vcf2maf command |

## Outputs
| Tool Outputs | Type | Description |
76 changes: 74 additions & 2 deletions config/custom_schema_types.config
@@ -3,7 +3,12 @@
*/
custom_schema_types {
allowed_sample_types = [
'tumor', 'normal'
'normal',
'tumor'
]
allowed_resource_types = [
'memory',
'cpus'
]

/**
@@ -16,6 +21,14 @@ custom_schema_types {
}
}
}

/**
* Check if input is a String or GString
*/
is_string = { val ->
return (val in String || val in GString)
}

/**
* Check if given input is a Namespace
*/
@@ -24,6 +37,7 @@ custom_schema_types {
throw new Exception("${name} should be a Namespace, not ${val.getClass()}.")
}
}

/**
* Check if given input is a list
*/
@@ -32,6 +46,33 @@ custom_schema_types {
throw new Exception("${name} should be a List, not ${val.getClass()}.")
}
}

/**
* Check if given input is a number
*/

check_if_number = { val, String name ->
if (!(val in Integer || val in Float)) {
throw new Exception("${name} should be an Integer or Float, not ${val.getClass()}")
}
}
/**
* Check if given input is valid process list
*/
check_if_process_list = { val, String name ->
if (custom_schema_types.is_string(val)) {
if (val.isEmpty()) {
throw new Exception("Empty string specified for ${name}. Please provide valid input.")
}
} else {
try {
custom_schema_types.check_if_list(val, name)
} catch(Exception e) {
throw new Exception("${name} should be either a string or a list. Please provide valid input.")
}
}
}

/**
* Check that input is namespace of expected types
*/
@@ -48,6 +89,24 @@ custom_schema_types {
}
}

/**
* Check namespace for resource updates
*/
check_resource_update_namespace = { Map options, String name, Map properties ->
custom_schema_types.check_if_namespace(options[name], name)
def given_keys = options[name].keySet() as ArrayList
if (given_keys.size() <= 0) {
return
}
custom_schema_types.check_sample_type_keys(given_keys, name, custom_schema_types.allowed_resource_types)

options[name].each { entry ->
def entry_as_map = [:]
entry_as_map[entry.key] = entry.value
schema.validate_parameter(entry_as_map, entry.key, properties.elements[entry.key])
}
}

/**
* Check if proper BAM entry list
*/
@@ -61,8 +120,21 @@ custom_schema_types {
}
}

/**
* Check list of resource updates
*/
check_resource_update_list = { Map options, String name, Map properties ->
custom_schema_types.check_if_list(options[name], name)
for (item in options[name]) {
custom_schema_types.check_if_process_list(item[0], name)
custom_schema_types.check_if_number(item[1], name)
}
}

types = [
'InputNamespace': custom_schema_types.check_input_namespace,
'BAMEntryList': custom_schema_types.check_bam_list
'BAMEntryList': custom_schema_types.check_bam_list,
'ResourceUpdateNamespace': custom_schema_types.check_resource_update_namespace,
'ResourceUpdateList': custom_schema_types.check_resource_update_list
]
}
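
Read together, the checks in this file appear to accept or reject `base_resource_update` entries roughly as sketched below. This is an illustration only: `check_sample_type_keys` and `schema.validate_parameter` are not shown in this diff, so the comment on the `disk` case is an assumption, and whether a `Float` multiplier is handled sensibly downstream is not confirmed here.

```Nextflow
base_resource_update {
    memory = [
        ['call_sSNV_Mutect2', 2],        // passes: non-empty string, Integer multiplier
        [['run_sump_MuSE'], 1.5f]        // passes check_if_number: explicit Float literal
        // ['call_sSNV_Mutect2', '2']    // fails check_if_number: multiplier is a String
        // ['', 2]                       // fails check_if_process_list: empty process name
    ]
    // disk = [[[], 2]]                  // presumably rejected: 'disk' is not in allowed_resource_types
}
```

One detail worth checking before relying on fractional multipliers: a plain decimal literal such as `1.5` is a `BigDecimal` in Groovy, which `check_if_number` as written (Integer or Float only) would not accept. The `1.5f` suffix above is one possible workaround, assuming the config is parsed as ordinary Groovy.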
13 changes: 13 additions & 0 deletions config/methods.config
@@ -168,12 +168,25 @@ methods {
}
}
}
modify_base_allocations = {
if (!(params.containsKey('base_resource_update') && params.base_resource_update)) {
return
}

params.base_resource_update.each { resource, updates ->
updates.each { processes, multiplier ->
def processes_to_update = (custom_schema_types.is_string(processes)) ? [processes] : processes
methods.update_base_resource_allocation(resource, multiplier, processes_to_update)
}
}
}

setup = {
schema.load_custom_types("${projectDir}/config/custom_schema_types.config")
schema.validate()
methods.set_process()
methods.set_resources_allocation()
methods.modify_base_allocations()
retry.setup_retry()
methods.set_env()
methods.set_sample_params()
13 changes: 13 additions & 0 deletions config/schema.yaml
@@ -96,6 +96,19 @@ patient_id:
type: 'String'
required: true
help: 'Patient identifier'
base_resource_update:
type: 'ResourceUpdateNamespace'
required: false
help: 'User-defined modifications for adjusting base resource allocations for processes'
elements:
memory:
type: 'ResourceUpdateList'
required: false
help: 'List of memory updates'
cpus:
type: 'ResourceUpdateList'
required: false
help: 'List of CPU updates'
input:
type: 'InputNamespace'
required: true
12 changes: 8 additions & 4 deletions config/template.config
@@ -17,16 +17,16 @@ params {
output_dir = ''
work_dir = ''
dataset_id = ''
// set params.exome to TRUE will add the '--exome' option when running manta and strelka2
// set params.exome to TRUE will add the '--exome' option when running Manta and Strelka2
// set params.exome to TRUE will add the '-E' option when running MuSE
exome = false
save_intermediate_files = false

// module options
// Module specific options
bgzip_extra_args = ''
tabix_extra_args = ''

// mutect2 options
// Mutect2 options
split_intervals_extra_args = ''
mutect2_extra_args = ''
filter_mutect_calls_extra_args = ''
@@ -37,9 +37,13 @@ params {
// MuSE options
dbSNP = '/hot/ref/database/dbSNP-155/original/GRCh38/GCF_000001405.39.gz'

// Intersect options
// Variant Intersection options
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
5 changes: 5 additions & 0 deletions test/config/a_mini-all-tools.config
@@ -37,5 +37,10 @@ params {
// Intersect options
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
5 changes: 4 additions & 1 deletion test/config/a_mini-muse.config
@@ -29,7 +29,7 @@ params {
mutect2_extra_args = ''
filter_mutect_calls_extra_args = ''
gatk_command_mem_diff = 500.MB
scatter_count = 12
scatter_count = 50
germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'

// MuSE options
@@ -39,6 +39,9 @@ params {
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
4 changes: 4 additions & 0 deletions test/config/a_mini-mutect2.config
@@ -38,6 +38,10 @@ params {
// Intersect options
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
6 changes: 5 additions & 1 deletion test/config/a_mini-somaticsniper.config
@@ -29,7 +29,7 @@ params {
mutect2_extra_args = ''
filter_mutect_calls_extra_args = ''
gatk_command_mem_diff = 500.MB
scatter_count = 12
scatter_count = 50
germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'

// MuSE options
@@ -38,6 +38,10 @@ params {
// Intersect options
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
6 changes: 5 additions & 1 deletion test/config/a_mini-strelka2.config
@@ -29,7 +29,7 @@ params {
mutect2_extra_args = ''
filter_mutect_calls_extra_args = ''
gatk_command_mem_diff = 500.MB
scatter_count = 12
scatter_count = 50
germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'

// MuSE options
@@ -38,6 +38,10 @@ params {
// Intersect options
ncbi_build = 'GRCh38'
vcf2maf_extra_args = ''

// Base resource allocation updater
// See README for adding parameters to update the base resource allocations
}

// Setup the pipeline config. DO NOT REMOVE THIS LINE!
methods.setup()
2 changes: 1 addition & 1 deletion test/config/a_mini-two-tools.config
@@ -28,7 +28,7 @@ params {
mutect2_extra_args = ''
filter_mutect_calls_extra_args = ''
gatk_command_mem_diff = 500.MB
scatter_count = 12
scatter_count = 50
germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'

// MuSE options