Merge pull request #225 from uclahs-cds/sfitz-update-resource-allocation

Sfitz update resource allocation
uclahs-cds · Oct 23, 2023 · commit ad29da1
2 parents: 51e870d + a39c887

Showing 12 changed files with 198 additions and 20 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -31,6 +31,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Update plot-venn.R to work with all numbers of algorithms greater than two
 
 ### Added
+- Add custom resource allocation updates through configuration parameters
 - Add assertions to `nftest`
 - Add compression of `SomaticSniper` `bam-readcount` output and move to `intermediate` directory
 - Add `ncbi_build` parameter

diff --git a/README.md b/README.md
@@ -182,18 +182,67 @@ input:
 ### input.config ([see template](config/template.config))
 | Input | Required | Type   | Description                               |
 |--------|---|--------|-------------------------------------------|
-| algorithm   | yes | list   | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
-| reference   | yes | string | The reference .fa file (.fai and .dict file must exist in same directory) |
-| intersect_regions* | yes | string | A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended *
-| output_dir  | yes | string | The location where outputs will be saved  |
-| dataset_id | yes | string | The name/ID of the dataset    |
-| exome       | yes | boolean | The option will be used by `Strelka2` and `MuSE`. When `true`, it will add the `--exome` option  to Manta and Strelka2, and `-E` option to MuSE |
-| save_intermediate_files | yes | boolean | Whether to save intermediate files |
-| work_dir | no | string | The path of working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively |
-| docker_container_registry | no | string | Registry containing tool Docker images, optional. Default: `ghcr.io/uclahs-cds` |
+| `algorithm`   | yes | list   | List containing a combination of somaticsniper, strelka2, mutect2 and muse |
+| `reference`   | yes | string | The reference .fa file (.fai and .dict file must exist in same directory) |
+| `intersect_regions`* | yes | string | A bed file listing the genomic regions for variant calling. Excluding `decoy` regions is HIGHLY recommended* |
+| `output_dir`  | yes | string | The location where outputs will be saved  |
+| `dataset_id` | yes | string | The name/ID of the dataset    |
+| `exome`       | yes | boolean | The option will be used by `Strelka2` and `MuSE`. When `true`, it will add the `--exome` option  to Manta and Strelka2, and `-E` option to MuSE |
+| `save_intermediate_files` | yes | boolean | Whether to save intermediate files |
+| `work_dir` | no | string | The path of working directory for Nextflow, storing intermediate files and logs. The default is `/scratch` with `ucla_cds` and should only be changed for testing/development. Changing this directory to `/hot` or `/tmp` can lead to high server latency and potential disk space limitations, respectively |
+| `docker_container_registry` | no | string | Registry containing tool Docker images, optional. Default: `ghcr.io/uclahs-cds` |
+| `base_resource_update` | no | namespace | Namespace of parameters to update base resource allocations in the pipeline. Usage and structure are detailed in `template.config` and below. |
 
  *Providing `intersect_regions` is required and will limit the final output to just those regions.  All regions of the reference genome could be provided as a `bed` file with all contigs, however it is HIGHLY recommended to remove `decoy` contigs from the human reference genome. Including these thousands of small contigs will require the user to increase available memory for `Mutect2` and will cause a very long runtime for `Strelka2`. See [Discussion here](https://github.com/uclahs-cds/pipeline-call-sSNV/discussions/216). A GRCh38 `bed.gz` file can be found here: `/hot/ref/tool-specific-input/pipeline-call-sSNV-6.0.0/GRCh38-BI-20160721/Homo_sapiens_assembly38_no-decoy.bed.gz`. For other genome versions, you may be able to use [UCSC Liftover](https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert.
 
+ ### Base resource allocation updaters
+To update the base resource allocations (`cpus` or `memory`) for processes, add a `base_resource_update` namespace with the following structure to the `params` block of the [input.config](config/template.config) file. The default allocations are defined in the [node-specific config files](./config/).
+
+```Nextflow
+base_resource_update {
+    memory = [
+        [['process_name', 'process_name2'], <multiplier for resource>],
+        [['process_name3', 'process_name4'], <different multiplier for resource>]
+    ]
+    cpus = [
+        [['process_name', 'process_name2'], <multiplier for resource>],
+        [['process_name3', 'process_name4'], <different multiplier for resource>]
+    ]
+}
+```
+> **Note** Resource updates are applied in the order they are listed, so a process that appears in more than one entry of the same list (for example, twice under `memory`) is updated once per entry, in that order.
+
+Examples:
+
+- To double memory of all processes:
+```Nextflow
+base_resource_update {
+    memory = [
+        [[], 2]
+    ]
+}
+```
+- To double memory for `call_sSNV_Mutect2` and triple memory for `run_validate_PipeVal` and `run_sump_MuSE`:
+```Nextflow
+base_resource_update {
+    memory = [
+        ['call_sSNV_Mutect2', 2],
+        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
+    ]
+}
+```
+- To double CPUs and memory for `run_sump_MuSE` and double memory for `run_validate_PipeVal`:
+```Nextflow
+base_resource_update {
+    cpus = [
+        ['run_sump_MuSE', 2]
+    ]
+    memory = [
+        [['run_sump_MuSE', 'run_validate_PipeVal'], 2]
+    ]
+}
+```
+
 #### Module Specific Configuration
 | Input       | Required | Type   | Description                               |
 |-------------|----|--------|-------------------------------------------|
@@ -213,7 +262,13 @@ input:
 #### MuSE Specific Configuration
 | Input       | Required | Type   | Description                               |
 |-------------|----|--------|-------------------------------------------|
-| dbSNP | yes | path | The path to dbSNP database's `*.vcf.gz` |
+| dbSNP | yes | path | The path to [NCBI's dbSNP database](https://www.ncbi.nlm.nih.gov/snp/) of known SNPs in VCF format, e.g. `GCF_000001405.40.gz` |
+
+#### Variant Intersection Specific Configuration
+| Input       | Required | Type   | Description                               |
+|-------------|----|--------|-------------------------------------------|
+| ncbi_build | yes | string | vcf2maf requires the reference genome build ID, e.g. GRCh38 |
+| vcf2maf_extra_args | no | string | Additional arguments for the vcf2maf command |
 
 ## Outputs
 | Tool Outputs                                         | Type         | Description                   |
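
Reviewer note on the README note about ordering: the fragment below is a small sketch of what "updated twice" means in practice. It assumes that multipliers compound, which matches the `<multiplier for resource>` wording in the README; the actual arithmetic lives in `methods.update_base_resource_allocation`, which is not part of this diff.

```Nextflow
base_resource_update {
    memory = [
        // First entry: double memory for every process
        [[], 2],
        // Second entry: double call_sSNV_Mutect2 again.
        // Assuming multipliers compound, it ends up at 4x its base memory.
        ['call_sSNV_Mutect2', 2]
    ]
}
```

If the intent is for `call_sSNV_Mutect2` to end at 2x rather than 4x, it should appear in only one entry.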

diff --git a/config/custom_schema_types.config b/config/custom_schema_types.config
@@ -3,7 +3,12 @@
 */
 custom_schema_types {
     allowed_sample_types = [
-        'tumor', 'normal'
+        'normal',
+        'tumor'
+    ]
+    allowed_resource_types = [
+        'memory',
+        'cpus'
     ]
 
     /**
@@ -16,6 +21,14 @@ custom_schema_types {
             }
         }
     }
+
+    /**
+    *   Check if input is a String or GString
+    */
+    is_string = { val ->
+        return (val in String || val in GString)
+    }
+
     /**
     * Check if given input is a Namespace
     */
@@ -24,6 +37,7 @@ custom_schema_types {
             throw new Exception("${name} should be a Namespace, not ${val.getClass()}.")
         }
     }
+
     /**
     * Check if given input is a list
     */
@@ -32,6 +46,33 @@ custom_schema_types {
             throw new Exception("${name} should be a List, not ${val.getClass()}.")
         }
     }
+
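+    /**
+    *   Check if given input is a number
+    */
+    check_if_number = { val, String name ->
+        // Decimal literals such as 1.5 are parsed as BigDecimal in Groovy, so accept those as well
+        if (!(val in Integer || val in Float || val in BigDecimal)) {
+            throw new Exception("${name} should be an Integer, Float, or BigDecimal, not ${val.getClass()}")
+        }
+    }
+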
+    /**
+    *   Check if given input is valid process list
+    */
+    check_if_process_list = { val, String name ->
+        if (custom_schema_types.is_string(val)) {
+            if (val.isEmpty()) {
+                throw new Exception("Empty string specified for ${name}. Please provide valid input.")
+            }
+        } else {
+            try {
+                custom_schema_types.check_if_list(val, name)
+            } catch(Exception e) {
+                throw new Exception("${name} should be either a string or a list. Please provide valid input.")
+            }
+        }
+    }
+
     /**
     * Check that input is namespace of expected types
     */
@@ -48,6 +89,24 @@ custom_schema_types {
         }
     }
 
+    /**
+    *   Check namespace for resource updates
+    */
+    check_resource_update_namespace = { Map options, String name, Map properties ->
+        custom_schema_types.check_if_namespace(options[name], name)
+        def given_keys = options[name].keySet() as ArrayList
+        if (given_keys.size() <= 0) {
+            return
+        }
+        custom_schema_types.check_sample_type_keys(given_keys, name, custom_schema_types.allowed_resource_types)
+
+        options[name].each { entry ->
+            def entry_as_map = [:]
+            entry_as_map[entry.key] = entry.value
+            schema.validate_parameter(entry_as_map, entry.key, properties.elements[entry.key])
+        }
+    }
+
     /**
     * Check if proper BAM entry list
     */
@@ -61,8 +120,21 @@ custom_schema_types {
         }
     }
 
+    /**
+    *   Check list of resource updates
+    */
+    check_resource_update_list = { Map options, String name, Map properties ->
+        custom_schema_types.check_if_list(options[name], name)
+        for (item in options[name]) {
+            custom_schema_types.check_if_process_list(item[0], name)
+            custom_schema_types.check_if_number(item[1], name)
+        }
+    }
+
     types = [
         'InputNamespace': custom_schema_types.check_input_namespace,
-        'BAMEntryList': custom_schema_types.check_bam_list
+        'BAMEntryList': custom_schema_types.check_bam_list,
+        'ResourceUpdateNamespace': custom_schema_types.check_resource_update_namespace,
+        'ResourceUpdateList': custom_schema_types.check_resource_update_list
     ]
 }
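
Reviewer note: to make the new validators easier to reason about, here is a standalone Groovy sketch of which `[processes, multiplier]` entries would pass or fail. The closures are simplified stand-ins written for this note (`isString`, `checkProcessList`, `checkNumber`, and `validateEntry` are hypothetical names, and `checkNumber` accepts any `Number` for brevity); they are not the pipeline's own functions.

```groovy
// Simplified stand-ins for the checks added in custom_schema_types.config (illustrative only).
def isString = { val -> val instanceof CharSequence }   // covers both String and GString

def checkProcessList = { val, String name ->
    if (isString(val)) {
        if (val.toString().isEmpty()) {
            throw new Exception("Empty string specified for ${name}.")
        }
    } else if (!(val instanceof List)) {
        throw new Exception("${name} should be either a string or a list.")
    }
}

// Assumption for this sketch: any Number is accepted as a multiplier.
def checkNumber = { val, String name ->
    if (!(val instanceof Number)) {
        throw new Exception("${name} should be a number, not ${val.getClass()}")
    }
}

// Returns null when the entry passes, otherwise the error message.
def validateEntry = { entry ->
    try {
        checkProcessList(entry[0], 'memory')
        checkNumber(entry[1], 'memory')
        return null
    } catch (Exception e) {
        return e.message
    }
}

assert validateEntry(['call_sSNV_Mutect2', 2]) == null                  // single process name
assert validateEntry([['run_validate_PipeVal', 'run_sump_MuSE'], 3]) == null
assert validateEntry([[], 2]) == null                                   // empty list is a valid list
assert validateEntry(['', 2]) != null                                   // empty string is rejected
assert validateEntry([42, 2]) != null                                   // neither string nor list
assert validateEntry(['call_sSNV_Mutect2', '2']) != null                // multiplier must be numeric
```

Note that neither this sketch nor `check_resource_update_list` verifies that each entry has exactly two elements, so a malformed entry such as `['call_sSNV_Mutect2']` would surface as a less specific error.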
diff --git a/config/methods.config b/config/methods.config
@@ -168,12 +168,25 @@ methods {
             }
         }
     }
+    modify_base_allocations = {
+        if (!(params.containsKey('base_resource_update') && params.base_resource_update)) {
+            return
+        }
+
+        params.base_resource_update.each { resource, updates ->
+            updates.each { processes, multiplier ->
+                def processes_to_update = (custom_schema_types.is_string(processes)) ? [processes] : processes
+                methods.update_base_resource_allocation(resource, multiplier, processes_to_update)
+            }
+        }
+    }
 
     setup = {
         schema.load_custom_types("${projectDir}/config/custom_schema_types.config")
         schema.validate()
         methods.set_process()
         methods.set_resources_allocation()
+        methods.modify_base_allocations()
         retry.setup_retry()
         methods.set_env()
         methods.set_sample_params()
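
Reviewer note: the loop in `modify_base_allocations` can be exercised outside Nextflow with a small Groovy sketch. The `allocations` map and its values are made up, and `updateAllocation` is a hypothetical stand-in for `methods.update_base_resource_allocation` (defined outside this diff) that simply multiplies the resource; the point is only to show how a bare process name and a list of names take the same path.

```groovy
// Made-up base allocations (memory in GB); the real values live in the node-specific configs.
def allocations = [
    call_sSNV_Mutect2: [cpus: 1, memory: 4],
    run_validate_PipeVal: [cpus: 1, memory: 1],
    run_sump_MuSE: [cpus: 2, memory: 8]
]

// Hypothetical stand-in for methods.update_base_resource_allocation:
// multiplies the named resource for each listed process.
def updateAllocation = { String resource, Number multiplier, List processes ->
    processes.each { name -> allocations[name][resource] *= multiplier }
}

// Same shape as the base_resource_update namespace documented in the README.
def baseResourceUpdate = [
    memory: [
        ['call_sSNV_Mutect2', 2],
        [['run_validate_PipeVal', 'run_sump_MuSE'], 3]
    ],
    cpus: [
        ['run_sump_MuSE', 2]
    ]
]

baseResourceUpdate.each { resource, updates ->
    // Each update is a two-element list, which Groovy spreads across the two closure parameters.
    updates.each { processes, multiplier ->
        // A bare process name is wrapped in a list so both forms are handled identically.
        def processList = (processes instanceof CharSequence) ? [processes] : processes
        updateAllocation(resource, multiplier, processList)
    }
}

assert allocations.call_sSNV_Mutect2.memory == 8
assert allocations.run_validate_PipeVal.memory == 3
assert allocations.run_sump_MuSE.memory == 24
assert allocations.run_sump_MuSE.cpus == 4
```

Because `modify_base_allocations` runs after `set_resources_allocation` and before `retry.setup_retry` in `setup`, the multipliers act on the base values that any retry logic then starts from.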

diff --git a/config/schema.yaml b/config/schema.yaml
@@ -96,6 +96,19 @@ patient_id:
   type: 'String'
   required: true
   help: 'Patient identifier'
+base_resource_update:
+  type: 'ResourceUpdateNamespace'
+  required: false
+  help: 'User-defined modifications for adjusting base resource allocations for processes'
+  elements:
+    memory:
+      type: 'ResourceUpdateList'
+      required: false
+      help: 'List of memory updates'
+    cpus:
+      type: 'ResourceUpdateList'
+      required: false
+      help: 'List of CPU updates'
 input:
   type: 'InputNamespace'
   required: true

diff --git a/config/template.config b/config/template.config
@@ -17,16 +17,16 @@ params {
     output_dir = ''
     work_dir = ''
     dataset_id = ''
-    // set params.exome to TRUE will add the '--exome' option when running manta and strelka2
+    // set params.exome to TRUE will add the '--exome' option when running Manta and Strelka2
     // set params.exome to TRUE will add the '-E' option when running MuSE
     exome = false
     save_intermediate_files = false
 
-    // module options
+    // Module specific options
     bgzip_extra_args = ''
     tabix_extra_args = ''
 
-    // mutect2 options
+    // Mutect2 options
     split_intervals_extra_args = ''
     mutect2_extra_args = ''
     filter_mutect_calls_extra_args = ''
@@ -37,9 +37,13 @@ params {
     // MuSE options
     dbSNP = '/hot/ref/database/dbSNP-155/original/GRCh38/GCF_000001405.39.gz'
 
-    // Intersect options
+    // Variant Intersection options
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
+
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations    
 }
 
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-all-tools.config b/test/config/a_mini-all-tools.config
@@ -37,5 +37,10 @@ params {
     // Intersect options
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
+
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations    
 }
+
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-muse.config b/test/config/a_mini-muse.config
@@ -29,7 +29,7 @@ params {
     mutect2_extra_args = ''
     filter_mutect_calls_extra_args = ''
     gatk_command_mem_diff = 500.MB
-    scatter_count = 12
+    scatter_count = 50
     germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'
 
     // MuSE options
@@ -39,6 +39,9 @@ params {
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
 
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations
 }
 
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-mutect2.config b/test/config/a_mini-mutect2.config
@@ -38,6 +38,10 @@ params {
     // Intersect options
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
+
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations    
 }
 
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-somaticsniper.config b/test/config/a_mini-somaticsniper.config
@@ -29,7 +29,7 @@ params {
     mutect2_extra_args = ''
     filter_mutect_calls_extra_args = ''
     gatk_command_mem_diff = 500.MB
-    scatter_count = 12
+    scatter_count = 50
     germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'
 
     // MuSE options
@@ -38,6 +38,10 @@ params {
     // Intersect options
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
+
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations    
 }
 
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-strelka2.config b/test/config/a_mini-strelka2.config
@@ -29,7 +29,7 @@ params {
     mutect2_extra_args = ''
     filter_mutect_calls_extra_args = ''
     gatk_command_mem_diff = 500.MB
-    scatter_count = 12
+    scatter_count = 50
     germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'
 
     // MuSE options
@@ -38,6 +38,10 @@ params {
     // Intersect options
     ncbi_build = 'GRCh38'
     vcf2maf_extra_args = ''
+
+    // Base resource allocation updater
+    // See README for adding parameters to update the base resource allocations    
 }
 
+// Set up the pipeline config. DO NOT REMOVE THIS LINE!
 methods.setup()
diff --git a/test/config/a_mini-two-tools.config b/test/config/a_mini-two-tools.config
@@ -28,7 +28,7 @@ params {
     mutect2_extra_args = ''
     filter_mutect_calls_extra_args = ''
     gatk_command_mem_diff = 500.MB
-    scatter_count = 12
+    scatter_count = 50
     germline_resource_gnomad_vcf = '/hot/ref/tool-specific-input/GATK/GRCh38/af-only-gnomad.hg38.vcf.gz'
 
     // MuSE options