spPGS is a bioinformatics pipeline for calculating Polygenic Scores (PGS) from the PGS Catalog on a set of samples so that they can be compared to the Spanish PGS reference distributions. The pipeline is designed to fix common VCF malformations, impute missing values, and generate a sample sheet to feed the pgsc_calc tool.
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker/Singularity containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.
- Software dependencies
- Nextflow >= 22.10.6
- Singularity or Docker containerization software
- Resources
- Reference genome: we recommend hs37d5, which includes data from GRCh37, the rCRS mitochondrial sequence, Human herpesvirus 4 type 1 and the concatenated decoy sequences.
- SHAPEIT4 genetic maps: maps for b37 can be downloaded directly from the SHAPEIT4 GitHub repository.
- Minimac4 reference panel: we recommend the 1000 Genomes Phase 3 reference panel.
- Samples
- For an accurate phasing step, we recommend analyzing 20 or more samples at the same time.
- VCFs are expected to be in hg19 or GRCh37. GRCh38 is not supported yet. If needed, the Picard LiftoverVcf tool can be used to remap VCFs.
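The 20-sample recommendation above can be checked before launching a run. The following is only a minimal sketch: it assumes one sample per VCF file, named `*.vcf` or `*.vcf.gz` and stored directly in the input folder, which may not match your layout. The function names `count_vcfs` and `check_sample_count` are hypothetical helpers and are not part of spPGS.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of spPGS.
# Assumption: one sample per file, named *.vcf or *.vcf.gz,
# stored directly in the input folder.

# Print the number of VCF files found directly inside the given folder.
count_vcfs() {
    local folder="$1"
    find "$folder" -maxdepth 1 -type f \( -name '*.vcf' -o -name '*.vcf.gz' \) \
        | wc -l | tr -d ' '
}

# Warn (and return non-zero) if fewer VCFs than the minimum are present.
# The minimum defaults to 20, the number recommended for phasing.
check_sample_count() {
    local folder="$1"
    local minimum="${2:-20}"
    local n
    n="$(count_vcfs "$folder")"
    if [ "$n" -lt "$minimum" ]; then
        echo "WARNING: found $n VCF files in $folder; $minimum or more are recommended"
        return 1
    fi
    echo "OK: found $n VCF files in $folder"
}
```

After sourcing the script, `check_sample_count input_folder` prints a warning when the folder holds fewer than 20 VCF files. For multi-sample VCFs the file count is not the sample count, so the check would need to be adapted.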
The nextflow.config included in this workflow has profiles for Singularity and Docker, and the pipeline can be executed locally or on a standard Slurm HPC cluster. Other executors such as SGE, AWS, Google Cloud or Kubernetes can be used with minor changes to nextflow.config. Although it is not mandatory, we recommend creating a profile that fits the specific requirements of your institution. See Nextflow executors for additional information.
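As an illustration of an institution-specific profile, the sketch below writes a small custom configuration file. The profile name `my_institute`, the queue name `short` and the file name `custom.config` are placeholders chosen for this example; the configuration scopes (`profiles`, `process.executor`, `process.queue`, `singularity.enabled`, `singularity.autoMounts`) are standard Nextflow settings, but the values should be adapted to your cluster.

```shell
# Write a hypothetical institution-specific profile to custom.config.
# The profile name, queue name and file name are placeholders.
cat > custom.config <<'EOF'
profiles {
    my_institute {
        process {
            executor = 'slurm'   // submit each process as a Slurm job
            queue    = 'short'   // placeholder partition name
        }
        singularity {
            enabled    = true    // run processes inside Singularity containers
            autoMounts = true    // mount host paths automatically
        }
    }
}
EOF
```

The file could then be passed to the pipeline with `-c custom.config -profile my_institute` alongside the other parameters.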
First, the sample sheet and the imputed VCF files have to be generated. You can do this for a set of genomes in a Slurm environment by running the following command:
$ nextflow run babelomics/spPGS \
-profile slurm,singularity \
-resume \
--input_folder input_folder \
--exome 'false' \
--shapeit4_map_files '/resources/shapeit4/chr*.b37.gmap.gz' \
--minimac4_ref_files '/resources/minimac4/*.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz' \
--reference_genome '/resources/ref/hs37d5.fa'
To see the full list of pipeline parameters, use:
$ nextflow run babelomics/spPGS --help
Note that -profile and -resume are Nextflow parameters and are therefore preceded by a single hyphen. To run the pipeline locally in a Docker environment, use -profile docker. Custom configuration files can be included with the -config nextflow.config option. To see additional Nextflow parameters, use nextflow run -help.
Once the sample sheet has been generated, you can calculate any polygenic score from the PGS Catalog by running:
$ nextflow run pgscatalog/pgsc_calc \
-profile docker \
-resume \
--input results/psc_calc_samplesheet.txt \
--target_build GRCh37 \
--pgs_id PGS000021
Your results can then be compared with the Spanish PGS reference distribution.
A manuscript describing the tool is in preparation. In the meantime, if you use the tool, we ask you to cite the repository and the paper describing the CSVS resource:
- Peña-Chilet M et al. CSVS, a crowdsourcing database of the Spanish population genetic variability. Nucleic Acids Res. 2021 Jan 8;49(D1):D1130-D1137. doi: 10.1093/nar/gkaa794.