Skip to content

An R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline

License

Notifications You must be signed in to change notification settings

xihaoli/STAARpipeline

Repository files navigation

R build status Build status License: GPL v3

STAARpipeline

This is an R package for performing association analysis of whole-genome/whole-exome sequencing (WGS/WES) studies using STAARpipeline.

Description

STAARpipeline is an R package for phenotype-genotype association analyses of biobank-scale WGS/WES data, including single variant analysis and variant set analysis. The single variant analysis in STAARpipeline provides individual P values of variants given an MAF or MAC cut-off. The variant set analysis in STAARpipeline includes gene-centric analysis and non-gene-centric analysis of rare variants. The gene-centric coding analysis provides seven genetic categories: putative loss of function (pLoF), protein-truncating (ptv), missense, disruptive missense, pLoF and disruptive missense, ptv and disruptive missense, and synonymous. The gene-centric noncoding analysis provides eight genetic categories: promoter or enhancer overlaid with CAGE or DHS sites, UTR, upstream, downstream, and noncoding RNA genes. The non-gene-centric analysis includes sliding window analysis with fixed sizes and dynamic window analysis with data-adaptive sizes. STAARpipeline accounts for population structure and relatedness, and is computationally scalable for analyzing biobank-scale WGS/WES studies of continuous and dichotomous traits with balanced or imbalanced case-control ratios, as well as multiple correlated traits jointly. STAARpipeline also provides analytical follow-up of dissecting association signals independent of known variants via conditional analysis using STAARpipelineSummary.

STAARpipeline and STAARpipelineSummary are implemented as a collection of apps. Please see the apps staarpipeline, staarpipelinesummary_varset and staarpipelinesummary_indvar that run on the UK Biobank Research Analysis Platform for more details.

Workflow Overview

STAARpipeline_workflow

Prerequisites

R (recommended version >= 3.5.1)

For optimal computational performance, it is recommended to use an R version configured with the Intel Math Kernel Library (or other fast BLAS/LAPACK libraries). See the instructions on building R with Intel MKL.

Dependencies

STAARpipeline links to R packages Rcpp and RcppArmadillo, and also imports R packages Rcpp, STAAR, MultiSTAAR, SCANG, dplyr, SeqArray, SeqVarTools, GenomicFeatures, TxDb.Hsapiens.UCSC.hg38.knownGene, GMMAT, GENESIS, Matrix. These dependencies should be installed before installing STAARpipeline.

Installation

library(devtools)
devtools::install_github("xihaoli/STAARpipeline",ref="main")

Docker Image

A docker image for STAARpipeline, including R (version 3.6.1) built with Intel MKL and all STAAR-related packages (STAAR, MultiSTAAR, SCANG, STAARpipeline, STAARpipelineSummary) pre-installed, is located in the Docker Hub. The docker image can be pulled using

docker pull zilinli/staarpipeline:0.9.7

Usage

Please see the STAARpipeline user manual for detailed usage of STAARpipeline package. Please see the STAARpipeline tutorial for a detailed example of analyzing sequencing data using STAARpipeline.

Data Availability

The whole-genome functional annotation data assembled from a variety of sources and the precomputed annotation principal components are available at the Functional Annotation of Variant - Online Resource (FAVOR) site and FAVOR Essential Database.

Version

The current version is 0.9.7.2 (November 17, 2024).

Citation

If you use STAARpipeline and STAARpipelineSummary for your work, please cite:

Zilin Li*, Xihao Li*, Hufeng Zhou, Sheila M. Gaynor, Margaret Sunitha Selvaraj, Theodore Arapoglou, Corbin Quick, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K. Arnett, Paul L. Auer, Lawrence F. Bielak, Joshua C. Bis, Thomas W. Blackwell, John Blangero, Eric Boerwinkle, Donald W. Bowden, Jennifer A. Brody, Brian E. Cade, Matthew P. Conomos, Adolfo Correa, L. Adrienne Cupples, Joanne E. Curran, Paul S. de Vries, Ravindranath Duggirala, Nora Franceschini, Barry I. Freedman, Harald H. H. Göring, Xiuqing Guo, Rita R. Kalyani, Charles Kooperberg, Brian G. Kral, Leslie A. Lange, Bridget M. Lin, Ani Manichaikul, Alisa K. Manning, Lisa W. Martin, Rasika A. Mathias, James B. Meigs, Braxton D. Mitchell, May E. Montasser, Alanna C. Morrison, Take Naseri, Jeffrey R. O’Connell, Nicholette D. Palmer, Patricia A. Peyser, Bruce M. Psaty, Laura M. Raffield, Susan Redline, Alexander P. Reiner, Muagututi’a Sefuiva Reupena, Kenneth M. Rice, Stephen S. Rich, Jennifer A. Smith, Kent D. Taylor, Margaret A. Taub, Ramachandran S. Vasan, Daniel E. Weeks, James G. Wilson, Lisa R. Yanek, Wei Zhao, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Jerome I. Rotter, Cristen J. Willer, Pradeep Natarajan, Gina M. Peloso, & Xihong Lin. (2022). A framework for detecting noncoding rare variant associations of large-scale whole-genome sequencing studies. Nature Methods, 19(12), 1599-1611. PMID: 36303018. PMCID: PMC10008172. DOI: 10.1038/s41592-022-01640-x.

Xihao Li*, Zilin Li*, Hufeng Zhou, Sheila M. Gaynor, Yaowu Liu, Han Chen, Ryan Sun, Rounak Dey, Donna K. Arnett, Stella Aslibekyan, Christie M. Ballantyne, Lawrence F. Bielak, John Blangero, Eric Boerwinkle, Donald W. Bowden, Jai G. Broome, Matthew P. Conomos, Adolfo Correa, L. Adrienne Cupples, Joanne E. Curran, Barry I. Freedman, Xiuqing Guo, George Hindy, Marguerite R. Irvin, Sharon L. R. Kardia, Sekar Kathiresan, Alyna T. Khan, Charles L. Kooperberg, Cathy C. Laurie, X. Shirley Liu, Michael C. Mahaney, Ani W. Manichaikul, Lisa W. Martin, Rasika A. Mathias, Stephen T. McGarvey, Braxton D. Mitchell, May E. Montasser, Jill E. Moore, Alanna C. Morrison, Jeffrey R. O'Connell, Nicholette D. Palmer, Akhil Pampana, Juan M. Peralta, Patricia A. Peyser, Bruce M. Psaty, Susan Redline, Kenneth M. Rice, Stephen S. Rich, Jennifer A. Smith, Hemant K. Tiwari, Michael Y. Tsai, Ramachandran S. Vasan, Fei Fei Wang, Daniel E. Weeks, Zhiping Weng, James G. Wilson, Lisa R. Yanek, NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, TOPMed Lipids Working Group, Benjamin M. Neale, Shamil R. Sunyaev, Gonçalo R. Abecasis, Jerome I. Rotter, Cristen J. Willer, Gina M. Peloso, Pradeep Natarajan, & Xihong Lin. (2020). Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nature Genetics, 52(9), 969-983. PMID: 32839606. PMCID: PMC7483769. DOI: 10.1038/s41588-020-0676-4.

License

This software is licensed under GPLv3.

GPLv3 GNU General Public License, GPLv3