FlashPCA performs fast principal component analysis (PCA) of single nucleotide polymorphism (SNP) data, similar to smartpca from EIGENSOFT (http://www.hsph.harvard.edu/alkes-price/software/) and shellfish (https://github.com/dandavison/shellfish). FlashPCA is based on the https://github.com/yixuan/spectra/ library.
Main features:
- Fast: partial PCA (k=20 dimensions) of 500,000 individuals with 100,000 SNPs in <6h using 2GB RAM
- Scalable: memory requirements are bounded, scales to at least 1M individuals
- Highly accurate results
- Natively reads PLINK bed/bim/fam files
- Easy to use; can be called entirely within R (package flashpcaR)
Google Groups: https://groups.google.com/forum/#!forum/flashpca-users
Gad Abraham, gad.abraham@baker.edu.au
version ≥2: G. Abraham, Y. Qiu, and M. Inouye, ``FlashPCA2: principal component analysis of biobank-scale genotype datasets'', (2017) Bioinformatics 33(17): 2776-2778. doi:10.1093/bioinformatics/btx299 (bioRxiv preprint https://doi.org/10.1101/094714)
version ≤1.2.6: G. Abraham and M. Inouye, ``Fast Principal Component Analysis of Large-Scale Genome-Wide Data'', (2016) PLOS ONE 9(4): e93766. doi:10.1371/journal.pone.0093766
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
Copyright (C) 2014-2020 Gad Abraham. All rights reserved.
Portions of this code are based on SparSNP (https://github.com/gabraham/SparSNP), Copyright (C) 2011-2012 Gad Abraham and National ICT Australia (http://www.nicta.com.au).
- We recommend compiling from source for best performance.
- To get the devel version, you'll need to compile yourself
See Releases for statically-linked version for Linux x86-64 ≥ 2.6.15
- 64-bit Linux or Mac
To get the latest version:
git clone git://github.com/gabraham/flashpca
On Linux:
- 64-bit OS
- g++ compiler
- Eigen (http://eigen.tuxfamily.org), v3.2 or higher
(if you get a compile error
error: no match for 'operator/' in '1 / ((Eigen::MatrixBase...
you'll need a more recent Eigen) - Spectra (https://github.com/yixuan/spectra/)
- Boost (http://www.boost.org/), specifically boost_program_options/boost_program_options-mt.
- libgomp for openmp support
- Recommended: plink2 (https://www.cog-genomics.org/plink2) for SNP thinning
On Mac:
- Homebrew (http://brew.sh) to install boost
- Eigen, as above
- Spectra, as above
- clang C++ compiler
The Makefile contains three variables that need to be set according to where you have installed the Eigen headers and Boost headers and libraries on your system. The default values for these are:
EIGEN_INC=/usr/local/include/eigen
BOOST_INC=/usr/local/include/boost
BOOST_LIB=/usr/local/lib
SPECTRA_INC=spectra
If your system has these libraries and header files in those locations, you can simply run make:
cd flashpca
make all
If not, you can override their values on the make command line. For example,
if you have the Eigen source in /opt/eigen-3.2.5
, spectra headers in
/opt/spectra
, and Boost 1.59.0 installed into /opt/boost-1.59.0
, you could run:
cd flashpca
make all EIGEN_INC=/opt/eigen-3.2.5 \
BOOST_INC=/opt/boost-1.59.0/include \
BOOST_LIB=/opt/boost-1.59.0/lib \
SPECTRA_INC=/opt/spectra
Example of how to run flashpca in a Docker container
First thin the data by LD (highly recommend plink2 for this):
plink --bfile data --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions_hg19.txt
plink --bfile data --extract plink.prune.in --make-bed --out data_pruned
where exclusion_regions_hg19.txt contains:
5 44000000 51500000 r1
6 25000000 33500000 r2
8 8000000 12000000 r3
11 45000000 57000000 r4
(You may need to change the --indep-pairwise parameters to get a suitable number of SNPs for you dataset, 10,000-50,000 is usually enough.)
To run on the pruned dataset:
./flashpca --bfile data_pruned
To append a custom suffix '_mysuffix.txt' to all output files:
./flashpca --suffix _mysuffix.txt ...
To see all options
./flashpca --help
By default, flashpca produces the following files:
eigenvectors.txt
: the top k eigenvectors of the covariance X XT / p, same as matrix U from the SVD of the genotype matrix X/sqrt(p)=UDVT (where p is the number of SNPs).pcs.txt
: the top k principal components (the projection of the data on the eigenvectors, scaled by the eigenvalues, same as XV (or UD). This is the file you will want to plot the PCA plot from.eigenvalues.txt
: the top k eigenvalues of X XT / p. These are the square of the singular values D (square of sdev from prcomp).pve.txt
: the proportion of total variance explained by each of the top k eigenvectors (the total variance is given by the trace of the covariance matrix X XT / p, which is the same as the sum of all eigenvalues). To get the cumulative variance explained, simply do the cumulative sum of the variances (cumsum
in R).
You must perform quality control using PLINK (at least filter using --geno, --mind, --maf, --hwe) before running flashpca on your data. You will likely get spurious results otherwise.
flashpca can project new samples onto existing principal components:
./flashpca --bfile newdata --project --inmeansd meansd.txt \
--outproj projections.txt --inload loadings.txt -v
To project data, you must ensure:
- The old and new PLINK files contain exactly the same SNPs and alleles (you
can use
plink --reference-allele ...
to ensure consistent allele ordering). - You have previously run flashpca and saved the SNP loadings
(
--outload loadings.txt
) and their means and standard deviations (--outmeansd meansd.txt
). - You are using the same standardisation (
--standx
) for the old and new data, as well as the same divisor (--div
; by defaultp
).
flashpca can check how accurate a decomposition is, where accuracy is defined as || X XT / p - U D2 ||F2 / (n × k).
This is done using
./flashpca --bfile data --check \
--outvec eigenvectors.txt --outval eigenvalues.txt
The final mean squared error should be low (e.g., <1e-8).
Unlike EIGENSOFT/smartpca, flashpca does not remove outliers automatically
(numoutlieriter
in EIGENSOFT). We recommend inspecting the PCA plot
manually, and if you wish to remove outliers and repeat PCA on the remaining
samples, use plink --remove
to create a new bed/bim/fam fileset and run
flashpca on the new data.
- flashpca now supports sparse CCA (Parkhomenko 2009, Witten 2009), between SNPs and multivariate phenotypes.
- The phenotype file is the same as PLINK phenotype file:
FID, IID, pheno1, pheno2, pheno3, ...
except that there must be no header line. The phenotype file must be in the same order as the FAM file. - The L1 penalty for the SNPs is
--lambda1
and for the phenotypes is--lambda2
.
./flashpca --scca --bfile data --pheno pheno.txt \
--lambda1 1e-3 --lambda2 1e-2 --ndim 10 --numthreads 8
- The file eigenvectorsX.txt are the left eigenvectors of XT Y, with size (number of SNPs × number of dimensions), and eigenvectorsY.txt are the right eigenvectors of XT Y, with size (number of phenotypes &\times; number of dimensions).
We optimise the penalties by finding the values that maximise the correlation of the canonical components cor(X U, Y V) in independent test data.
- Wrapper script scca.sh (GNU parallel is recommended)
- R code for plotting the correlations scca_pred.R
FlashPCA can be called (almost) entirely within R.
-
See Releases for pre-packaged flashpcaR packages for Windows, Mac, and Linux.
-
Compiling from source
devtools::install_github("gabraham/flashpca/flashpcaR")
This may fail on Windows due to symbolic link issues; in that case use
git clone https://github.com/gabraham/flashpca cd flashpca R CMD build flashpcaR R CMD INSTALL flashpcaR_2.0.tar.gz
Depends on packages:
Suggests:
data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
dim(X)
f <- flashpca(X, ndim=10, scale="none")
You can supply a path to a PLINK dataset (with extensions .bed/.bim/.fam, all lowercase):
fn <- gsub("\\.bed", "",
system.file("extdata", "data_chr1.bed", package="flashpcaR"))
fn
f <- flashpca(fn, ndim=10)
Use HapMap3 genotypes, standardise them, simulate some phenotypes, and test each SNP for association with all phenotypes:
data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
k <- 10
B <- matrix(rnorm(ncol(X) * k), ncol=k)
Y <- X %*% B + rnorm(nrow(X) * k)
f1 <- ucca(X, Y, standx="none", standy="sd")
head(f1$result)
fn <- gsub("\\.bed", "",
system.file("extdata", "data_chr1.bed", package="flashpcaR"))
fn
f2 <- ucca(fn, Y, standx="binom2", standy="sd")
head(f2$result)
Use HapMap3 genotypes, standardise them, simulate some phenotypes, and run sparse canonical correlation analysis over all SNPs and all phenotypes:
data(hm3.chr1)
X <- scale2(hm3.chr1$bed)
k <- 10
B <- matrix(rnorm(ncol(X) * k), ncol=k)
Y <- X %*% B + rnorm(nrow(X) * k)
f1 <- scca(X, Y, standx="none", standy="sd", lambda1=1e-2, lambda2=1e-3)
diag(cor(f1$Px, f1$Py))
# 3-fold cross-validation
cv1 <- cv.scca(X, Y, standx="sd", standy="sd",
lambda1=seq(1e-3, 1e-1, length=10), lambda2=seq(1e-6, 1e-3, length=5),
ndim=3, nfolds=3)
# Plot the canonical correlations over the penalties, for the 1st dimension
plot(cv1, dim=1)
fn <- gsub("\\.bed", "",
system.file("extdata", "data_chr1.bed", package="flashpcaR"))
fn
f2 <- scca(fn, Y, standx="binom2", standy="sd", lambda1=1e-2, lambda2=1e-3)
diag(cor(f2$Px, f2$Py))
# Cross-validation isn't yet supported for PLINK data
See the HapMap3 directory
See CHANGELOG.txt