-
Notifications
You must be signed in to change notification settings - Fork 0
Home
Once the package has been installed with pip
, there are two command line entry points, indirect-GWAS
and a helper function, compute-feature-partial-covariance
.
Both of these commands have documentation available using the -h/--help
flag (e.g. indirect-GWAS -h
).
This is the main command. It requires the user to provide three key pieces of information:
- Projection coefficients
- GWAS summary statistics for the feature phenotypes
- Feature phenotype partial covariance matrix
The argument should refer to a delimited file (e.g. .csv
, .csv.gz
, .tsv
, etc.).
The rows/columns of this file should refer to features/projections.
If --no-index
is not included, the first row/column should be the names of the features/projections.
For example, the following is a valid file:
feature,projection_1,projection_2
feature_1,1,1
feature_2,2,-1
This argument takes one or more files giving GWAS summary statistics on the feature phenotypes. The names of these files should correspond to the feature names. These files must contain, at minimum, the following columns: a variant identifier, the coefficient estimates, and the standard errors. This argument must be paired with additional arguments that specify the relevant column names.
The GWAS summary statistics must also provide the degrees of freedom for these regressions. A few different options are available:
-
A constant value for every GWAS regression (only use this if the degrees of freedom for every feature x variant is the same) (
--degrees-of-freedom
) -
A column in the summary statistics files giving the degrees of freedom (
--degrees-of-freedom-column
) -
Through sample size (N) and number of covariates (K)
$\rightarrow$ (DF = N - K - 1).a. Sample size is either a constant for all (
--equal-sample-size
) or a column in the summary statistic files (--sample-size-column
)b. Number of covariates is either a constant for all (
--equal-number-of-covariates
) or a column in the summary statistic files (--number-of-covariates-column
)
For example, given the following file:
variant_id,beta,se,n_samples,n_covar
rs123456,2.1,1.0,10000,5
...
the following arguments would be appropriate
--variant-id-column variant_id
--coefficient-column beta
--standard-error-column se
--sample-size-column n_samples
--number-of-covariates-column n_covar
The covariance matrix of phenotypes that have been adjusted for covariates (features x features)
To remove the effects of covariates, this requires you to regress each feature phenotype on the covariates (phenotype ~ covariates), compute the predicted phenotypes given covariates (
The following is a valid file:
feature_id,feature_1,feature_2
feature_1,1.0,0.55
feature_2,0.55,0.999
This can be computed easily if individual-level data are available and all regressions are linear models using compute-feature-partial-covariance
or any statistical software.
For linear mixed models, use residuals if provided, or refer to the documentation/outputs of the specific method.
See information about how this works with SAIGE.
This is a simple helper script to compute the phenotype partial covariance matrix when individual-level data are available.
The phenotypes and covariates files can be specified using --pheno
and --covar
(optional).
If --covar
is not specified, it assumes the covariates are in the --pheno
file.
Sample IDs must be specified using --sample-id-column
, which defaults to #FID, IID
.
Names of phenotypes and covariates can be specified with --pheno-name
and --covar-name
.
When not provided, these default to all columns other than the sample IDs.
Note that either --covar
or --covar-name
must be specified.
An intercept can be added to the covariates with --add-intercept
.
This script has the same options as indirect-GWAS
, like --separator
, --float-format
, and --output
.
Overall, this package can be used in two simple steps:
- Load the data into an xarray (using one of the three
from_*
functions or manually) - Compute the indirect GWAS summary statistics using
indirect_GWAS.gwas_indirect
For example:
import indirect_GWAS
# Load data into Python
gwas_coef_df = pd.read_csv("gwas_coef.csv") # variants x features
gwas_stderr_df = pd.read_csv("gwas_stderr.csv") # variants x features
gwas_sample_size_df = pd.read_csv("gwas_sample_size.csv") # variants x features
feature_partial_cov = pd.read_csv("partial_cov.csv") # features x features
projection_coefficients = pd.DataFrame(
[[1], [-1], [5], [0.5]], index=gwas_coef_df.columns) # features x projections
# Load data into xarray
ds = indirect_GWAS.from_summary_statistics(
projection_coefficients=projection_coefficients,
feature_partial_covariance=feature_partial_cov,
feature_gwas_coefficients=gwas_coef_df,
feature_gwas_standard_error=gwas_stderr_df,
feature_gwas_sample_size=feature_gwas_sample_size,
n_covar=10, # The number included in feature GWAS (including intercept)
)
# Compute indirect GWAS summary statistics
beta, se, t_stat, p = indirect_GWAS.gwas_indirect(ds)
Indirect GWAS can be used with SAIGE only when the covariance of the Step 1 residuals is known or can be computed. See the SAIGE documentation for more information about this.
In short, if you run step1_fitNULLGLMM.R
, SAIGE produces an RData file with the suffix .rda
.
This file contains a variable, modglmm$residuals
, that contains the residuals needed to compute the phenotypic partial covariance matrix.
Here is a short snippet of R code that shows how you could compute the phenotypic partial covariance matrix given a directory of SAIGE step 1 output files.
library(tidyverse)
rda_files <- list.files(".", "*.rda", full.names = T)
full_df <- data.frame()
for (file in rda_files) {
env <- new.env()
load(file, envir = env)
full_df <- bind_rows(
full_df,
data.frame(IID = env$modglmm$sampleID, resid = env$modglmm$residuals,
pheno = str_replace(file, ".rda", ""))
)
}
wide_resid_df <- full_df %>%
pivot_wider(id_cols = IID, names_from = pheno, values_from = resid)
feature_partial_covariance_matrix <- wide_resid_df %>% cov()
For simplicity we'll use the mtcars
dataset here.
Suppose that we are interested in the projected phenotype hp - mpg
, with vs
as the genotype variable and gear
as an adjusted covariate.
The direct regression equivalent of this is the following:
import statsmodels.api as sm
# Load data
mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
# Perform the direct regression
projection = mtcars["hp"] - mtcars["mpg"]
X = mtcars[["vs", "gear"]].assign(const=1.)
direct_reg = sm.OLS(projection, X).fit()
# Print summary information
direct_summary = pd.DataFrame({
"coef": direct_reg.params, "se": direct_reg.bse,
"t_stat": direct_reg.tvalues, "p": direct_reg.pvalues
}).loc["vs"]
Returns
vs | |
---|---|
coef | -106.103 |
se | 18.8541 |
t_stat | -5.62757 |
p | 4.44445e-06 |
The indirect version of this regression uses only hp
and mpg
summary information.
import indirect_GWAS
# Perform feature regressions first
feature_regs = {y: sm.OLS(mtcars[y], X).fit() for y in ["hp", "mpg"]}
feature_coef = pd.DataFrame({y: [reg.params["vs"]] for y, reg in regs.items()})
feature_stderr = pd.DataFrame({y: [reg.bse["vs"]] for y, reg in regs.items()})
# Partial covariance is after residualizing just the covariates
feature_partial_cov = pd.DataFrame({
y: sm.OLS(mtcars[y], X[["gear", "const"]]).fit().resid
for y, reg in regs.items()
}).cov()
# Setup the indirect GWAS dataset
ds = indirect_GWAS.from_summary_statistics(
projection_coefficients=pd.DataFrame([[1], [-1]], index=["hp", "mpg"]),
feature_partial_covariance=feature_partial_cov,
feature_gwas_coefficients=feature_coef,
feature_gwas_standard_error=feature_stderr,
feature_gwas_sample_size=mtcars.shape[0],
n_covar=2, # (includes intercept)
)
# Perform indirect GWAS on the projection
beta, se, t_stat, p = indirect_GWAS.gwas_indirect(ds)
# Print summary information
indirect_summary = pd.Series({
"coef": beta.item(), "se": se.item(), "t_stat": t_stat.item(), "p": p.item()})
Returns
vs | |
---|---|
coef | -106.103 |
se | 18.8541 |
t_stat | -5.62757 |
p | 4.44445e-06 |
These results are identical to 12 decimal places.
assert (direct_summary - indirect_summary).abs().max() < 1e-12