vignette.Rmd

---
title: "MR-Hevo"
author: Paul McKeigue
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{MR-Hevo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r opts, include = FALSE, eval = TRUE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  )

# devtools::check(vignettes=FALSE)
# devtools::document()
# devtools::build(vignettes=FALSE)
# R CMD INSTALL --library=~/R/x86_64-pc-linux-gnu-library/3.6 ./mrhevo_0.0.0.9000.tar.gz

```

## Introduction
This vignette demonstrates an analysis using `MRHevo` to overcome some of the limitations of current methods for instrumental variable analysis with genetic instruments ("Mendelian randomization"). 

* Multiple variants are used to construct each unlinked instrument, rather than selecting a single SNP from each genomic region that is associated with the exposure. 

* There is no need to exclude weak instruments because they are correctly handled by Bayesian inference. 

* Inference of causality in the presence of pleiotropic effects is based on marginalizing over the distribution of pleiotropic effects to compute the likelihood as a function of the causal effect parameter, rather than on constructing "estimators" that downweight or exclude those instruments that appear to be outliers.  The statistical model is described on [this page](https://github.com/molepi-precmed/mrhevo/blob/main/theorymethods.pdf. 

Where multiple SNPs are used to calculate each scalar instrument from summary statistics for the genotype-exposure association, an individual-level dataset with genotypes and outcome is required in step 2 to calculate summary statistics for the effects of these instrument on the outcome.  The model-fitting can use either summary statistics (fast) or individual-level data (slow if the dataset is large).  

## Input dataset

The input dataset should comprise: 

* `Y` a vector of length `N` containing outcomes (in this example type 2 diabetes as a binary outcome

* `Z` an `N x J` data.table of unlinked genetic instruments.  

* `X_u` an `N x U` data.table of unpenalized covariates (this will usually include scores on the first few principal components of the genotype matrix

* `Z.metadata` a data.table with `J` rows, one for each genetic instrument.  Estimated coefficients for the effects of instruments `Z` on the exposure, and the corresponding standard errors of these estimates are in columns named `alpha_hat` and `se_alpha_hat` respectively.  The column `scoreid` should contain the names of the columns of `Z`. This table can include other metadata such as the genomic position of each genotypic instrument, or the names of nearby genes.  

In this example the genetic instruments consist of 12 _trans_-pQTLs for the levels of adiponectin, encoded by the gene _ADIPOQ_ in whole blood.  There are also two _cis_-pQTLs that are not used in the instrumental variable analysis, because _cis_-acting variants in or near this gene are known to affect the splicing of the transcript.  Summary statistics for these pQTLs were derived from the [DeCODE proteomics study](https://pubmed.ncbi.nlm.nih.gov/34857953/) which used the Somalogic platform

The _trans_-pQTLs are far enough apart to be modelled as independent.  The target dataset is the UK Biobank cohort, and the outcome is type 2 diabetes.  The unpenalized covariates are sex and the first five principal components of the genotype matrix.  


```{r setup, eval=TRUE}

library(data.table)
library(ggplot2)
library(rstan)
require(knitr)

source("functions.mrhevo.R")

load("data_mrhevo.RData")

if(length(unique(Y)) == 2) {
  logistic <- TRUE
} else {
  logistic <- FALSE
}

```

As a first step, we compute summary statistics for the effect of each instrument on the outcome, and the ratio of the instrument-outcome coefficient `beta_YZ` to the instrument-exposure coefficient `alpha_hat`.  An approximate standard error for the ratio `theta_IV`, defined as `beta_YZ / alpha_hat`, is calculated by the delta method (a first-order Taylor expansion).  

## Summary statistics

```{r summarystats, include=TRUE, eval=TRUE}

coeffs.scores <- get_summarystatsforMR(Y, Z, X_u)

coeffs.scores <- Z.metadata[coeffs.scores, on="scoreid"]
coeffs.scores[, thetaIV := beta_YZ / alpha_hat]  # ratio estimates
coeffs.scores[, se.thetaIV := SE_YZ / alpha_hat] 
coeffs.scores[, se.thetaIV_delta := sqrt((SE_YZ / alpha_hat)^2 +
                                     beta_YZ^2 * se.alpha_hat^2 / alpha_hat^4)] # delta method

knitr::kable(coeffs.scores[, .(qtlname, qtl_type, alpha_hat, se.alpha_hat,
                               beta_YZ, SE_YZ, thetaIV, z, p)])
```

From the summary statistics (excluding the _cis_-eQTL) we can obtain three widely-used estimators of the causal effect parameter: the inverse-variance weighted mean, the weighted median, and the penalized weighted median.  As expected, the weighted median and the penalized weighted median downweight the outlier and give less extreme estimates of the causal effect.

## Estimators of the causal effect based on summary statistics

```{r estimators, eval=TRUE}

estimators.dt <- get_estimatorsMR(coeffs.scores[qtl_type=="trans"])[, 1:5] 

knitr::kable(estimators.dt)

```

A plot of `beta_YZ` against `alpha_hat` shows that 11 of the 12 _trans_-pQTL instruments are inversely associated with type 2 diabetes.  The _cis- pQTLs have large effects on circulating levels of the protein, but no effect on type 2 diabetes; this is this is explicable, as the _cis_- acting SNPs alter the splicing of _ADIPOQ_. Of the _trans_-pQTLs, the _APOC1_ locus is an outlier, with a more extreme ratio of `beta_YZ` to `alpha_hat` than the others. 

```{r plotcoeffs, include=TRUE, eval=TRUE, fig.width=7, fig.asp=1}

## size of points inversely proportional to standard error 
coeffs.scores[, size.thetaIV := 0.5 * sum(se.thetaIV_delta) / se.thetaIV_delta] 

gene.query <- "ADIPOQ"
long.phenoname <- "type 2 diabetes"
p.coeffs <- ggplot(coeffs.scores,
                                    aes(x=alpha_hat, y=beta_YZ, color=qtl_type)) + 
      geom_point(aes(size=size.thetaIV), alpha=0.8) +
      scale_size(guide="none") + 
    ggrepel::geom_text_repel(aes(label=qtlname), force=2, size=4, fontface="italic", color="blue") + 
    scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.1))) + 
    scale_y_continuous(expand =  expansion(mult = c(0.1, 0.1))) + 
    geom_abline(data=estimators.dt,
                aes(slope=Estimate, intercept=rep(0, 3),
                    linetype=Estimator)) +
    theme(legend.position="top") +
    theme(legend.direction="vertical") +
    theme(legend.box="horizontal") +
    guides(color=guide_legend(title="QTL type")) +
    xlab(paste("Effect of genetic instrument on ", gene.query)) +
    ylab(paste("Effect on", long.phenoname)) +
  coord_fixed(ratio=2)
save(p.coeffs, file="p.coeffs.ADIPOQ.RData")
p.coeffs

```


## Sample the posterior density

The function `run_mrhevo.sstats` will call `Stan` to sample the posterior distribution of all parameters, given the model and the data consisting of summary statistics.  It returns an object of class `stanfit`

As recommended by [Piironen and Vehtari (2017)](https://doi.org/10.1214/17-EJS1337SI), we supply a prior guess of the fraction of instruments that have nonzero pleiotropic effects.  With a large enough sample size, this prior guess will not make much difference to the results. For this example, we guess the fraction to be 0.2. This guess is used to set the scale of the prior on the global shrinkage parameter.  

The causal effect parameter `theta` is assigned a Gaussian prior with mean zero and standard deviation `priorsd.theta`.  As we will divide the posterior by the prior to obtain the likelihood as a function of `theta`, the setting of this prior should not make much difference but the sampler will be more efficient if the prior standard deviation is set to a value that is supported by the data. 

To speed up computation in this large dataset, the statistical model is fitted to summary statistics rather than to individual-level data.  

```{r sampling, eval=TRUE}

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

use.sampling <- TRUE
logistic <- TRUE
priorsd_theta <- 1
newrun <- FALSE

which.trans <- colnames(Z) %in% Z.metadata[qtl_type=="trans", scoreid]
J <- length(which.trans)
N <- nrow(Z)

gamma_hat <- coeffs.scores[qtl_type=="trans", beta_YZ]
se.gamma_hat <- coeffs.scores[qtl_type=="trans", SE_YZ]
alpha_hat <- coeffs.scores[qtl_type=="trans", alpha_hat]
se.alpha_hat <- coeffs.scores[qtl_type=="trans", se.alpha_hat]

hevo.stanfit <- run_mrhevo.sstats(alpha_hat, se.alpha_hat,
                                  gamma_hat, se.gamma_hat,
                                  fraction_pleio=0.2,
                                  priorsd_theta=priorsd_theta)
```

## Maximum likelihood estimate and _p_-value
From the posterior samples, we compute the likelihood as a function of `theta`.  We fit a quadratic approximation to the log-likelihood, and use this to obtain a classical maximum likelihood and _p_-value.  

A plot of the posterior density of `theta` shows that it is approximately Gaussian, and a plot of the log-likelihood shows that it can be approximated by a quadratic function.  This guarantees that the value of `theta` that maximizes the likelihood will have the sampling properties -- consistency, unbiasedness, and minimum variance -- that are desirable for an "estimator". 

```{r mle, eval=TRUE, fig.width=5, fig.asp=1.2}

theta.samples <-  unlist(extract(hevo.stanfit, pars="theta"))
prior.theta <- dnorm(theta.samples, mean=0, sd=priorsd_theta)
mle.theta <- mle.se.pval(x=theta.samples, prior=prior.theta)
mle.theta[, Estimator := "Marginalize over direct effects"]

p.bayesianloglik <-  mle.se.pval(x=theta.samples, prior=prior.theta, return.asplot=TRUE)
p.bayesianloglik

```

## Comparison with other estimates

Marginalizing over the distribution of unobserved pleiotropic effects is more conservative than the standard methods for inferring the causal effect parameter, at least in this example.  


```{r methodscomparison, eval=TRUE}

methods.dt <- rbind(estimators.dt, mle.theta, fill=TRUE)
knitr::kable(methods.dt[, .(Estimator, Estimate, SE, z, pvalue)])

```

## Shrinkage coefficients
For each instrument, we can compute from the model parameters a shrinkage coefficent `kappa` that takes values between 0 (no shrinkage of the pleiotropic effect) and 1 (complete shrinkage of the effect to zero).  The posterior distribution of the `J` shrinkage coefficients has the horseshoe shape specified by the prior. 


```{r kappa, eval=TRUE, fig.width=5, fig.asp=1}

kappa.all <- extract(hevo.stanfit, pars="kappa")$kappa
kappa.dt <- data.table(kappa=as.numeric(kappa.all))

p.kappa <- ggplot(kappa.dt, aes(x=kappa)) +
  geom_histogram(aes(y=after_stat(density)),
                 breaks=seq(0, 1, by=0.05), color="black", fill="gray") +
  labs(x = expression(paste("Shrinkage coefficient ", kappa, " imposed on direct effects of instruments")),
       y= "Posterior density")
p.kappa

```

## Coefficients for pleiotropic effects
Summaries of the posterior distribution show that the pleiotropic effect `beta[11]` which represents the _APOC1_ locus has escaped shrinkage, but the pleiotropic effects of the other four instruments have been shrunk to near zero.  

```{r summary.params, eval=TRUE}

if(use.sampling) {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff", "lp__")
} else {
  pars.forsummary <- c("theta", "beta", "log_c", "log_tau", "m_eff")
}

fit.coeffs <- summary(hevo.stanfit,
                      pars=pars.forsummary, 
                      probs=c(0.1, 0.9))$summary 
fit.coeffs <- as.data.table(round(fit.coeffs, 2), keep.rownames="variable")

knitr::kable(fit.coeffs)

```

## Sampler diagnostics
A pairs plot is useful to diagnose problems with the sampler.  

```{r diagnostics, eval=TRUE, fig.width=6, fig.asp=1}

if(use.sampling) {
  pairs(hevo.stanfit, pars=c("theta", 
                     "log_c", "log_tau", "m_eff",
                     "lp__"))
}

```