AEP

Maximum Likelihood for Gaussian Process Classifiers under Case-Control Sampling

This repository contains the code for the methods AEP (ascertained expectation propagation) and APL (ascertained pairwise likelihood) for fitting Gaussian processes (or generalized linear mixed models) under case-control ascertainment. The methods are described in Weissbrod, Kaufman, Golan and Rosset 2019 JMLR.

Parts of the code are loosely based on code translated from the GPML toolbox.



Installation

The code is designed for Python >=3.6 and requires the following freely available Python packages:

We recommend running the code using the Anaconda Python distribution. In Anaconda, you can install all the Python packages with the command "conda install <package_name>". Alternatively, the Python packages can be installed with the command "pip install --user <package_name>".

Once all the prerequisite packages are installed, you can install the code on a git-enabled machine by typing:

git clone https://github.com/omerwe/AEP
cd AEP
python setup.py build_ext --inplace

The last command compiles some Cython code. Please make sure that it completes without any error messages.


Use instructions

You can invoke the code and run an experiment using the file aep.py. This script will simulate ascertained case-control data and will then estimate the model parameters using one of the methods described in the paper (see details below). Here's a tl;dr example:

python aep.py --r 100 --n 500 --frac_cases 0.5 --prev 0.01 --kernel linear --e_dist normal --m 1000 --h2 0.25 --method aep_parallel 

This will generate r=100 datasets, each consisting of n=500 individuals, 50% of whom are cases and 50% controls, assuming that the prevalence of cases in the general population is prev=1%. Each dataset will include m=1000 features (e.g. genetic variants) that determine the latent variable g via a linear kernel, with a normally distributed residual (i.e. a probit likelihood). The latent variable g will explain 25% of the variance of the underlying liability. The code will try to estimate this variance (i.e. the scale parameter of the linear kernel) using the method aep_parallel. The first few lines of output should look like:

1 mean sig2g-hat: 0.2000 (0.0000)
2 mean sig2g-hat: 0.2000 (0.0000)
3 mean sig2g-hat: 0.2167 (0.0236)
4 mean sig2g-hat: 0.2250 (0.0250)
5 mean sig2g-hat: 0.2500 (0.0548)
6 mean sig2g-hat: 0.2583 (0.0534)

These lines show the running empirical average and standard deviation of the scale parameter estimates, updated after each experiment completes. For example, after running six experiments the mean estimate was 0.2583 with stdev=0.0534.
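
For intuition, here is a minimal, self-contained sketch of the kind of liability-threshold simulation with case-control ascertainment described above. This is illustrative only and is not the repository's simulation code; the function name, oversampling factor and default arguments are arbitrary choices:

import numpy as np
from scipy.stats import norm

def simulate_ascertained_dataset(n=500, m=1000, h2=0.25, prev=0.01, frac_cases=0.5, seed=0):
    # One ascertained case-control dataset under a liability-threshold (probit) model
    rng = np.random.default_rng(seed)
    t = norm.isf(prev)                                # liability threshold implied by the prevalence
    pool = int(2 * n * frac_cases / prev)             # oversample the population so enough cases appear
    X = rng.standard_normal((pool, m))                # features (e.g. genetic variants); memory-naive, fine for a sketch
    beta = rng.standard_normal(m) * np.sqrt(h2 / m)   # linear-kernel effects, so var(g) is roughly h2
    g = X @ beta                                      # latent variable
    liability = g + rng.standard_normal(pool) * np.sqrt(1.0 - h2)   # normal residual (probit likelihood)
    y = (liability > t).astype(int)                   # 1 = case, 0 = control
    n_cases = int(n * frac_cases)
    keep = np.concatenate([np.flatnonzero(y == 1)[:n_cases],        # ascertain: keep frac_cases cases
                           np.flatnonzero(y == 0)[:n - n_cases]])   # and fill the rest with controls
    return X[keep], y[keep]

Because cases are heavily over-represented in such a sample relative to their population prevalence, naively maximizing the standard GP likelihood on it gives biased estimates of the scale parameter; correcting for this ascertainment is exactly what AEP and APL are designed for.
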


Script arguments

You can see all the arguments of aep.py by typing python aep.py --help. Here we describe the most important arguments:

--method: The method used to estimate the parameters. The supported methods are ep (expectation propagation), aep (ascertained expectation propagation), aep_parallel (ascertained expectation propagation with parallel updates), apl (ascertained pairwise likelihood), pcgc (phenotype-correlation-genotype-correlation), and reml (restricted maximum likelihood, using the correction of Lee et al. 2011 AJHG). aep_parallel is a modified version of aep that only updates the posterior covariance matrix Sigma after updating all site parameters (as described in Cseke and Heskes 2011 JMLR); a schematic sketch of this update order is given at the end of this section. We found that this parallel updating scheme is not only faster but also often leads to better convergence in practice.

--h2: The proportion of liability variance explained by the latent variable g (i.e. heritability in the context of genetics) (default h2=0.25)

--n: Simulated sample size (default n=500)

--m: Number of features used to determine the latent variable g (default m=500)

--frac_cases: The proportion of case individuals in the sample (default frac_cases=0.5)

--prev: The prevalence of cases in the general population (default prev=0.01)

--kernel: The kernel used to build the covariance matrix. Right now the code natively supports only linear and rbf. However, the file kernels.py contains the building blocks for a large number of additional kernels that can be used in principle (with code adapted from the GPML package) (default kernel=linear)

--e_dist: The distribution of the residual variable e (i.e. the type of the likelihood function, or the inverse-link function). The code currently supports the options normal (probit likelihood) and logistic (logistic likelihood). When using the logistic likelihood, please normalize the variances by also using the flag --scale_g, so that the liability variance is 1.0 as in the probit likelihood (default e_dist=normal, i.e. a probit likelihood).
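
For readers unfamiliar with the parallel update scheme mentioned under --method, the following is a schematic sketch of parallel-update EP for a plain (non-ascertained) probit GP classifier, using the standard GPML-style moment-matching formulas. It is illustrative only: it omits the ascertainment correction that AEP adds, is not the repository's implementation, and the function name, iteration count and jitter are arbitrary choices.

import numpy as np
from scipy.stats import norm

def parallel_ep_probit(K, y, n_iter=30, jitter=1e-8):
    # Schematic parallel-update EP for a probit GP classifier (no ascertainment correction).
    # K: (n, n) prior kernel matrix; y: labels in {-1, +1} (convert 0/1 labels via 2*y - 1).
    n = len(y)
    tau_site = np.zeros(n)            # site precisions
    nu_site = np.zeros(n)             # site precision-times-mean parameters
    mu, Sigma = np.zeros(n), K.copy()
    Kinv = np.linalg.inv(K + jitter * np.eye(n))
    for _ in range(n_iter):
        # 1) cavity distributions for all sites, computed from the current posterior
        sigma2 = np.diag(Sigma)
        tau_cav = 1.0 / sigma2 - tau_site
        nu_cav = mu / sigma2 - nu_site
        cav_var, cav_mean = 1.0 / tau_cav, nu_cav / tau_cav
        # 2) moment-match every site against its cavity (probit likelihood)
        z = y * cav_mean / np.sqrt(1.0 + cav_var)
        r = norm.pdf(z) / norm.cdf(z)
        mu_hat = cav_mean + y * cav_var * r / np.sqrt(1.0 + cav_var)
        var_hat = cav_var - cav_var ** 2 * r * (z + r) / (1.0 + cav_var)
        tau_site = 1.0 / var_hat - tau_cav     # real implementations guard against negative precisions
        nu_site = mu_hat / var_hat - nu_cav
        # 3) only now refresh the global posterior (the "parallel" step)
        Sigma = np.linalg.inv(Kinv + np.diag(tau_site))
        mu = Sigma @ nu_site
    return mu, Sigma

Sequential EP would instead refresh mu and Sigma after every single site update; recomputing them once per sweep is what makes the scheme parallel and, as noted above, often faster and better-behaved in practice.
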
