Improving the performance of multivariate normal models in dirichletprocess
dirichletprocess is a package for fitting nonparametric Bayesian models. It is written entirely in R and is designed to be easy to adapt and modify when building nonparametric models. You can drop in dirichletprocess model objects as part of your model without having to worry about the underlying inference routines or algorithms.
The current multivariate normal sampling in the dirichletprocess package is slow and does not scale to high dimensions or large data sets.
This limits how the package can be used and restricts its overall adoption. Improving the performance of multivariate normal type models would significantly increase the number of applications and the scale of the tasks it can accomplish. This will help people use Dirichlet process type models without having to implement their own sampling schemes, so they can instead focus on the modelling work they are most interested in.
Dean Markwick wrote this package as part of his PhD and it has also received multiple contributions from different authors over the years. It has received some notable citations from R Koenker and Chris Holmes.
Typical Dirichlet process packages require you to use their specified models and lack the flexibility to construct custom models. They also use C++ and can be tricky to build upon, requiring more background knowledge.
The mclust package implements the EM algorithm with diagonal covariance matrices under several kinds of constraints (but only with classical model selection, no Dirichlet process prior). M-step update rules for constrained models are given in https://hal.inria.fr/inria-00074643
- Try changing the imported package from mvtnorm to mc2d (the function rmultinormal provides random normal generation vectorized on mean/covariance parameters, so it may be quicker). This requires making sure no functionality is broken after making the change and also benchmarking the difference to ensure there is an increase in performance (decrease in computation time and/or memory usage). Detailed benchmarking could also highlight other areas of potential improvement.
The end result will provide:
a. Robust benchmarking scripts.
b. New methods that replace the mvtnorm package with the mc2d package.
c. Sufficient testing and checking that nothing has broken in the changeover.
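A first benchmarking script for this item could look something like the sketch below. It is only a starting point under stated assumptions: `time_median` is a hypothetical helper (not part of any package), the dimension and sample size are arbitrary, and `rmultinormal` is called with the covariance matrix flattened to a vector, following the usage shown in the mc2d documentation. The comparison only runs if both packages are installed.

```r
# Hypothetical helper: run `f` `reps` times and return the median
# elapsed wall-clock time in seconds.
time_median <- function(f, reps = 5) {
  times <- vapply(seq_len(reps), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  median(times)
}

# Arbitrary problem size chosen for illustration only.
d <- 5
n <- 10000
mu <- rep(0, d)
sigma <- diag(d)

# Only run the comparison when both packages are available.
if (requireNamespace("mvtnorm", quietly = TRUE) &&
    requireNamespace("mc2d", quietly = TRUE)) {
  results <- c(
    mvtnorm = time_median(function() {
      mvtnorm::rmvnorm(n, mean = mu, sigma = sigma)
    }),
    mc2d = time_median(function() {
      # Covariance passed as a flattened vector (assumption based on
      # the mc2d documentation examples).
      mc2d::rmultinormal(n, mean = mu, sigma = as.vector(sigma))
    })
  )
  print(results)
}
```

A fuller benchmark would repeat this over a grid of dimensions and sample sizes, and would also time a complete model fit rather than only the random number generation.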
- Using a constrained covariance matrix. This introduces a new class of models for the multivariate normal distribution to allow for a constrained covariance matrix. This will reduce the number of free parameters and speed up model fitting.
The end result will provide:
a. New class of mixture models.
b. Tests to ensure the functionality is correct.
c. Documentation and extension to the vignette detailing the new models.
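To illustrate why a constrained covariance matrix helps, the sketch below counts free covariance parameters under three structures and draws samples for the diagonal case without any matrix decomposition. The function names `count_cov_params` and `rmvnorm_diag`, and the structure labels "full", "diagonal" and "spherical", are hypothetical choices made for this example; they are not part of dirichletprocess.

```r
# Hypothetical helper: number of free parameters in a d x d covariance
# matrix under different constraints.
count_cov_params <- function(d, structure = c("full", "diagonal", "spherical")) {
  structure <- match.arg(structure)
  switch(structure,
    full = d * (d + 1) / 2,  # symmetric matrix: upper triangle incl. diagonal
    diagonal = d,            # one variance per dimension
    spherical = 1            # a single shared variance
  )
}

# Hypothetical helper: sample from a multivariate normal with diagonal
# covariance. Each column is an independent univariate normal, so no
# Cholesky or eigen decomposition is needed.
rmvnorm_diag <- function(n, mean, sds) {
  d <- length(mean)
  z <- matrix(rnorm(n * d), nrow = n, ncol = d)
  z <- sweep(z, 2, sds, `*`)   # scale column j by sds[j]
  sweep(z, 2, mean, `+`)       # shift column j by mean[j]
}

# Free parameters per structure in 10 dimensions.
sapply(c("full", "diagonal", "spherical"),
       function(s) count_cov_params(10, s))

# 1000 draws in 2 dimensions with standard deviations 1 and 2.
x <- rmvnorm_diag(1000, mean = c(0, 5), sds = c(1, 2))
dim(x)
```

In 10 dimensions the full structure has 55 free covariance parameters per cluster against 10 for the diagonal one, which is where the reduction in fitting cost would come from.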
Faster model fitting will help improve the reach of the package and make it a viable option for larger scale problems. Better performance also helps the environment by reducing the number of CPU cycles needed.
- EVALUATING MENTOR: Dean Markwick dean.markwick@talk21.com. Finance quant, experienced in Bayesian statistics.
- Other Dev: Toby Hocking toby.hocking@r-project.org
Contributors, please do one or more of the following tests before contacting the mentors above.
Easy
Download the package and fit the normal mixture model to the faithful data set.
Fit the multivariate normal model on the iris or palmerpenguins dataset.
Plot the resulting distribution for both models.
Medium
Generate some random data from a lognormal distribution mixture model.
Fit a Dirichlet process type model to this simulated data.
Sample new data from the posterior of the final model and summarise the 5%/95% quantiles of the simulated data.
Explore how the prior distribution on the alpha parameter affects the number of clusters.
Plot the alpha parameter chains after using different prior distributions to assess how long the model takes to converge.
Hard
Write the MixingDistribution objects and methods to implement your own custom mixture model.
Fit this new mixture model to some new data (real or simulated) and plot the resulting posterior distribution.
Write out the necessary equations underlying the mixture distribution to be included in the vignette.
Contributors, please post a link to your test results here.
- EXAMPLE CONTRIBUTOR 1 NAME, LINK TO GITHUB PROFILE, LINK TO TEST RESULTS.