Road map

February 2020; updated July 2023

If you are interested in contributing to MLJ, please see the contributing guidelines.

Goals

Usability, interoperability, extensibility, reproducibility, and code transparency.
Offer state-of-the-art tools for model composition and model optimization (hyper-parameter tuning)
Avoid common pain-points of other frameworks with MLJ:
- identify and list all models that solve a given task
- easily perform routine operations requiring a lot of code
- easily transform data, from source to algorithm-specific data format
- make use of probabilistic predictions: no more inconsistent representations / lack of options for performance evaluation
Add some focus to Julia machine learning software development more generally

Priorities

Priorities are somewhat fluid, depending on funding offers and available talent. Rough priorities for the core development team at present are marked with † below. However, we are always keen to review external contributions in any area.

Future enhancements

The following road map is more big-picture; see also this GitHub Project.

Adding models

Integrate deep learning using Flux.jl. Done, but the experience can be improved by:
- finishing the iterative model wrapper #139
- improving performance by implementing a data front-end, after MLJBase #501; but see also this relevant discussion.
Probabilistic programming: Turing.jl, Gen, Soss.jl (#157, discourse thread). Done, but experimental, and requires:
- extension of probabilistic scoring functions to "distributions" that can only be sampled.
Feature engineering (Python featuretools?; recursive feature elimination ✓ done in FeatureSelection.jl) #426, MLJModels #314
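
For the scoring extension mentioned above (probabilistic scores for "distributions" that can only be sampled), one possible approach is to estimate a proper score from draws alone. The following is a minimal Python sketch, not MLJ code: it uses the standard sample-based estimator of the continuous ranked probability score (CRPS), and the function name crps_from_samples is hypothetical.

```python
def crps_from_samples(samples, y):
    """Estimate the CRPS of a predictive distribution from draws only.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X'
    are independent draws from the prediction and y is the observed value.
    Lower is better. Hypothetical helper, not part of any MLJ package.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Average distance between the draws and the observation.
    term_obs = sum(abs(x - y) for x in samples) / n
    # Average distance between all pairs of draws (O(n^2), fine for a sketch).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


if __name__ == "__main__":
    draws = [0.0, 2.0]
    print(crps_from_samples(draws, 1.0))
```

With a single draw the estimate reduces to the absolute error, which is one reason such a score fits naturally alongside existing deterministic measures.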

Enhancing core functionality

Iterative model control #139. Done.
† Add more tuning strategies. See here for complete wish-list. Particular focus on:
- random search #37 (done)
- Latin hypercube (done)
- Bayesian methods, starting with Gaussian Process methods a la PyMC3. Some preliminary research done.
- POC for AD-powered gradient descent #74
- Tuning with adaptive resource allocation, as in Hyperband. This might be implemented elegantly with the help of the recent IterativeModel wrapper, which applies, in particular, to TunedModel instances; see here.
- Genetic algorithms #38
- Particle Swarm Optimization (current WIP, GSoC project @lhnguyen-vn)
- tuning strategies for non-Cartesian spaces of models (MLJTuning #18), architecture search, and other AutoML workflows
Systematic benchmarking, probably modeled on MLaut #69
Give EnsembleModel a more extensible API, extend it beyond bagging (boosting, etc.), and possibly migrate it to a separate repository #363
† Enhance complex model composition:
- Introduce a canned stacking model wrapper (POC). WIP @olivierlabayle
- Get rid of macros for creating pipelines and possibly implement target transforms as wrappers (MLJBase #594) WIP @CameronBieganek and @ablaom
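
To illustrate the adaptive resource allocation idea listed among the tuning strategies above, here is a minimal Python sketch of successive halving, the subroutine that Hyperband runs repeatedly. It is not MLJ code and not full Hyperband; the names successive_halving, evaluate, min_budget and eta are illustrative assumptions. In an iterative-model setting, the budget would correspond to something like the number of training iterations.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Return the best candidate and a log of (n_candidates, budget) rounds.

    `evaluate(config, budget)` is assumed to return a loss (lower is better)
    after training `config` with the given budget. Each round keeps the best
    1/eta of the candidates and multiplies the budget by eta.
    """
    survivors = list(configs)
    if not survivors:
        raise ValueError("need at least one candidate")
    budget = min_budget
    history = []
    while len(survivors) > 1:
        history.append((len(survivors), budget))
        # Rank the current candidates at the current (cheap) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Keep the top fraction, but always at least one candidate.
        survivors = ranked[:max(1, len(survivors) // eta)]
        # Spend more resources on the candidates that remain.
        budget *= eta
    return survivors[0], history


if __name__ == "__main__":
    # Toy loss: candidates closer to 0.5 are better; more budget lowers the loss.
    def toy_loss(c, budget):
        return abs(c - 0.5) + 1.0 / budget

    candidates = [0.1, 0.5, 0.9, 0.3, 0.7, 0.45, 0.2, 0.8, 0.6]
    best, rounds = successive_halving(candidates, toy_loss)
    print(best, rounds)
```

Most of the total budget goes to the few candidates that survive early rounds, which is the property that makes this family of strategies attractive for iterative models.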

Broadening scope

Integrate causal and counterfactual methods, for example in applications to FAIRness; see this proposal
Explore the possibility of closer integration of Interpretable Machine Learning approaches, such as Shapley values and LIME; see Shapley.jl, ShapML.jl, ShapleyValues.jl, Shapley.jl (older) and this proposal
Spin off a stand-alone measures (loss functions) package (currently here). Introduce measures for multi-targets MLJBase #502.
Add sparse data support and better support for NLP models; we could use NaiveBayes.jl as a POC (currently wrapped only for dense input) but the API needs to be finalized first #731. Probably need a new SparseTables.jl package.
POC for implementation of time series classification models #303, ScientificTypesBase #14. A POC is here
POC for time series forecasting, along the lines of sktime; probably needs MLJBase #502 first, and someone to finish the PR on time series CV. See also this proposal
Add tools or a separate repository for visualization in MLJ.
- Extend visualization of tuning plots beyond two parameters #85 (closed), #416. Done, but it might be worth adding the alternatives suggested in the issue.
- visualizing decision boundaries? #342
- provide visualizations that MLR3 provides via mlr3viz
Extend API to accommodate outlier detection, as provided by OutlierDetection.jl #780 WIP @davn and @ablaom
Add more pre-processing tools:
- missing value imputation using Gaussian Mixture Model. Done, via addition of BetaML model, MissingImputator.
- improve autotype method (from ScientificTypes), perhaps by training on a large collection of datasets with manually labelled scitype schema.
Add integration with MLFlow; see this proposal
Extend integration with OpenML WIP @darenasc

Scalability

Roll out data front-ends for all models after MLJBase #501 is merged.
Online learning support and distributed data #60
DAG scheduling for learning network training #72 (multithreading first?)
Automated estimates of cpu/memory requirements #71
Add multithreading to tuning (MLJTuning #15). Done.
