This repository contains the code for DISC and the baseline methods from the paper *Interpretable Sequence Classification via Discrete Optimization* by Shvo, Li, Toro Icarte, and McIlraith.
Please cite this paper if you use any part of this code for academic research.
```
@article{shvo2020interpretable,
  title={Interpretable Sequence Classification via Discrete Optimization},
  author={Shvo, Maayan and Li, Andrew C and Toro Icarte, Rodrigo and McIlraith, Sheila A},
  journal={arXiv preprint arXiv:2010.02819},
  year={2020}
}
```
DISC is an efficient mixed integer linear programming method for learning deterministic finite automata (DFA) from a set of positive (accepted) and negative (rejected) sequences of symbols. Unlike most existing DFA-learning methods, we aim to perform well on real-world sequence data, even when the data is noisy (and not only when the data is generated by a regular language). Three main hyperparameters affect the learning:
- `q_max`: The maximum number of states in the learned DFA.
- `λ_t`: A regularization penalty for (non-self-loop) edge transitions.
- `λ_a`: A regularization penalty for state occupancy in non-absorbing states (we define states 1 and 2 to be absorbing states which always self-transition, with state 1 as a non-accepting state and state 2 as an accepting state).
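The actual objective is encoded as a mixed integer linear program (see `tree_minerror.py`), but conceptually it balances training error against these two regularizers. The sketch below is illustrative only; the function and variable names are hypothetical, not the repository's API.

```python
# A hedged sketch of the trade-off controlled by the regularization
# hyperparameters; all names here are hypothetical, and the real objective
# is solved as a MILP rather than evaluated as a Python function.
def disc_cost(num_misclassified, num_edge_transitions, num_nonabsorbing_visits,
              lambda_t, lambda_a):
    """Lower is better: training error plus the two regularization penalties."""
    return (num_misclassified
            + lambda_t * num_edge_transitions      # non-self-loop transitions used
            + lambda_a * num_nonabsorbing_visits)  # occupancy outside absorbing states 1 and 2
```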
This implementation is designed for sequence classification tasks with multiple classes: a single DFA is learned for each class, and Bayesian inference is used to produce a probability distribution over the possible classes.
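As a rough illustration of this inference step (a hedged sketch, not the repository's code; `accepts`, `class_priors`, and the noise rate `eps` are hypothetical stand-ins for the learned quantities):

```python
import numpy as np

def classify(trace, dfas, class_priors, eps=0.1):
    """Sketch: return P(class | trace) given one learned DFA per class."""
    # Treat each DFA's verdict as a noisy observation: the likelihood is high
    # when the class's DFA accepts the trace and low (but nonzero) otherwise.
    likelihoods = np.array([1.0 - eps if dfa.accepts(trace) else eps
                            for dfa in dfas])
    posterior = likelihoods * np.asarray(class_priors)  # Bayes' rule, unnormalized
    return posterior / posterior.sum()  # normalize into a distribution over classes
```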
Running DISC requires Gurobi Optimizer (available at https://www.gurobi.com) and an active license (academic licenses can be obtained for free). You may also need to install a few additional Python packages via pip.
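To confirm that the Gurobi Python bindings and license are set up before running DISC, a quick check like the following can help (this snippet is not part of the repository):

```python
import gurobipy as gp

# Constructing a Model requires a valid Gurobi license, so this fails fast
# if the license is missing or expired.
model = gp.Model("license_check")
print("Gurobi", ".".join(map(str, gp.gurobi.version())), "is installed and licensed.")
```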
To run DISC on the Alfred dataset, use the command
```
python3 run_tree_model.py --dataset=alfred --qmax=5 --id=1
```

(or alternatively, set these parameters at `run_tree_model.py:293-298`).
For more information on how to run the code, see the Detailed running instructions section below.
DISC is implemented in the files `{DFA_utils_tree_minerror, read_traces, run_tree_model, tree_minerror, tree_utils}.py` and is run with `python3 run_tree_model.py`.
For DISC, LSTM (run from `lstm.py`), and HMM (run from `supervised_hmm.py`), the file `GLOBAL_VARS.py` contains important flags. `VALIDATION` determines whether part of the training data will be used as a validation set in order to choose hyperparameters. `TREE_MINERROR`, if true, runs DISC when `run_tree_model.py` is invoked, and otherwise runs DFA-FT, another baseline from our experiments. `MULTILABEL`, if true, assumes there can be multiple labels associated with each observation trace.
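For concreteness, these flags might look like the following (the flag names come from `GLOBAL_VARS.py`, but the values here are hypothetical; check the file itself for the actual defaults):

```python
# Hypothetical example values; see GLOBAL_VARS.py for the real defaults.
VALIDATION = True      # hold out part of the training data to pick hyperparameters
TREE_MINERROR = True   # True: run_tree_model.py runs DISC; False: it runs DFA-FT
MULTILABEL = False     # True: each observation trace may carry multiple labels
```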
`run_tree_model.py:150-151` contains lists of hyperparameter values for `λ_t` and `λ_a` to search over when looking for the best model, and `run_tree_model.py:296` allows you to set the value of `q_max`.
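Such a search grid might look like the following (a hypothetical illustration only; the actual lists and variable names at `run_tree_model.py:150-151` may differ):

```python
# Hypothetical hyperparameter grids; the repository's actual values may differ.
lambda_t_values = [0.0, 0.01, 0.1]  # penalties on non-self-loop edge transitions
lambda_a_values = [0.0, 0.01, 0.1]  # penalties on non-absorbing state occupancy

# Each (lambda_t, lambda_a) pair is tried, and the best model is kept
# (e.g., by validation accuracy when VALIDATION is enabled).
```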
Several publicly available datasets are provided in the `traces/` directory and can be run directly with DISC. We provide 30 randomly split training and testing sets for each dataset.
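Assuming, as the `--id=1` example above suggests, that `--id` selects which of the 30 splits to use, all splits of a dataset can be swept with a small driver script like this (a sketch; only the `--dataset`, `--qmax`, and `--id` flags are taken from the run command shown earlier):

```python
import subprocess

# Run DISC on all 30 train/test splits of the alfred dataset.
for split_id in range(1, 31):
    subprocess.run(
        ["python3", "run_tree_model.py",
         "--dataset=alfred", "--qmax=5", f"--id={split_id}"],
        check=True,  # stop if any run fails
    )
```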