Interpretable Sequence Classification via Discrete Optimization

This repository contains the code for DISC and the baselines from the paper Interpretable Sequence Classification via Discrete Optimization by Shvo, Li, Toro Icarte, and McIlraith.

Please cite this paper if you use any part of this code for academic research.

@article{shvo2020interpretable,
  title={Interpretable Sequence Classification via Discrete Optimization},
  author={Shvo, Maayan and Li, Andrew C and Toro Icarte, Rodrigo and McIlraith, Sheila A},
  journal={arXiv preprint arXiv:2010.02819},
  year={2020}
}

Overview

DISC is an efficient mixed-integer linear programming (MILP) method for learning deterministic finite automata (DFAs) from a set of positive (accepted) and negative (rejected) sequences of symbols. Unlike most existing DFA-learning methods, DISC aims to perform well on real-world sequence data, even when the data is noisy (and not only when it is generated by a regular language). Three main hyperparameters control learning (a sketch of how they enter the objective follows the list):

  • q_max: The maximum number of states in the learned DFA.
  • λ_t: A regularization penalty for (non-self-loop) edge transitions.
  • λ_a: A regularization penalty for occupying non-absorbing states. (We define states 1 and 2 as absorbing states that always self-transition: state 1 is non-accepting and state 2 is accepting.)
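
Only as an informal sketch of how these terms interact (the paper gives the exact MILP formulation; this is not it), the objective has roughly the shape:

```latex
\min_{\text{DFA with at most } q_{\max} \text{ states}} \;
    (\text{misclassified training sequences})
    \;+\; \lambda_t \cdot \#\{\text{non-self-loop transitions}\}
    \;+\; \lambda_a \cdot (\text{total steps spent in non-absorbing states})
```

Larger λ_t and λ_a push the optimizer toward simpler DFAs that route traces into the absorbing states quickly, trading some training accuracy for interpretability.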

This implementation is designed for sequence classification tasks with multiple classes: a single DFA is learned for each class, and Bayesian inference combines their outputs into a probability distribution over classes (a sketch follows).
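
Purely as a sketch of how per-class DFA verdicts could be combined via Bayes' rule (the function name, the accept_prob matrix, and all numbers here are illustrative assumptions, not the paper's exact inference procedure):

```python
import numpy as np

def class_posterior(verdicts, accept_prob, priors):
    """Combine per-class DFA verdicts into a class distribution (naive-Bayes style sketch).

    verdicts    : (K,) bool array -- does DFA_k accept the sequence?
    accept_prob : (K, C) array    -- assumed estimates of P(DFA_k accepts | true class c),
                                     e.g. measured on held-out training data
    priors      : (C,) array      -- P(class c), e.g. empirical class frequencies
    """
    accept_prob = np.clip(accept_prob, 1e-6, 1 - 1e-6)        # avoid log(0)
    per_dfa = np.where(verdicts[:, None], accept_prob, 1.0 - accept_prob)
    log_post = np.log(priors) + np.log(per_dfa).sum(axis=0)   # sum over the K DFAs
    post = np.exp(log_post - log_post.max())                  # stabilize before normalizing
    return post / post.sum()

# Example: 3 classes; only class 1's DFA accepts the sequence.
verdicts = np.array([False, True, False])
accept_prob = np.array([[0.9, 0.2, 0.1],
                        [0.1, 0.8, 0.2],
                        [0.1, 0.1, 0.9]])
priors = np.full(3, 1 / 3)
print(class_posterior(verdicts, accept_prob, priors))  # concentrates on class 1
```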

Quick-start guide

Running DISC requires the Gurobi Optimizer (available at https://www.gurobi.com) and an active license (academic licenses are available for free). You may also need to install a few small Python packages via pip.
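
As a quick sanity check that Gurobi's Python bindings and license are in place (this snippet assumes the gurobipy package; the model name is arbitrary):

```python
# Minimal check that gurobipy is importable and a license is available.
import gurobipy as gp

try:
    gp.Model("license_check")  # raises GurobiError if no valid license is found
    print("Gurobi", gp.gurobi.version(), "is ready")
except gp.GurobiError as e:
    print("Gurobi is installed but not licensed:", e)
```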

To run DISC on the Alfred dataset, use the command python3 run_tree_model.py --dataset=alfred --qmax=5 --id=1 (alternatively, set these parameters directly at run_tree_model.py:293-298).

For more information on how to run the code, see the Detailed running instructions section below.

Detailed running instructions

DISC is implemented in the files DFA_utils_tree_minerror.py, read_traces.py, run_tree_model.py, tree_minerror.py, and tree_utils.py, and is run with python3 run_tree_model.py.

For DISC, the LSTM baseline (run from lstm.py), and the HMM baseline (run from supervised_hmm.py), the file GLOBAL_VARS.py contains important flags:

  • VALIDATION: whether part of the training data is held out as a validation set for choosing hyperparameters.
  • TREE_MINERROR: if true, run_tree_model.py runs DISC; otherwise it runs DFA-FT, another baseline from our experiments.
  • MULTILABEL: if true, each observation trace may be associated with multiple labels.
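
For illustration only, these are plain module-level booleans; the flag names come from the repository, but the values below are placeholders, not the repository's defaults:

```python
# GLOBAL_VARS.py -- flag names are real; the values here are placeholders.
VALIDATION = True      # hold out part of the training data to select hyperparameters
TREE_MINERROR = True   # True: run_tree_model.py runs DISC; False: it runs DFA-FT
MULTILABEL = False     # True: each observation trace may carry multiple labels
```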

run_tree_model.py:150-151 contains the lists of λ_t and λ_a values to search over when selecting the best model, and run_tree_model.py:296 sets the value of q_max.
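
Purely as a hypothetical illustration of the shape of that search (the actual values live in run_tree_model.py:150-151), the grid is two lists whose cross product is evaluated:

```python
# Hypothetical values -- consult run_tree_model.py:150-151 for the real grid.
lambda_t_values = [0.0, 0.01, 0.1]    # penalty on non-self-loop transitions
lambda_a_values = [0.0, 0.001, 0.01]  # penalty on non-absorbing state occupancy

for lam_t in lambda_t_values:
    for lam_a in lambda_a_values:
        ...  # train a DFA with (lam_t, lam_a) and score it on the validation set
```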

Several publicly available datasets are provided in the traces/ directory and can be run directly with DISC. For each dataset, we provide 30 random training/testing splits.
