- Description and Purpose
- Requirements
- How to Run
- Example Usage
- YAML Configuration File
- Error Handling
- Output
- Experiments
- Synthetic-Trained Models
- MultiWoZ-Trained Models and A/B Testing
This repository provides a script that generates synthetic dialogue datasets using a schema-first approach. The pipeline processes input configurations to produce human-readable dialogues annotated in MultiWoZ 2.2 format. The script main.py, its dependencies, and auxiliary modules can be found in the DatasetGenerator directory.
- Python 3.10+
- A valid OpenAI API key
pyyaml 6.0.1
openai 1.9.0
The main.py script can be run from any command-line terminal.
The user can specify the domains and the number of dialogues to generate, either through a YAML file or directly through a list of domains.
-y, --yaml: Path to the YAML dataset configuration file.
-l, --list: List of domains to generate datasets for (e.g., hotel train).
-r, --repetitions: Number of dialogues to generate for the dataset (default is 1).
python main.py -y path/to/config.yaml
python main.py -l hotel train -r 10
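As an illustration only, the options above could be parsed with argparse roughly as follows. This is a sketch, not the code in main.py: the function name build_parser and the help strings are assumptions, and only the flags and the default of 1 are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented above."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic dialogue datasets."
    )
    parser.add_argument("-y", "--yaml",
                        help="Path to the YAML dataset configuration file.")
    parser.add_argument("-l", "--list", nargs="+",
                        help="Domains to generate datasets for.")
    parser.add_argument("-r", "--repetitions", type=int, default=1,
                        help="Number of dialogues to generate (default: 1).")
    return parser


# Equivalent to: python main.py -l hotel train -r 10
args = build_parser().parse_args(["-l", "hotel", "train", "-r", "10"])
print(args.list, args.repetitions)
```

Passing the argument list explicitly keeps the sketch runnable outside a terminal; a real entry point would call parse_args() with no arguments so that the command line is read instead.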
The YAML configuration file should list the domains for which datasets need to be generated. Here is an example configuration:
- hotel
- train
Save the above content in a file, e.g., config.yaml, and provide its path using the -y argument.
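A minimal sketch of how such a file could be read with pyyaml is shown below. It assumes the file is a top-level YAML list exactly as in the example; the helper name load_domains is hypothetical and does not come from main.py.

```python
import yaml  # provided by the pyyaml package listed under Requirements

# Same content as the example config.yaml, inlined so the sketch is self-contained.
CONFIG_TEXT = """\
- hotel
- train
"""


def load_domains(text: str) -> list[str]:
    """Hypothetical helper: parse a top-level YAML list of domain names."""
    data = yaml.safe_load(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level YAML list of domains")
    return [str(item) for item in data]


domains = load_domains(CONFIG_TEXT)
print(domains)
```

To read from disk instead, the text could be obtained with something like Path("config.yaml").read_text() before calling the helper.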
The script currently supports the following domains:
If an unsupported domain is specified, the script will print a warning and continue processing.
The script includes retry logic to handle errors during dialogue generation. If an error occurs, it will retry up to three times before moving to the next step.
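The retry behaviour described above can be sketched as follows. This is not the script's actual implementation: the function name generate_with_retry is hypothetical, and the sketch reads "up to three times" as three attempts in total, which is an assumption.

```python
def generate_with_retry(generate, max_attempts: int = 3):
    """Call generate() until it succeeds or max_attempts is reached.

    Returns the result, or None if every attempt raised an exception,
    so the caller can move on to the next step.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return generate()
        except Exception as exc:  # broad on purpose: this is only a sketch
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
    return None


# Simulated generation step that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated failure")
    return "dialogue"


result = generate_with_retry(flaky)
print(result)
```

Returning None rather than re-raising is one way to let the pipeline continue after the final failed attempt; logging the skipped step would be a natural addition.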
Generated dialogues are saved in JSON format in the IntermediaryFiles directory. The files are named synthetic_dataset_<domain>.json for each domain.
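The naming convention above could be implemented along these lines. The helper save_dialogues and the placeholder record are assumptions for illustration; only the directory name and the file name pattern come from this README.

```python
import json
import tempfile
from pathlib import Path


def save_dialogues(domain: str, dialogues: list,
                   out_dir: str = "IntermediaryFiles") -> Path:
    """Write dialogues to <out_dir>/synthetic_dataset_<domain>.json."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"synthetic_dataset_{domain}.json"
    with path.open("w", encoding="utf-8") as handle:
        json.dump(dialogues, handle, indent=2, ensure_ascii=False)
    return path


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_dialogues("hotel", [{"placeholder": "record"}], out_dir=tmp)
    print(saved.name)
```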
The following Python libraries must be installed in order to run the experiments:
pandas 2.1.1
transformers 4.40.2
torch 2.3.0
tqdm 4.66.4
numpy 1.26.3
scikit-learn 1.2.1
plotly 5.20.0
The code for training the SVM, Naïve Bayes and BERT models on the synthetic data can be found in the directory Experiments/SynthDataset/.
Before any new synthetic dataset is used, its files must be processed by running the gpt_classifier.ipynb and json_processing.ipynb notebooks, so that the annotation is aligned with the MultiWoZ 2.2 standard.
Both notebooks can be found in the Experiments/SynthDataset/ directory, and they must be pointed to the new synthetic dataset's path.
The files containing the experiments are:
For SVM and Naïve Bayes:
The files can be run as provided, as long as the paths to the JSON files containing the synthetic dialogue data are specified.
The code for training the SVM, Naïve Bayes and BERT models on the MultiWoZ data can be found in the directory Experiments/MultiWozDataset/code.
The files can be used to train the MultiWoZ models, and also to test loaded synthetic-trained models, provided the necessary paths are added to the files.
The files containing the experiments are:
For SVM and Naïve Bayes: