Dialogue Dataset Generator

Description and Purpose

This repository provides a script that generates synthetic dialogue datasets using a schema-first approach. The pipeline processes input configurations to produce human-readable dialogues annotated in MultiWoZ 2.2 format. The main.py script, its dependencies, and its auxiliary modules are in the DatasetGenerator folder.

Requirements

Auxiliary

Python 3.10+
A valid OpenAI API key

Python Libraries

pyyaml 6.0.1
openai 1.9.0

How to Run

The main.py script can be run from any command-line terminal.

The domains and the number of dialogues to generate can be specified either through a YAML file or directly on the command line as a list of domains.

Command Line Arguments

-y, --yaml: Path to the YAML dataset configuration file.
-l, --list: List of domains to generate datasets for (e.g., hotel train).
-r, --repetitions: Number of dialogues to generate for the dataset (default is 1).
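
For orientation, the three flags above could be declared with argparse roughly as follows. This is a minimal sketch, not the code in main.py: the function name build_parser is hypothetical, and whether -y and -l may be combined is not stated in this README, so the sketch does not enforce either choice.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Generate synthetic dialogue datasets."
    )
    parser.add_argument(
        "-y", "--yaml",
        help="Path to the YAML dataset configuration file.",
    )
    parser.add_argument(
        "-l", "--list",
        nargs="+",  # accepts one or more domain names, e.g. hotel train
        help="List of domains to generate datasets for.",
    )
    parser.add_argument(
        "-r", "--repetitions",
        type=int,
        default=1,  # documented default
        help="Number of dialogues to generate for the dataset.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-l", "hotel", "train", "-r", "10"])
print(args.list, args.repetitions)  # prints ['hotel', 'train'] 10
```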

Example Usage

Using a YAML Configuration File

python main.py -y path/to/config.yaml

Using a List of Domains

python main.py -l hotel train -r 10

YAML Configuration File

The YAML configuration file should list the domains for which datasets need to be generated. Here is an example configuration:

Example YAML File

domains:
  - hotel
  - train

Save the above content in a file, e.g., config.yaml, and provide its path using the -y argument.

Supported Domains

The script currently supports the following domains:

hotel
train

If an unsupported domain is specified, the script will print a warning and continue processing.
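
The warn-and-continue behaviour described above can be pictured with the short sketch below. It assumes that an unsupported domain is skipped after the warning, which this README does not state explicitly; the names SUPPORTED_DOMAINS and filter_domains are hypothetical and not taken from the script.

```python
SUPPORTED_DOMAINS = {"hotel", "train"}  # the two domains listed above


def filter_domains(requested):
    """Return the supported domains in their original order.

    Unsupported entries trigger a warning and are skipped (an assumption).
    """
    accepted = []
    for domain in requested:
        if domain in SUPPORTED_DOMAINS:
            accepted.append(domain)
        else:
            print(f"Warning: unsupported domain '{domain}', skipping.")
    return accepted


# "taxi" is used here only as an example of an unsupported name.
print(filter_domains(["hotel", "taxi", "train"]))
```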

Error Handling

The script includes retry logic to handle errors during dialogue generation. If an error occurs, it will retry up to three times before moving to the next step.
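
A retry loop of this kind can be sketched as below. The sketch reads "up to three times" as three attempts in total and returns None when all of them fail so that the caller can move on; both points are assumptions, and run_with_retries is a hypothetical name rather than a function from the script.

```python
def run_with_retries(step, max_attempts=3):
    """Call `step` until it succeeds or `max_attempts` attempts are used.

    Returns the result of `step`, or None if every attempt raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as exc:  # deliberately broad: illustration only
            print(f"Attempt {attempt}/{max_attempts} failed: {exc}")
    return None


# Example: a step that fails twice before succeeding.
state = {"calls": 0}


def flaky_step():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "dialogue generated"


print(run_with_retries(flaky_step))  # prints two failure lines, then: dialogue generated
```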

Output

Generated dialogues are saved in JSON format in the IntermediaryFiles directory. The files are named synthetic_dataset_<domain>.json for each domain.
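
To inspect a generated file, something like the following sketch may help. It relies only on the directory and file-name pattern given above and assumes the command is run from the folder that contains IntermediaryFiles; the helper names output_path and load_dialogues are hypothetical, and the internal layout of the JSON is not assumed.

```python
import json
from pathlib import Path


def output_path(domain, base_dir="IntermediaryFiles"):
    """Build the expected path of the dataset file for one domain."""
    return Path(base_dir) / f"synthetic_dataset_{domain}.json"


def load_dialogues(domain, base_dir="IntermediaryFiles"):
    """Load the generated JSON for one domain and return the parsed object."""
    with output_path(domain, base_dir).open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, load_dialogues("hotel") would read IntermediaryFiles/synthetic_dataset_hotel.json once the generator has produced it.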

Experiments

Requirements

The following Python libraries must be installed in order to run the experiments:

pandas 2.1.1
transformers 4.40.2
torch 2.3.0
tqdm 4.66.4
numpy 1.26.3
scikit-learn 1.2.1
plotly 5.20.0

Synthetic-Trained Models

The code for training the SVM, Naïve Bayes, and BERT models on the synthetic data is in the Experiments/SynthDataset/ directory.

Before using any new synthetic dataset, process its files by running the gpt_classifier.ipynb and json_processing.ipynb notebooks so that the annotation is aligned with the MultiWoZ 2.2 standard.

Both notebooks are in Experiments/SynthDataset/ and must be pointed to the path of the new synthetic dataset.

The files containing the experiments are:

For SVM and Naïve Bayes:

domain_synth.ipynb
intent_synth.ipynb
slot_synth.ipynb

For BERT:

BERT_Synth_Domain.ipynb
BERT_Synth_Intent.ipynb
BERT_Synth_Slot.ipynb

The notebooks can be run as they are, provided that the paths to the JSON files containing the synthetic dialogue data are specified.

MultiWoZ-Trained Models and A/B Testing

The code for training the SVM, Naïve Bayes, and BERT models on the MultiWoZ data is in the Experiments/MultiWozDataset/code directory.

These notebooks train the MultiWoZ models and can also be used to test loaded synthetic-trained models, provided the necessary paths are added to the files.

The files containing the experiments are:

For SVM and Naïve Bayes:

multiwoz_active_intent.ipynb
multiwoz_domain.ipynb
multiwoz_slot_values.ipynb

For BERT:

BERT_MultiWoZ_Domain.ipynb
BERT_MultiWoZ_Intent.ipynb
BERT_MultiWoZ_Slot.ipynb
