Skip to content

🐦 Generating High Quality multiple type multi-turn Text-to-SQL Q&A training sample with LLM (include Supplementary_Material.pdf)

License

Notifications You must be signed in to change notification settings

mcxiaoxiao/QDA-SQL

Repository files navigation

QDA-SQL: Questions Enhanced Dialogue Augmentation for Multi-Turn Text-to-SQL [Arxiv]

Other language versions: 中文.

QDA-SQL leverages large language models (LLMs) to generate multi-turn dialogue samples with diverse question types for Text-to-SQL data augmentation. Given a set of question examples and databases, it produces high-quality samples. This document uses the CoSQL dataset as an example, which does not include domain-specific knowledge. If domain-specific knowledge is required, please specify it in goals_of_cosql_dev.csv. Additionally, Supplementary_Material.pdf contains written information that aids in understanding the paper.

Citations

@misc{sun2024qdasqlquestionsenhanceddialogue,
      title={QDA-SQL: Questions Enhanced Dialogue Augmentation for Multi-Turn Text-to-SQL}, 
      author={Yinggang Sun and Ziming Guo and Haining Yu and Chuanyi Liu and Xiang Li and Bingxuan Wang and Xiangzhan Yu and Tiancheng Zhao},
      year={2024},
      eprint={2406.10593},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2406.10593}, 
}

Getting Started

😉 To ensure the project runs correctly, you need to first organize your dataset according to the file structure

STEP 1: Parse Dataset (Optional)

(CSV file has already been generated. If no new data needs to be added, you can proceed to the next step. New data can be added following the format in goals_of_cosql_dev.csv). To parse the dataset, specify the path to cosql_all_info_dialogs.json and run cosql_parse_to_csv.py to generate goals_of_cosql_dev.csv.

python cosql_parse_to_csv.py

STEP 2: Generate Question-Answer Pairs

Run classification_generate_multithread.py to generate multi-turn dialogues with classification. Parameters:

  • csv_file_path: Path to the CSV file containing the organized goalsql data.
  • type_needed: Maximum number of dialogues to generate.
  • start_id: Starting ID.
  • end_id: Ending ID.
  • threads: Number of threads.
  • projectname: Name of the folder to save the results.
# Example
python classification_generate_multithread.py --csv_file_path goals_of_cosql_dev.csv --type_needed 10 --start_id 20 --end_id 1000 --threads 5 --projectname "test"

💡 For instance, type_needed = 8 and id_needed = [1000, 1500] mean generating dialogues with up to 8 different combinations of Thematic Relation and Q-A type, within the question_id range [1000, 1500] from goalsql (goals_of_cosql_dev.csv). The generated Q&A pairs will be saved in QAs_generate/c_outputs/XXX.

STEP 3: Filter and Optimize

Run critic_merge_classification_generate.py to merge, filter, and optimize the multi-turn data generated in the previous step. The results will be saved as XXX.json (initial filter), XXX_optimized.json (optimized), and XXX_filtered.json (final filter). Parameters:

  • csv_file_path: Path to the CSV file containing the organized goalsql data.
  • filename: Name of the folder to save the results.
# Example
python critic_merge_classification_generate.py --csv_file_path goals_of_cosql_dev.csv --projectname "test" --threads 5 --savename "merged_test.json"

Evaluation

SQL Generation Task

We follow the SParC evaluation methodology to compute metrics such as Exact Match (EM), Interaction Exact Match (IEM), Execution Match (EX), and Interaction Execution Match (IEX). For detailed definitions, refer to the SParC GitHub page.
We have provided an evaluation script evaluation.py in the evaluation folder, adapted from the SParC test scripts, to evaluate our outputs. An example model output is also provided as example.json.

# Example
python evaluation.py --json example.json --etype all --db ../QAs_generate/datasets/cosql_dataset/database --table ../QAs_generate/datasets/cosql_dataset/tables.json

File Structure

  • QAs_generate/datasets/: Stores datasets, such as QAs_generate/datasets/cosql_dataset/database and QAs_generate/datasets/BIRD/dev/dev_databases. The dataset organization should follow the format of Spider. Recommended datasets:
Dataset Description Download Link
Spider A large-scale complex and cross-domain semantic parsing and text-to-SQL dataset Spider
CoSQL A conversational text-to-SQL dataset CoSQL
SParC A cross-domain semantic parsing in context dataset SParC
BIRD A Big Bench for Large-Scale Database Grounded Text-to-SQLs BIRD

These datasets are organized in format compatible with our generation scripts.

  • QAs_generate/outputs/: Stores generated dialogues, such as QAs_generate/outputs/test2/ for dialogues generated with the project name test2, and dev/outputs/test2_merged.json for the merged results of test2 dialogues (additional processing like difficulty assessment and high-quality result filtering can be added).

  • QAs_generate/tools/: Stores utility tools.

  • QAs_generate/function_test.ipynb: Provides code examples for testing and using various tools, facilitating deeper modifications of each interface.

  • QAs_generate/cosql_parse_to_csv.py: Parses the CoSQL dataset, extracts useful data, and generates a standard dev/goals_of_cosql_dev.csv for subsequent dialogue generation. For other datasets, create corresponding XXXXX_parse files to generate dev/goals_of_XXXXX.csv, keeping the column names unchanged.

  • QAs_generate/token_count.txt: Tracks token usage with the OpenAI API. To reset the count, set the number to 0.

  • QAs_generate/classification_generate_multithread.py: Runs classification_generate in multiple threads to speed up the generation of Q&A samples.

About

🐦 Generating High Quality multiple type multi-turn Text-to-SQL Q&A training sample with LLM (include Supplementary_Material.pdf)

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published