Towards Fine-Grained Pedagogical Control over English Grammar Complexity in Educational Text Generation
This project on using LLMs for grammar-controlled educational text generation originated in the Yale course CPSC488/588 "AI Foundation Models" taught by Arman Cohan. In collaboration with Detmar Meurers, it was submitted to the BEA Workshop at NAACL 2024. The goal is to use GPT models to generate example sentences for the grammar constructs in the English Grammar Profile (EGP). The augmented dataset is used to train BERT-based grammar detectors, which are then used to re-rank sentence candidates from Mistral-7B-Instruct-v0.2 by the complexity of the grammar they use, making the generated educational text appropriate for learners with differing language proficiency.
The project provides 946K generated example sentences covering all 1,222 EGP entries (at least 500 positive and 250 negative examples per entry). Note that these examples are automatically generated; a quality estimate indicates that 87.1% of them can be assumed to be correctly labeled. The entire dataset can be found in the main directory under `EGP_examples.json`.
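To get a first look at the file, something like the following should work (a minimal sketch; the top-level JSON layout and field names are not documented here, so inspect a record before relying on specific keys):

```python
import json

# Minimal sketch: load the generated examples and inspect their structure.
# Assumption: the file sits in the repository root, as described above.
with open("EGP_examples.json", encoding="utf-8") as f:
    data = json.load(f)

# Works whether the top level is a list or a mapping.
print(type(data), len(data))
```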
There are six trained classification models, one per CEFR level, that need to be placed in the folder `models/classifiers`. You can download them from Google Drive. For an example of how to use them, see the functions `load_model` and `get_scores` in `/source/models.py`.
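As a rough illustration, usage might look like the sketch below; the actual signatures and return types of `load_model` and `get_scores` are defined in `/source/models.py`, so the argument names and the checkpoint path here are assumptions:

```python
# Hypothetical usage sketch; consult load_model and get_scores in
# /source/models.py for the actual signatures and return types.
# Assumption: this runs from the directory containing models.py.
from models import load_model, get_scores

# Assumption: one classifier per CEFR level, stored under models/classifiers.
model = load_model("models/classifiers/A1")

candidates = [
    "The cat sits on the mat.",
    "Had I known earlier, I would have acted differently.",
]

# Assumption: get_scores returns one score per sentence, which can then be
# used to re-rank generation candidates by grammatical complexity.
scores = get_scores(model, candidates)
for sentence, score in sorted(zip(candidates, scores), key=lambda x: x[1]):
    print(f"{score:.3f}  {sentence}")
```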
- Ensure you have `conda` installed on your machine.
- To install the Conda environment for this project, run `conda env create -f environment.yml`.
- Before running any scripts, activate the environment with `conda activate llm`.
- To use spaCy in experiments 8-10, first execute `python -m spacy download en_core_web_sm`.
- If you want to generate text using Google Cloud, install the gcloud CLI following their guide.
- Create a copy of `config.py.example` named `config.py` and insert your API keys and the path to the gcloud credentials of a service account with the appropriate rights (see the illustrative sketch below).
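The resulting file might look roughly like this; the variable names are illustrative assumptions, and the authoritative list of required fields is whatever `config.py.example` contains:

```python
# config.py -- illustrative sketch only; mirror the fields in config.py.example.
OPENAI_API_KEY = "sk-..."  # assumed name for the OpenAI API key field
GCLOUD_CREDENTIALS_PATH = "/path/to/service-account.json"  # assumed name for the service-account credentials path
```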
You can reproduce the chronologically enumerated experiments by executing the Jupyter notebooks in the folder `exp`. However, for reproduction it is recommended to use the scripts below.
To create the augmented dataset, execute the Python script `generate_examples.py` in the directory `src` with the following options:
```
--examples-per-batch EXAMPLES_PER_BATCH
                      Positive and negative examples per batch (default: 20)
--batches BATCHES     Number of batches (default: 5)
--samples-per-level SAMPLES_PER_LEVEL
                      Samples per CEFR level (default: 1)
--input-file INPUT_FILE
                      Name of input file in folder dat (default: egponline.csv)
--output-file OUTPUT_FILE
                      Name of output file in folder dat (default: egpaugmented.json)
```
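For example, the following invocation overrides the batch settings and keeps the remaining defaults (running from within `src` is an assumption based on the directory note above):

```
cd src
python generate_examples.py --batches 10 --samples-per-level 2
```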
To train the classifiers, execute the Python script `train_classifiers.py` in the directory `src` with the following options:
```
--input-file INPUT_FILE
                      Name of input file in folder dat (default: egpaugmented.csv)
--output-dir OUTPUT_DIR
                      Name of output directory for model checkpoints (default: models)
```
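Note that the documented default input (`egpaugmented.csv`) differs from the generator's default output (`egpaugmented.json`), so passing `--input-file` explicitly avoids relying on the default, for example:

```
cd src
python train_classifiers.py --input-file egpaugmented.json --output-dir models
```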