Towards Fine-Grained Pedagogical Control over English Grammar Complexity in Educational Text Generation
This project on using LLMs for grammar-controlled educational text generation originated in the Yale course CPSC488/588 "AI Foundation Models" taught by Arman Cohan. In collaboration with Detmar Meurers, it was submitted to the BEA Workshop at NAACL 2024. The goal is to use GPT models to generate example sentences for the grammar constructs in the English Grammar Profile (EGP). The augmented dataset is used to train BERT-based grammar detectors, which are then used to re-rank sentence candidates from Mistral-7B-Instruct-v0.2 by the complexity of the grammar they use, making the generated educational text appropriate for learners with differing language proficiency.
The project provides 946K generated example sentences covering all 1,222 EGP entries (at least 500 positive and 250 negative examples per entry). Note that these examples are automatically generated; a quality estimate indicates that 87.1% of them can be assumed to be correctly labeled. The entire dataset can be found in the main directory under `EGP_examples.json`.
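To get a first look at the file, something like the following should work (a minimal sketch; the top-level JSON layout and field names are not documented here, so inspect a record before relying on specific keys):

```python
import json

# Minimal sketch: load the generated examples and inspect their structure.
# Assumption: the file sits in the repository root, as described above.
with open("EGP_examples.json", encoding="utf-8") as f:
    data = json.load(f)

# Works whether the top level is a list or a mapping.
print(type(data), len(data))
```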
There are six trained classification models, one per CEFR level, that need to be placed in the folder `models/classifiers`. You can download them from Google Drive. For an example of how to use them, see the functions `load_model` and `get_scores` in `/source/models.py`.
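As a rough illustration, usage might look like the sketch below; the actual signatures and return types of `load_model` and `get_scores` are defined in `/source/models.py`, so the argument names and the checkpoint path here are assumptions:

```python
# Hypothetical usage sketch; consult load_model and get_scores in
# /source/models.py for the actual signatures and return types.
# Assumption: this runs from the directory containing models.py.
from models import load_model, get_scores

# Assumption: one classifier per CEFR level, stored under models/classifiers.
model = load_model("models/classifiers/A1")

candidates = [
    "The cat sits on the mat.",
    "Had I known earlier, I would have acted differently.",
]

# Assumption: get_scores returns one score per sentence, which can then be
# used to re-rank generation candidates by grammatical complexity.
scores = get_scores(model, candidates)
for sentence, score in sorted(zip(candidates, scores), key=lambda x: x[1]):
    print(f"{score:.3f}  {sentence}")
```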
- Ensure you have `conda` installed on your machine.
- To install the Conda environment for this project, run `conda env create -f environment.yml`.
- Before running any scripts, activate the environment with `conda activate llm`.
- To use spaCy in experiments 8-10, first execute `python -m spacy download en_core_web_sm`.
- If you want to generate text using Google Cloud, install the gcloud CLI following their guide.
- Create a copy of `config.py.example` named `config.py` and insert your API keys and the path to the gcloud credentials of a service account with the appropriate rights (see the illustrative sketch below).
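The resulting file might look roughly like this; the variable names are illustrative assumptions, and the authoritative list of required fields is whatever `config.py.example` contains:

```python
# config.py -- illustrative sketch only; mirror the fields in config.py.example.
OPENAI_API_KEY = "sk-..."  # assumed name for the OpenAI API key field
GCLOUD_CREDENTIALS_PATH = "/path/to/service-account.json"  # assumed name for the service-account credentials path
```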
You can reproduce the chronologically enumerated experiments by executing the Jupyter notebooks in the folder `exp`. However, for reproduction it is recommended to use the scripts below.
To create the augmented dataset, execute the Python script `generate_examples.py` in the directory `src` with the following options:
```
--examples-per-batch EXAMPLES_PER_BATCH
                      Positive and negative examples per batch (default: 20)
--batches BATCHES     Number of batches (default: 5)
--samples-per-level SAMPLES_PER_LEVEL
                      Samples per CEFR level (default: 1)
--input-file INPUT_FILE
                      Name of input file in folder dat (default: egponline.csv)
--output-file OUTPUT_FILE
                      Name of output file in folder dat (default: egpaugmented.json)
```
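For example, the following invocation overrides the batch settings and keeps the remaining defaults (running from within `src` is an assumption based on the directory note above):

```
cd src
python generate_examples.py --batches 10 --samples-per-level 2
```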
To train the classifiers, execute the Python script `train_classifiers.py` in the directory `src` with the following options:
```
--input-file INPUT_FILE
                      Name of input file in folder dat (default: egpaugmented.csv)
--output-dir OUTPUT_DIR
                      Name of output directory for model checkpoints (default: models)
```
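Note that the documented default input (`egpaugmented.csv`) differs from the generator's default output (`egpaugmented.json`), so passing `--input-file` explicitly avoids relying on the default, for example:

```
cd src
python train_classifiers.py --input-file egpaugmented.json --output-dir models
```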