This repository contains the implementation and results for the capability testing of NLP models described in the following paper:
Paper: Automated Testing Linguistic Capabilities of NLP Models
ALiCT is an automated linguistic capability-based testing framework for NLP models. In this implementation, we generate test cases for sentiment analysis and hate speech detection. ALiCT generates seed test cases using the SST and HateXplain datasets as the labeled search datasets for sentiment analysis and hate speech detection, respectively. Results of ALiCT are available here. Supplemental artifacts for the results can be downloaded from here.
This application is written for Python > 3.7.11. All requirements are listed in requirements.txt and can be installed with pip using the following command:
pip install -r requirements.txt
This step generates the seed and expanded test cases with the following command:
cd alict
# Sentiment analysis
python -m python.sa.main \
--run template \
--search_dataset sst \
--search_selection random
# Hate speech detection
python -m python.hs.main \
--run template \
--search_dataset hatexplain \
--search_selection random
Outputs of the command are written to the result directories {PROJ_DIR}/_results/templates_sa_sst_random/ and {PROJ_DIR}/_results/templates_hs_hatexplain_random/ for sentiment analysis and hate speech detection, respectively.
For each task and its result directory, the following files are generated:
_results/
|- templates_sa_sst_random/
| |- cfg_expanded_inputs_{CKSUM}.json
| |- seeds_{CKSUM}.json
| |- exps_{CKSUM}.json
| |- cksum_map.txt
|- templates_hs_hatexplain_random/
| |- cfg_expanded_inputs_{CKSUM}.json
| |- seeds_{CKSUM}.json
| |- exps_{CKSUM}.json
| |- cksum_map.txt
where {CKSUM} represents the checksum value of each unique linguistic capability (LC). The mapping between each checksum value and its corresponding LC is described in cksum_map.txt.
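Because the {CKSUM} values are opaque, it can be convenient to look them up programmatically. Below is a minimal Python sketch for doing so; it assumes cksum_map.txt pairs each LC description with its checksum on one line separated by a tab, which may not match the actual layout:

import pathlib

def load_cksum_map(path):
    # Map checksum -> LC description. The tab delimiter and the
    # "LC<TAB>checksum" line layout are assumptions about the file.
    mapping = {}
    for line in pathlib.Path(path).read_text().splitlines():
        lc, sep, cksum = line.strip().rpartition('\t')
        if sep:
            mapping[cksum.strip()] = lc.strip()
    return mapping

cksum_map = load_cksum_map('_results/templates_sa_sst_random/cksum_map.txt')
print(cksum_map)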
cfg_expanded_inputs_{CKSUM}.json contains the seed test cases, their context-free grammars (CFGs), and the production rules expanded from each seed into the expanded test cases. seeds_{CKSUM}.json and exps_{CKSUM}.json contain the seed and expanded test cases, respectively; both are generated from the intermediate result file cfg_expanded_inputs_{CKSUM}.json.
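To get a quick look at the generated test cases, the JSON files can be loaded directly. The sketch below only prints the top-level structure, since the exact schema of seeds_{CKSUM}.json is not documented here; the dbc8046 checksum in the file name is an illustrative example:

import json

# Load the seed test cases for one LC (the checksum here is illustrative).
with open('_results/templates_sa_sst_random/seeds_dbc8046.json') as f:
    seeds = json.load(f)

# Inspect the structure before relying on specific field names.
print(type(seeds).__name__)
print(list(seeds)[:3] if isinstance(seeds, (list, dict)) else seeds)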
This step converts the seed and expanded test cases from .json format into .pkl testsuite files in the CheckList test-case format, so that NLP models can be evaluated on them through Hugging Face pipelines. You can run it by executing the following command:
cd alict
# Sentiment analysis
python -m python.sa.main \
--run testsuite \
--search_dataset sst \
--search_selection random
# Hate speech detection
python -m python.hs.main \
--run testsuite \
--search_dataset hatexplain \
--search_selection random
The output .pkl testsuite files, written to the result directories ({PROJ_DIR}/_results/test_results_sa_sst_random/ and {PROJ_DIR}/_results/test_results_hs_hatexplain_random/ for sentiment analysis and hate speech detection, respectively), are the following (a loading sketch follows the listing):
_results/
|- test_results_sa_sst_random/
| |- sa_testsuite_seeds_{CKSUM}.pkl
| |- sa_testsuite_exps_{CKSUM}.pkl
|- test_results_hs_hatexplain_random/
| |- hs_testsuite_seeds_{CKSUM}.pkl
| |- hs_testsuite_exps_{CKSUM}.pkl
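Since these are CheckList-format testsuites, they should be loadable with the checklist package (pip install checklist). A minimal sketch, assuming the .pkl files are standard CheckList TestSuite pickles and using an illustrative checksum in the file name:

from checklist.test_suite import TestSuite

# Load one generated testsuite and list the tests it contains.
suite = TestSuite.from_file(
    '_results/test_results_sa_sst_random/sa_testsuite_seeds_dbc8046.pkl')
print(list(suite.tests.keys()))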
This step runs the models under test on the generated test cases. You can run it with the following command:
cd alict
# Sentiment analysis
python -m python.sa.main \
--run testmodel \
--search_dataset sst \
--search_selection random
# Hate speech detection
python -m python.hs.main \
--run testmodel \
--search_dataset hatexplain \
--search_selection random
The test results are then written to test_results.txt in the same result directories (a sketch of this step follows the listing):
_results/
|- test_results_sa_sst_random/
| |- sa_testsuite_seeds_{CKSUM}.pkl
| |- sa_testsuite_exps_{CKSUM}.pkl
| |- test_results.txt
|- test_results_hs_hatexplain_random/
| |- hs_testsuite_seeds_{CKSUM}.pkl
| |- hs_testsuite_exps_{CKSUM}.pkl
| |- test_results.txt
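For reference, the sketch below shows how such a testsuite can be run against a Hugging Face sentiment pipeline via CheckList. The model name and the two-way label-to-probability mapping are illustrative assumptions, not necessarily the models or label scheme used by the testmodel run:

import numpy as np
from transformers import pipeline
from checklist.test_suite import TestSuite
from checklist.pred_wrapper import PredictorWrapper

# Illustrative model choice; ALiCT's evaluated models may differ.
clf = pipeline('sentiment-analysis',
               model='distilbert-base-uncased-finetuned-sst-2-english')

def predict_proba(texts):
    # Return an (n, 2) array of [negative, positive] probabilities.
    probs = []
    for out in clf(list(texts)):
        p_pos = out['score'] if out['label'] == 'POSITIVE' else 1.0 - out['score']
        probs.append([1.0 - p_pos, p_pos])
    return np.array(probs)

suite = TestSuite.from_file(
    '_results/test_results_sa_sst_random/sa_testsuite_seeds_dbc8046.pkl')
suite.run(PredictorWrapper.wrap_softmax(predict_proba), overwrite=True)
suite.summary()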
This step analyzes the results from Step 3 and writes them to test_result_analysis.json, a parsed version of test_results.txt. You can run it with the following command:
cd alict
# Sentiment analysis
python -m python.sa.main \
--run analyze \
--search_dataset sst \
--search_selection random
# Hate speech detection
python -m python.hs.main \
--run analyze \
--search_dataset hatexplain \
--search_selection random
The output is test_result_analysis.json (a sketch for inspecting it follows the listing):
_results/
|- test_results_sa_sst_random/
| |- sa_testsuite_seeds_{CKSUM}.pkl
| |- sa_testsuite_exps_{CKSUM}.pkl
| |- test_results.txt
| |- test_result_analysis.json
|- test_results_hs_hatexplain_random/
| |- hs_testsuite_seeds_{CKSUM}.pkl
| |- hs_testsuite_exps_{CKSUM}.pkl
| |- test_results.txt
| |- test_result_analysis.json
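The analysis file can then be inspected directly. The sketch below just prints its top-level structure, since the exact schema is not documented here:

import json
import pprint

with open('_results/test_results_sa_sst_random/test_result_analysis.json') as f:
    analysis = json.load(f)

# Show the shape of the data without assuming specific field names.
pprint.pprint(analysis, depth=2)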
This step runs the aforementioned pipeline for the fairness LC (shown here for the sentiment analysis task):
# 1. Seed and Expanded Test Case Identification
python -m python.sa.main \
--run template_fairness \
--search_dataset sst \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
# 2. Testsuite Generation
python -m python.sa.main \
--run testsuite_fairness \
--search_dataset sst \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
# 3. Run Model on The Generated Testsuites
python -m python.sa.main \
--run testmodel_fairness \
--search_dataset sst \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
# 4. Analyze The Testing Results
python -m python.sa.main \
--run analyze_fairness \
--search_dataset sst \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
The output testsuite and result files, written to the result directories ({PROJ_DIR}/_results/test_results_sa_sst_random_for_fairness/ and {PROJ_DIR}/_results/test_results_hs_hatexplain_random_for_fairness/ for sentiment analysis and hate speech detection, respectively), are the following:
_results/
|- test_results_sa_sst_random_for_fairness/
| |- sa_testsuite_fairness_seeds_dbc8046.pkl
| |- sa_testsuite_fairness_exps_dbc8046.pkl
| |- test_results_fairness.txt
| |- test_result_fairness_analysis.json
|- test_results_hs_hatexplain_random_for_fairness/
| |- hs_testsuite_fairness_seeds_dbc8046.pkl
| |- hs_testsuite_fairness_exps_dbc8046.pkl
| |- test_results_fairness.txt
| |- test_result_fairness_analysis.json
This step runs the aforementioned steps to evaluate a Large Language Model (LLM), the GPT-3.5 model in this implementation, over the LCs.
# Sentiment analysis
## 1. Run LLM on ALiCT test cases
python -m python.sa.main \
--run testmodel_tosem \
--search_dataset sst \
--syntax_selection random
## 2. Analysis of result
python -m python.sa.main \
--run analyze_tosem \
--search_dataset sst \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
# Hate speech detection
## 1. Run LLM on ALiCT test cases
python -m python.hs.main \
--run testmodel_tosem \
--search_dataset hatexplain \
--syntax_selection random
## 2. Analysis of result
python -m python.hs.main \
--run analyze_tosem \
--search_dataset hatexplain \
--syntax_selection random \
--num_seeds -1 \
--num_trials 1
Output results are as follows (a sketch of a GPT-3.5 query follows the listing):
_results/
|- test_results_sa_sst_random/
| |- sa_testsuite_tosem_seeds_{CKSUM}.pkl
| |- sa_testsuite_tosem_exps_{CKSUM}.pkl
| |- test_results_tosem.txt
| |- test_result_tosem_analysis.json
|- test_results_hs_hatexplain_random/
| |- hs_testsuite_tosem_seeds_{CKSUM}.pkl
| |- hs_testsuite_tosem_exps_{CKSUM}.pkl
| |- test_results_tosem.txt
| |- test_result_tosem_analysis.json
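For reference, the sketch below shows how a single GPT-3.5 query for a sentiment test case might look with the openai package (it assumes OPENAI_API_KEY is set in the environment). The prompt wording is an illustrative assumption, not the exact prompt used by the testmodel_tosem run:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(sentence):
    # Ask GPT-3.5 for a one-word sentiment label (illustrative prompt).
    resp = client.chat.completions.create(
        model='gpt-3.5-turbo',
        messages=[
            {'role': 'system',
             'content': 'Label the sentiment of the sentence as positive, '
                        'negative, or neutral. Answer with one word.'},
            {'role': 'user', 'content': sentence},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_sentiment('The movie was surprisingly good.'))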