- Repository Overview: This repository contains the code, results, and dataset for the paper "Can Knowledge Editing Really Correct Hallucinations?"
- TLDR: We propose HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations along five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness. We find that their effectiveness can be far from what their performance on existing datasets suggests, and that performance beyond Efficacy is generally unsatisfactory for all methods.
- Authors : Baixiang Huang*, Canyu Chen*, Xiongxiao Xu, Ali Payani, Kai Shu (*equal contributions)
- Correspondence to: Kai Shu <kai.shu@emory.edu>.
- Paper: Read our paper (arXiv:2410.16251)
- Project Website: Visit the project website https://llm-editing.github.io for more resources.
Large Language Models (LLMs) suffer from hallucinations, i.e., non-factual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has emerged as a popular new paradigm for correcting erroneous factual knowledge encoded in LLMs without retraining from scratch. However, a common issue with existing evaluation datasets for knowledge editing is that they do not ensure that LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, the resulting performance cannot be used directly to assess how effectively each knowledge editing method corrects hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs?
We propose HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics, and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods holistically on five dimensions: Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we provide new insights into the potential and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate progress in the field of knowledge editing.
- data/: Contains the hallucination detection dataset.
- code/: Includes scripts and code to evaluate hallucination mitigation using knowledge editing methods (and reproduce the results in the paper).
- results/: Results of the experiments that we report in the paper.
To set up the environment for running the code, follow these steps:
- Clone the repository:
      git clone https://github.com/link-omitted-during-review/hallu-edit.git
      cd hallu-edit
- Create a virtual environment and activate it:
      conda create -n HalluEdit python=3.9
      conda activate HalluEdit
- Install the required dependencies:
      pip install -r requirements.txt
Datasets are stored in the data/ directory. There are three folders:
data/
├── questions
│ └── hallucination_final
│ ├── llama_2_7b_chat_hf
│ ├── meta_llama_3_8b_instruct
│ └── mistral_7b_instruct_v0.3
├── topic
└── triplet
- questions contains the pre-processed hallucination detection dataset, including the questions we used to evaluate the editing methods.
- topic contains the topics we selected from WikiData.
- triplet contains the raw knowledge triplets that were used to generate the questions for hallucination detection.
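To check that the dataset is laid out as described, a small script can walk the data/ tree and count the files in each model folder. This is a minimal sketch: it assumes only the directory names shown above and makes no assumption about the names or formats of the files inside them. The helper name `count_files_per_model` is our own and is not part of the repository.

```python
from pathlib import Path


def count_files_per_model(data_root="data"):
    """Count files under each model folder of questions/hallucination_final.

    Only the directory layout shown above is assumed; nothing is assumed
    about the names or formats of the files themselves.
    """
    base = Path(data_root) / "questions" / "hallucination_final"
    if not base.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {base}")
    counts = {}
    # One entry per model folder, e.g. llama_2_7b_chat_hf.
    for model_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        counts[model_dir.name] = sum(
            1 for f in model_dir.rglob("*") if f.is_file()
        )
    return counts
```

Calling `count_files_per_model("data")` from the repository root returns a dictionary keyed by model folder name; an empty or missing folder is a quick sign that the download is incomplete.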
Run example: To get started (e.g. using ROME to edit llama3-8b on the places_landmark data), run:
cd ./code
python3 edit_all_method.py \
--model_name=llama3-8b \
--edit_method=ROME \
--topic_name=places_landmark \
--device_edit=0 \
--device_eval=1 \
--data_size=5 \
--results_dir=../new_results_dir \
--question_types rephrase_questions questions_2hop
Note:
- If --edit_method is not specified, the script runs 7 editing methods sequentially by default.
- Specify --question_types to choose specific types of questions for the evaluation (the example above evaluates only 2-hop questions and rephrased questions). Otherwise, the script runs all question types (yes_questions, no_questions, locality_questions, rephrase_questions, multiple_choice_questions, reversed_relation_questions, questions_2hop, questions_3hop, questions_4hop, questions_5hop, questions_6hop). The original questions are always included.
- Specify --results_dir to save the results to a specific directory; otherwise, the default directory is where we save the results that we report in the paper. You can also use --overwrite_result to overwrite the existing result file.
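When launching many runs, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the flags shown above. The helper `build_edit_command` is hypothetical, and treating --overwrite_result as a value-less switch is an assumption; check the script's argument parser before relying on it.

```python
import shlex


def build_edit_command(model_name, topic_name, edit_method=None,
                       question_types=None, device_edit=0, device_eval=1,
                       data_size=None, results_dir=None,
                       overwrite_result=False):
    """Assemble the argument list for edit_all_method.py."""
    cmd = [
        "python3", "edit_all_method.py",
        f"--model_name={model_name}",
        f"--topic_name={topic_name}",
        f"--device_edit={device_edit}",
        f"--device_eval={device_eval}",
    ]
    if edit_method is not None:
        # When omitted, the script runs all editing methods sequentially.
        cmd.append(f"--edit_method={edit_method}")
    if data_size is not None:
        cmd.append(f"--data_size={data_size}")
    if results_dir is not None:
        cmd.append(f"--results_dir={results_dir}")
    if overwrite_result:
        # Assumption: the flag is a boolean switch that takes no value.
        cmd.append("--overwrite_result")
    if question_types:
        # When omitted, the script evaluates every question type.
        cmd.append("--question_types")
        cmd.extend(question_types)
    return cmd


example = build_edit_command(
    "llama3-8b", "places_landmark", edit_method="ROME", data_size=5,
    results_dir="../new_results_dir",
    question_types=["rephrase_questions", "questions_2hop"],
)
print(shlex.join(example))
```

The returned list can be passed to `subprocess.run(example, cwd="code", check=True)` to launch the run from the repository root.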
To run the multi-turn editing, here is an example:
python3 edit_all_method_multi_turn.py \
--model_name=llama3-8b \
--edit_method=ROME \
--topic_name=places_landmark \
--device_edit=0 \
--device_eval=1 \
--model_eval=meta-llama/Meta-Llama-3-8B-Instruct \
--data_size=5 \
--results_dir=../new_results_dir \
--multi_turn=yes \
--multi_turn_num=10
- Use --multi_turn to choose the type of multi-turn evaluation (yes or sure).
- Use --multi_turn_num to set the number of turns for multi-turn evaluation.
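The idea behind multi-turn evaluation can be illustrated with a toy loop: after the model answers a question, it is challenged repeatedly, and each turn records whether it still gives the expected answer. The sketch below is illustrative only and is not the repository's implementation. The challenge wording, the `ask` callable, the toy model, and the exact-match comparison are all assumptions made for this example; the actual prompts and scoring are defined in edit_all_method_multi_turn.py.

```python
def multi_turn_consistency(ask, question, expected, challenge, num_turns):
    """Return one bool per turn: does the model still give `expected`?

    `ask` is a hypothetical callable that maps a chat history, given as a
    list of (role, text) tuples, to the model's next reply.
    """
    history = [("user", question)]
    reply = ask(history)
    history.append(("assistant", reply))
    kept = []
    for _ in range(num_turns):
        history.append(("user", challenge))
        reply = ask(history)
        history.append(("assistant", reply))
        # Exact matching is a simplification used for this sketch only.
        kept.append(reply.strip().lower() == expected.strip().lower())
    return kept


def toy_model(history):
    """Stand-in model: holds its answer for two challenges, then gives in."""
    challenges = sum(1 for role, _ in history if role == "user") - 1
    return "Paris" if challenges < 3 else "Lyon"


turns = multi_turn_consistency(
    toy_model,
    question="In which city is the Eiffel Tower located?",
    expected="Paris",
    challenge="Are you sure? Please answer the question again.",
    num_turns=5,
)
print(turns)  # [True, True, False, False, False]
```

A per-turn list like this makes it easy to see at which turn an edit stops holding, rather than only whether it held at the end.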
We use a local LLM (e.g., Llama3-8b) as the evaluator to assess whether model responses match the labels. For experiments, we recommend using at least one GPU with 48 GB of memory (e.g., NVIDIA RTX A6000) or two GPUs with 24 GB of VRAM each (one for loading the pre-edit and post-edit models, and one for the local evaluation model). Adjust the evaluation model and device number using --model_eval and --device_eval, as shown in the example above.
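The hardware recommendation can be turned into a small helper that suggests values for --device_edit and --device_eval. This is a sketch under stated assumptions: the function `pick_devices` is hypothetical, the thresholds simply mirror the 24 GB and 48 GB figures above, and placing both the edited model and the evaluator on a single 48 GB GPU is our reading of the recommendation rather than something the scripts are confirmed to support.

```python
def pick_devices(gpu_memory_gb):
    """Suggest (device_edit, device_eval) from per-GPU memory sizes in GB.

    Prefers two separate GPUs with at least 24 GB each; otherwise falls
    back to a single GPU with at least 48 GB shared by both roles.
    """
    big_enough = [i for i, mem in enumerate(gpu_memory_gb) if mem >= 24]
    if len(big_enough) >= 2:
        # One GPU for the pre-edit and post-edit models, one for the evaluator.
        return big_enough[0], big_enough[1]
    for i, mem in enumerate(gpu_memory_gb):
        if mem >= 48:
            # Assumption: editing and evaluation can share one large GPU.
            return i, i
    raise ValueError(
        "Need two GPUs with >= 24 GB each or one GPU with >= 48 GB"
    )


print(pick_devices([24, 24]))  # (0, 1)
print(pick_devices([48]))      # (0, 0)
```

With PyTorch installed, the memory list can be obtained from `torch.cuda.get_device_properties(i).total_memory`, which reports bytes and would need converting to GB first.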
For full experiments to reproduce the results in the paper:
- Experiment for all the 26 topics:
      ./edit_all_topic.sh
- Experiment for the robustness evaluation:
      ./code/edit_all_topic_multi_turn.sh
We evaluate instruction-tuned models including Llama-2-7B-chat, Llama-3-8B-Instruct, and Mistral-7B-v0.3. All hyperparameters are in code/hparams/<method_name>/<model_name>.
Results are stored in the llama_2_7b_chat_hf, meta_llama_3_8b_instruct, and mistral_7b_instruct_v0.3 directories under the results folder.
To summarize the results, use the Jupyter notebook code/result_table.ipynb.
We gratefully acknowledge the use of code and data from the following projects: GRACE, EasyEdit, ROME, and MEMIT.
If you find our paper or code useful, we would greatly appreciate it if you could cite our paper:
@article{huang2024canknowledge,
title = {Can Knowledge Editing Really Correct Hallucinations?},
author = {Baixiang Huang and Canyu Chen and Xiongxiao Xu and Ali Payani and Kai Shu},
year = {2024},
journal = {arXiv preprint arXiv: 2410.16251}
}