Impact of Co-occurrence on Factual Knowledge of Large Language Models (EMNLP 2023 Findings)

This is a repository for the paper "Why Do Neural Language Models Still Need Commonsense Knowledge?" and "Impact of Co-occurrence on Factual Knowledge of Large Language Models" (EMNLP 2023 Findings) (project page).

Installation

Knowledge Probing

Follow this to run the knowledge probing experiments.
This includes setting up a conda environment and knowledge probing datasets.

Download the Pre-training Data (the Pile - No Longer Available)

The dataset is saved in 'data/pile'.

bash scripts/installation/download_pile.sh
bash scripts/installation/extract_pile.sh

For other datasets, place them in 'data/{dataset_name}'.

Precompute Data Statistics

Extract Entities in the Target Datasets and Model Vocabularies

The outputs are saved in 'data_statistics/entity_set'.

bash scripts/data_statistics/precompute/extract_entity_set.sh {dataset_names}

For example, run the following command to extract entities from LAMA_TREx and ConceptNet.

bash scripts/data_statistics/precompute/extract_entity_set.sh "LAMA_TREx ConceptNet"

Compute Term Document Index of Entities

The outputs are saved in 'data_statistics/term_document_index/{pretraining_dataset_name}'.
In addition to pretraining_dataset_name, the name of the text file needs to be specified as the script processes each data chunk individually when the dataset is split into multiple chunks.

# pretraining_dataset_name: ['pile', 'bert_pretraining_data']
bash scripts/data_statistics/precompute/compute_term_document_index.sh {pretraining_dataset_name} {filename}

Compute Cooccurrence Matrix

The outputs are saved in 'data_statistics/cooccurrence_matrix/{pretraining_dataset_name}' and 'data_statistics/occurrence_matrix/{pretraining_dataset_name}'.

bash scripts/data_statistics/precompute/compute_cooccurrence_matrix.sh {pretraining_dataset_name} {filename}
bash scripts/data_statistics/precompute/aggregate_cooccurrence_matrix.sh {pretraining_dataset_name}

bash scripts/data_statistics/precompute/compute_occurrence_matrix.sh {pretraining_dataset_name} {filename}
bash scripts/data_statistics/precompute/aggregate_occurrence_matrix.sh {pretraining_dataset_name}