This is the repository for the papers "Why Do Neural Language Models Still Need Commonsense Knowledge?" and "Impact of Co-occurrence on Factual Knowledge of Large Language Models" (EMNLP 2023 Findings) (project page).
Follow this to run the knowledge probing experiments.
This includes setting up a conda environment and knowledge probing datasets.
The Pile dataset is downloaded and extracted into 'data/pile'.
bash scripts/installation/download_pile.sh
bash scripts/installation/extract_pile.sh
For other datasets, place them in 'data/{dataset_name}'.
The extracted entity sets are saved in 'data_statistics/entity_set'.
bash scripts/data_statistics/precompute/extract_entity_set.sh {dataset_names}
For example, run the following command to extract entities from LAMA_TREx and ConceptNet.
bash scripts/data_statistics/precompute/extract_entity_set.sh "LAMA_TREx ConceptNet"
The term-document index is saved in 'data_statistics/term_document_index/{pretraining_dataset_name}'.
In addition to pretraining_dataset_name, specify the name of the text file: when the dataset is split into multiple chunks, the script processes each chunk individually.
# pretraining_dataset_name: ['pile', 'bert_pretraining_data']
bash scripts/data_statistics/precompute/compute_term_document_index.sh {pretraining_dataset_name} {filename}
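As a rough illustration of what a term-document index holds, the sketch below maps each entity to the set of document IDs in which it appears. It is not the repository's implementation: the function name `build_term_document_index`, the toy documents, and the naive case-insensitive substring matching are all assumptions made for this example.

```python
from collections import defaultdict


def build_term_document_index(documents, entities):
    """Map each entity to the set of document ids whose text mentions it.

    Hypothetical helper for illustration only; the actual script may
    tokenize and match entities differently.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        lowered = text.lower()
        for entity in entities:
            # Naive case-insensitive substring match (an assumption).
            if entity.lower() in lowered:
                index[entity].add(doc_id)
    return dict(index)


# Toy corpus standing in for one chunk of a pretraining dataset.
docs = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "A table is made of wood.",
]
index = build_term_document_index(docs, ["Paris", "France", "wood"])
```

Storing document IDs per entity, rather than raw counts, lets later steps derive both occurrence and co-occurrence statistics from the same structure.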
The co-occurrence and occurrence matrices are saved in 'data_statistics/cooccurrence_matrix/{pretraining_dataset_name}' and 'data_statistics/occurrence_matrix/{pretraining_dataset_name}', respectively.
bash scripts/data_statistics/precompute/compute_cooccurrence_matrix.sh {pretraining_dataset_name} {filename}
bash scripts/data_statistics/precompute/aggregate_cooccurrence_matrix.sh {pretraining_dataset_name}
bash scripts/data_statistics/precompute/compute_occurrence_matrix.sh {pretraining_dataset_name} {filename}
bash scripts/data_statistics/precompute/aggregate_occurrence_matrix.sh {pretraining_dataset_name}
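The compute-then-aggregate pattern above can be pictured as follows. This is a minimal sketch that assumes co-occurrence is counted as the number of documents containing both entities, and that aggregation sums the per-chunk counts. The helper names `cooccurrence_counts` and `aggregate` are hypothetical, and the repository's scripts may store the matrices in a different form.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(index):
    """Count, for each entity pair, the documents that mention both.

    `index` maps an entity to the set of document ids containing it
    (a per-chunk term-document index; hypothetical format).
    """
    counts = Counter()
    for a, b in combinations(sorted(index), 2):
        shared = len(index[a] & index[b])
        if shared:
            counts[(a, b)] = shared
    return counts


def aggregate(chunk_counts):
    """Sum per-chunk counts into corpus-level counts."""
    total = Counter()
    for counts in chunk_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


# Two toy chunks; document ids are local to each chunk.
chunk_1 = cooccurrence_counts({"France": {0, 1}, "Paris": {0}})
chunk_2 = cooccurrence_counts({"France": {0}, "Paris": {0}, "Spain": {0}})
total = aggregate([chunk_1, chunk_2])
```

Because the per-chunk counts are additive, chunks can be processed independently and combined afterwards.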
The prediction files are saved in 'results/{baseline_name}/{pretraining_dataset_name}'.
bash scripts/data_statistics/term_frequency_baselines/marginal_probability.sh {pretraining_dataset_name} {dataset_name}
bash scripts/data_statistics/term_frequency_baselines/joint_probability.sh {pretraining_dataset_name} {dataset_name}
bash scripts/data_statistics/term_frequency_baselines/PMI.sh {pretraining_dataset_name} {dataset_name}
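For orientation, the three baselines can be written in terms of document counts using their standard textbook definitions. The sketch below is an assumption about what each score computes, not a copy of the repository's code. The function names, the argument names, and the choice to return negative infinity for zero counts are illustrative.

```python
import math


def marginal_probability(count_obj, num_docs):
    """P(obj): fraction of documents that mention the object."""
    return count_obj / num_docs


def joint_probability(count_pair, num_docs):
    """P(subj, obj): fraction of documents that mention both."""
    return count_pair / num_docs


def pmi(count_pair, count_subj, count_obj, num_docs):
    """Pointwise mutual information: log P(s, o) / (P(s) * P(o))."""
    if count_pair == 0 or count_subj == 0 or count_obj == 0:
        # Zero-count handling is an assumption made for this sketch.
        return float("-inf")
    return math.log((count_pair * num_docs) / (count_subj * count_obj))


# Toy counts: 1000 documents, subject in 100, object in 50, both in 20.
score = pmi(count_pair=20, count_subj=100, count_obj=50, num_docs=1000)
```

With these toy counts the pair co-occurs four times more often than independence would predict, so the PMI equals log 4.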
Refer to the IPython notebook for the correlation analysis.
Refer to the IPython notebook for the analysis of the madeof relation.
Refer to the IPython notebook for the analysis of two opposite relations.