An End-to-End Entity Linking Dataset of Cultural Heritage Objects
- To download the model files (2.61 GB), run the following command:
$ bash src/download_model_files.sh
- docker
Pull the following image:
$ docker pull c1587s/aat_elq_entity_linking:latest
Run the image in a container with a shared volume:
$ docker run --gpus all --rm -it --name elq_cont -v $(pwd):/home/shared_volume c1587s/aat_elq_entity_linking
- conda environment
Create and activate the environment:
$ conda create -n elqel -y python=3.7 && conda activate elqel
Install requirements:
$ pip install -r ./requirements.txt
To produce predictions, ELQ requires its input as a Python list of dictionaries, each with an "id" and a "text" key, as in the example below:
text_to_link = [
    {
        "id": "BM-A_1936-1012-44",
        "text": ("Figure (woman) wearing Rainbow Dance costume. "
                 "Made of red, black, blue, gold painted earthenware.").lower(),
    },
    {
        "id": "BM-A_1940-0716-13",
        "text": ("Figure (Gaṇesa). Folk deity,seated feet crossed holding 2 lotuses. "
                 "Made of bronze.").lower(),
    },
]
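When the descriptions are stored in another structure, such as a mapping from object ID to free text, a small helper can produce this input format. The following is a minimal sketch: `build_elq_input` is a hypothetical helper name that is not part of this repository, and the lowercasing simply mirrors the `.lower()` calls in the example above.

```python
from typing import Dict, List


def build_elq_input(descriptions: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn an {id: description} mapping into a list of {"id", "text"} dicts.

    Hypothetical helper (not part of the repository). Text is lowercased
    to mirror the README example.
    """
    return [
        {"id": obj_id, "text": text.lower()}
        for obj_id, text in descriptions.items()
    ]


raw_descriptions = {
    "BM-A_1936-1012-44": "Figure (woman) wearing Rainbow Dance costume. "
                         "Made of red, black, blue, gold painted earthenware.",
}
text_to_link = build_elq_input(raw_descriptions)
print(text_to_link[0]["id"], "|", text_to_link[0]["text"])
```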
Once input data is properly formatted, predictions can be obtained as follows:
import os
from elqel.entity_linking import ELQEntityLinker

biencoder_path = "./biencoder/pytorch_model.bin"
models_path = "./models/"

# instantiate model
elq_model = ELQEntityLinker(models_path=models_path,
                            biencoder_path=biencoder_path,
                            prediction_type="unique")

# predict on the input list defined above
predictions = elq_model.entity_linking(text_to_link)
Output format
You may have noticed the prediction_type parameter used when instantiating the model in the example above. This parameter controls whether only the best candidate or multiple candidates from the "entity disambiguation step" are returned.
prediction_type=='unique'
Using 'unique' as prediction_type in ELQEntityLinker produces raw prediction outputs that include only the best candidate found by the model, with the following fields:
- id (str): unique text identifier
- pred_triples (list of tuples): each tuple contains the knowledge base ID followed by the start and end token positions of the tagged item
- pred_tuples_string (list of lists): each sublist contains the linked AAT title and the tagged item (str), respectively
- scores (list of floats): list of scores for the best candidate
- text (str): text being annotated
- tokens (list of ints): tokenized text
Output Example:
{'id': 'BM-A_1936-1012-44',
 'pred_triples': [('31871', 8, 9),
                  ('30731', 19, 20),
                  ('30193', 20, 23),
                  ('64866', 1, 2)],
 'pred_tuples_string': [['costume', 'costume'],
                        ['painted', 'painted'],
                        ['earthen ware', 'earthenware'],
                        ['human figures', 'figure']],
 'scores': [8.12,
            5.84,
            5.38,
            4.93],
 'text': 'figure (woman) wearing rainbow dance costume. made of red, black, '
         'blue, gold painted earthenware.',
 'tokens': [101,
            ...
            102]}
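The fields of a 'unique' prediction appear to be parallel lists, so they can be zipped into one record per linked mention. The sketch below assumes that pred_triples, pred_tuples_string, and scores are index-aligned, as the example output suggests. `flatten_prediction` is a hypothetical helper, not part of the repository, and the dictionary is an abridged copy of the example above with the tokens field omitted.

```python
# Abridged copy of the example output above ('tokens' omitted).
prediction = {
    "id": "BM-A_1936-1012-44",
    "pred_triples": [("31871", 8, 9), ("30731", 19, 20),
                     ("30193", 20, 23), ("64866", 1, 2)],
    "pred_tuples_string": [["costume", "costume"], ["painted", "painted"],
                           ["earthen ware", "earthenware"],
                           ["human figures", "figure"]],
    "scores": [8.12, 5.84, 5.38, 4.93],
}


def flatten_prediction(pred):
    """Zip the parallel lists of one prediction into flat records.

    Assumes the three lists are index-aligned (hypothetical helper).
    """
    records = []
    for (kb_id, start, end), (title, mention), score in zip(
        pred["pred_triples"], pred["pred_tuples_string"], pred["scores"]
    ):
        records.append({
            "id": pred["id"],
            "kb_id": kb_id,
            "start_token": start,
            "end_token": end,
            "aat_title": title,
            "mention": mention,
            "score": score,
        })
    return records


records = flatten_prediction(prediction)
for record in records:
    print(record["mention"], "->", record["aat_title"], record["score"])
```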
prediction_type=='multiple'
Using 'multiple' as prediction_type in ELQEntityLinker produces raw prediction outputs that include the information for the best candidates found by the model, using the fields defined above.
- To convert the prediction results to a DataFrame, the preds2dataframe() method can be used as follows:
predictions_df = elq_model.preds2dataframe(save_path = "annotations/predictions_df.csv")
predictions_df
Output Example:
id | text | chunk_text | chunk_start | chunk_end | aat |
---|---|---|---|---|---|
BM-A_1936-1012-44 | figure (woman) wearing rainbow dance costume. made of red, black, blue, gold painted earthenware. | figure | 0 | 6 | 300404114 |
BM-A_1936-1012-44 | figure (woman) wearing rainbow dance costume. made of red, black, blue, gold painted earthenware. | costume | 37 | 44 | 300178802 |
BM-A_1936-1012-44 | figure (woman) wearing rainbow dance costume. made of red, black, blue, gold painted earthenware. | painted | 77 | 84 | 300161986 |
BM-A_1936-1012-44 | figure (woman) wearing rainbow dance costume. made of red, black, blue, gold painted earthenware. | earthenware | 85 | 96 | 300140803 |
BM-A_1940-0716-13 | figure (ganesa). folk deity,seated feet crossed holding 2 lotuses. made of bronze. | figure | 0 | 6 | 300189808 |
BM-A_1940-0716-13 | figure (ganesa). folk deity,seated feet crossed holding 2 lotuses. made of bronze. | feet | 35 | 39 | 300310200 |
BM-A_1940-0716-13 | figure (ganesa). folk deity,seated feet crossed holding 2 lotuses. made of bronze. | bronze | 75 | 81 | 300010957 |
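In the example rows, chunk_start and chunk_end behave like character offsets into the text column, with an exclusive end. A quick sanity check under that assumption is sketched below. The rows are copied from the table above as plain dictionaries so the snippet runs without the model, and `offsets_match` is a hypothetical helper name. With a real DataFrame, the same rows could be obtained through pandas' `to_dict("records")`.

```python
# Two rows copied from the example table above.
rows = [
    {"id": "BM-A_1936-1012-44",
     "text": "figure (woman) wearing rainbow dance costume. made of red, "
             "black, blue, gold painted earthenware.",
     "chunk_text": "costume", "chunk_start": 37, "chunk_end": 44,
     "aat": 300178802},
    {"id": "BM-A_1940-0716-13",
     "text": "figure (ganesa). folk deity,seated feet crossed holding 2 "
             "lotuses. made of bronze.",
     "chunk_text": "bronze", "chunk_start": 75, "chunk_end": 81,
     "aat": 300010957},
]


def offsets_match(row):
    """Return True if slicing text by the offsets reproduces chunk_text."""
    return row["text"][row["chunk_start"]:row["chunk_end"]] == row["chunk_text"]


# Collect the ids of rows whose offsets do not reproduce the chunk text.
mismatches = [row["id"] for row in rows if not offsets_match(row)]
print(mismatches)
```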
As this dataset is composed of annotations from several museum collections, it is governed by several different license terms. The "museum" column identifies the museum each description belongs to, and the license that applies to each museum is listed below.
The annotations of the following museums have a CC0 license:
- AIC (The Art Institute of Chicago)
- CMA (Cleveland Museum of Art)
- MET (The Metropolitan Museum of Art)
- WCMA (Williams College Museum of Art)
- SMITH (Smithsonian National Museum)
Annotations pertaining to the museums listed below are licensed under a CC BY-NC-SA 4.0 license:
- BM (British Museum)
- PAHMA (Phoebe A. Hearst Museum of Anthropology)
The annotations of the museums listed below are licensed under a CC BY-NC-SA 3.0 license:
- PENN (Penn Museum)
The annotations belonging to the museums listed below are licensed under a CC0 1.0 license:
The use of the ELQ-fine-tuned model is licensed under an MIT license.
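Because the annotation license depends on the museum, a lookup keyed on the "museum" column can help when filtering the dataset. The sketch below assumes that the column holds the short codes used in this section (AIC, BM, and so on). `LICENSE_BY_MUSEUM` and `license_for` are hypothetical names, and only museums explicitly listed above are included.

```python
# License terms per museum code, as listed in this section.
LICENSE_BY_MUSEUM = {
    "AIC": "CC0",
    "CMA": "CC0",
    "MET": "CC0",
    "WCMA": "CC0",
    "SMITH": "CC0",
    "BM": "CC BY-NC-SA 4.0",
    "PAHMA": "CC BY-NC-SA 4.0",
    "PENN": "CC BY-NC-SA 3.0",
}


def license_for(museum_code):
    """Return the license listed for a museum code (hypothetical helper).

    Raises KeyError for codes that are not listed in this section.
    """
    try:
        return LICENSE_BY_MUSEUM[museum_code]
    except KeyError:
        raise KeyError(
            f"No license listed for museum code {museum_code!r}"
        ) from None


print(license_for("BM"))
```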