WDV is a dataset for the verbalisation of Wikidata triples. It is thoroughly described in the paper that accompanies it.
It is a large, partially annotated dataset of over 7.6k entries that align a broad collection of Wikidata claims with their respective verbalisations.
Each entry contains the following attributes:
- attributes describing the claim, such as its Wikidata ID (claim id) and its rank (normal, deprecated, or preferred);
- attributes from the claim's components (subject, predicate, and object), including their Wikidata IDs (e.g. subject id), labels (e.g. subject label), descriptions (e.g. subject desc), and aliases (e.g. subject alias);
- a JSON representation of the object alongside its type (object datatype) as defined by Wikidata;
- attributes from the claim's theme, such as its root class' Wikidata ID (theme root class id) and label (theme label);
- the aligned verbalisation, both before and after replacement of tokens unknown to the model (verbalisation unk replaced);
- the sampling weight from the stratified sampling process;
- the crowdsourced annotations and their aggregations, for the ~1.4k entries that are annotated.
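As a rough illustration, the sketch below loads a local copy of the dataset and prints a few of these attributes. The file name (wdv.json) and the underscore field spellings (e.g. claim_id) are assumptions; consult the files released on Figshare for the authoritative schema.

```python
import json

# Minimal sketch: load WDV entries from a local JSON export downloaded from
# Figshare and inspect a few of the attributes described above. The file name
# "wdv.json" and the exact field names (e.g. "claim_id") are assumptions.
with open("wdv.json", encoding="utf-8") as f:
    entries = json.load(f)

for entry in entries[:3]:
    print(entry.get("claim_id"), entry.get("rank"))
    print(entry.get("subject_label"), entry.get("predicate_label"), entry.get("object_label"))
    print(entry.get("verbalisation"), "->", entry.get("verbalisation_unk_replaced"))
```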
WDV is a 3-star dataset according to the 5-star deployment scheme for Linked Data: it is available on the web in a structured, machine-readable, and non-proprietary format. WDV is aimed at directly helping manage reference quality in Wikidata by closing the gap in form between the data in the KG and the data in its sources. It has already enabled efforts towards automated fact verification in Wikidata.
This repository contains all the data and scripts used in the construction of WDV. It is structured as follows:
- WikidataClaims: This folder contains the scripts that parse Wikidata dumps and assemble the stratified sample of Wikidata claims for all three partitions used in the study (WebNLG_SEEN, WebNLG_UNSEEN, WD_UNSEEN).
- Verbalisation: This folder contains the scripts and model used in the verbalisation of the claims obtained from the WikidataClaims scripts.
- Crowdsourcing: This folder contains all scripts, HTML templates, crowdsourcing results, and data artefacts from the crowdsourcing tasks used to measure the fluency and adequacy of the verbalisations.
There are individual README.md files inside each folder for more detailed descriptions of their contents.
The authors have re-organised the file structure of this repository, so some file paths used in the scripts might need adjusting. Additionally, some intermediate files generated during this project were too big to store in this repository, but they can be re-generated by running the scripts found here. We encourage anyone who wants these files either to contact us or (preferably) to run the scripts to obtain updated versions of these by-products. This includes the dumps used at the very start of the data creation process, which can be obtained by downloading the most recent latest-all.json.bz2 Wikidata entity dump.
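As a rough illustration, the sketch below streams entities out of such a dump without decompressing it to disk. It is not the repository's own parsing code, and the local file path is an assumption.

```python
import bz2
import json

# Stream entities from a Wikidata JSON dump (latest-all.json.bz2) without
# loading it all into memory. The dump is one large JSON array with one
# entity object per line, so we skip the surrounding brackets and strip
# trailing commas before parsing each line.
def iter_entities(dump_path):
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line in ("[", "]"):
                continue
            yield json.loads(line.rstrip(","))

if __name__ == "__main__":
    # Path is an assumption; adjust to wherever the dump was downloaded.
    for entity in iter_entities("latest-all.json.bz2"):
        print(entity["id"], entity.get("labels", {}).get("en", {}).get("value"))
        break
```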
You can check our paper here: https://arxiv.org/abs/2205.02627
You can also find the data on Figshare here: https://figshare.com/articles/dataset/WDV/17159045
The Dataset:
@article{WDVAmaral2022,
  author = {Gabriel Amaral},
  title  = {{WDV}},
  year   = {2022},
  month  = {5},
  url    = {https://figshare.com/articles/dataset/WDV/17159045},
  doi    = {10.6084/m9.figshare.17159045.v1}
}
The Paper:
@misc{https://doi.org/10.48550/arxiv.2205.02627,
  doi       = {10.48550/ARXIV.2205.02627},
  url       = {https://arxiv.org/abs/2205.02627},
  author    = {Amaral, Gabriel and Rodrigues, Odinaldo and Simperl, Elena},
  keywords  = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title     = {WDV: A Broad Data Verbalisation Dataset Built from Wikidata},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Zero v1.0 Universal}
}
For more details and questions, contact Gabriel at gabriel.amaral@kcl.ac.uk.