This repository contains the text augmentation pipeline for the seed data of CNLI-TR.
CNLI-TR is a Turkish challenge dataset created to assess the natural language inference (NLI) abilities of language models. It contains sentence triplets: one sentence that is potentially ambiguous between a De Re and a De Dicto reading, one De Re paraphrase, and one De Dicto paraphrase.
All of the seed data was generated manually by trained linguists who are native speakers of Turkish.
The code and resources in this repository are used to augment the seed data to create a large, manually corrected NLI dataset.
List of Turkish given names (names.csv): This .csv file contains XX Turkish proper names scraped from the web [1]. The gender [2] and origin [3] of each name are indicated in the corresponding columns.
Turkish intensional operators list (): This .csv file contains XXX Turkish intensional operators [4].
Seed data (annotated version): The seed data was generated manually by trained linguists who are native speakers of Turkish. It consists of sentence triplets: one sentence that is potentially ambiguous between a De Re and a De Dicto reading, one De Re paraphrase, and one De Dicto paraphrase.
Unique sentence ID generator (id_generator.py): Each sentence in the seed data set and in the augmented data set has a unique alphanumeric identifier. A sentence ID consists of three letters followed by an underscore and a 5-digit number. In seed data IDs, the initial letter indicates the contributor who wrote the sentence. The IDs were created with an algorithm that generates random numbers and strings.
Augmentation pipeline (): The augmentation pipeline uses the seed data to generate new sentence triplets. Details will be published soon.
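The ID format described for id_generator.py (three letters, an underscore, and a 5-digit number) can be illustrated with a minimal sketch. This is not the code in id_generator.py; the function name `generate_sentence_id`, the use of lowercase letters, the zero-padding of the number, and the way uniqueness is enforced are all assumptions made for illustration.

```python
import random
import string


def generate_sentence_id(existing_ids, contributor_letter=None, rng=random):
    """Return a new ID such as 'bxk_04217' that is not yet in existing_ids.

    Hypothetical helper: the real id_generator.py may differ.
    - contributor_letter: fixed first letter for seed data (assumed lowercase);
      if None, the first letter is drawn at random as well.
    - existing_ids: a set that is updated in place to keep IDs unique.
    """
    while True:
        first = contributor_letter or rng.choice(string.ascii_lowercase)
        rest = "".join(rng.choice(string.ascii_lowercase) for _ in range(2))
        number = rng.randrange(100000)  # 0..99999, zero-padded below (assumption)
        candidate = f"{first}{rest}_{number:05d}"
        if candidate not in existing_ids:  # retry on the rare collision
            existing_ids.add(candidate)
            return candidate


if __name__ == "__main__":
    used = set()
    print(generate_sentence_id(used, contributor_letter="b"))  # seed sentence
    print(generate_sentence_id(used))  # augmented sentence
```

Keeping the set of issued IDs and retrying on collision is one simple way to guarantee uniqueness when IDs are drawn at random; the output differs on every run because no random seed is fixed.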
=== Machine-readable metadata ================================
Data available since: 11.2022
License: CC BY-SA 4.0
Includes text: yes
Contributors: Marşan, Büşra; Atlamaz, Ümit; Demirok, Ömer; Kuzgun, Aslı; Oksal, Ceren; Doğan, Merve; Gök, Serra; Korkmaz, Arda
Contact: busra.marsan@boun.edu.tr
===============================================================================
Footnotes
[2] "f" for feminine, "m" for masculine, "u" for unisex.
[3] "ar" for Arabic, "ge" for Georgian, "gr" for Greek, "hb" for Hebrew, "mg" for Mongolian, "pr" for Persian, "tr" for Turkish, and "n/a" for unknown origins. Please note that some names were recorded as having two origins, e.g. ar-tr.
[4] Intensional operator: Any expression O that combines with sentences φ to form well-formed expressions (usually sentences) Oφ and whose extension ⟦O⟧^{M,i} at an index i in a model M takes sentential intensions, i.e. functions from indices to truth values. (Wehmeier, K. F. (2018). Are quantifiers intensional operators? Inquiry.)