(A Gold Standard Dependency Treebank Corpus for Indonesian)
We proposed revisions to the UD Indonesian PUD treebank, provided by Universal Dependencies (UD), so that its annotation conforms to Indonesian grammar.
Note: We donated this dataset to UD in 2020 and will maintain the dataset on the UD repository. Hence, the newest version of the dataset can be found there.
The short annotation guidelines for this revision can be found on the UD website.
2020-10-27 v2.0
- added lemma
- added features (14 features, 27 feature tags)
- revised MWE words annotation
- removed compound:prt
- UPOS correction for MWE words
- revised word segmentation
- for words ending with the clitic -nya, especially in predicate nominalisation cases
- revised annotations of multiword tokens (MWT), especially for words ending with the clitic -nya or the particles lah/kah/tah/pun, including revising the SpaceAfter=No annotation
- changed the UPOS:
- of possessive personal pronouns, from DET to PRON
- added and removed subtypes:
- nmod:lmod used for locative nouns
- renamed flat:range to just flat
- renamed some flat tokens to flat:name (for PROPN-PROPN pairs)
2019-08-17 v1.0
- revised tokenization (major revision, especially for reduplicated words)
- revised UPOS (major revision)
- proposed changes to the language-specific dependency relations for Indonesian
- revised syntactic annotation (major revision)
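The annotation layers named in the changelog (lemma, UPOS, features, multiword tokens, SpaceAfter=No) correspond to columns of the CoNLL-U format used by UD treebanks. As a rough sketch of how one might inspect them, the Python snippet below counts feature names, feature tags, and multiword tokens in CoNLL-U text. The function name `summarize` and the sample sentence are hypothetical: the sample was written for illustration only and is not copied from the treebank.

```python
from collections import Counter

# Hypothetical sample in CoNLL-U format, written for illustration only
# (not copied from the treebank). The ten columns are tab-separated:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
SAMPLE = "\n".join([
    "# text = Bukunya hilang.",
    "1-2\tBukunya\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tBuku\tbuku\tNOUN\t_\tNumber=Sing\t3\tnsubj\t_\t_",
    "2\tnya\tnya\tPRON\t_\tNumber=Sing|Person=3|Poss=Yes|PronType=Prs\t1\tnmod:poss\t_\t_",
    "3\thilang\thilang\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
])


def summarize(conllu_text):
    """Collect feature names, feature tags, UPOS counts, and MWT forms."""
    feature_names = set()
    feature_tags = set()
    upos_counts = Counter()
    mwt_forms = []

    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, _lemma, upos, _xpos, feats = cols[:6]

        if "-" in tok_id:
            # Range IDs such as "1-2" mark a multiword token (MWT).
            mwt_forms.append(form)
            continue
        if "." in tok_id:
            continue  # empty nodes carry no basic-tree annotation

        upos_counts[upos] += 1
        if feats != "_":
            for pair in feats.split("|"):
                name, _, value = pair.partition("=")
                feature_names.add(name)
                # A feature may list several comma-separated values.
                for single_value in value.split(","):
                    feature_tags.add(f"{name}={single_value}")

    return {
        "feature_names": feature_names,
        "feature_tags": feature_tags,
        "upos_counts": upos_counts,
        "mwt_forms": mwt_forms,
    }


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    print(sorted(summary["feature_names"]))
    print(summary["mwt_forms"])
```

To apply the same function to the treebank, read the `.conllu` file as UTF-8 text and pass its contents to `summarize`; the numbers it reports depend on the version of the file you downloaded.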
- Designing Indonesian annotation guidelines: Ika Alfina, Daniel Zeman, and Arawinda Dinakaramani
- Annotators: Ika Alfina, Arawinda Dinakaramani, Muhammad Yudistira Hanifmuti, Jessica Naraiswari Arwidarasti, Yogi Lesmana Sulestio
- Ika Alfina
- Arawinda Dinakaramani
- Ika Alfina, Daniel Zeman, Arawinda Dinakaramani, Indra Budi, and Heru Suhartanto. "Selecting the UD v2 Morphological Features for Indonesian Dependency Treebank". In the Proceedings of the 2020 International Conference of Asian Language Processing (IALP) in Kuala Lumpur, Malaysia, 4-6 December 2020.
- Ika Alfina, Arawinda Dinakaramani, Mohamad Ivan Fanany, and Heru Suhartanto. "A Gold Standard Dependency Treebank for Indonesian". In the Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC) 2019 in Hakodate, Japan, 13-15 September 2019.
You can use this dataset for free, and you do not need our permission to do so. If you use our data in a publication, please cite our papers. Please note that you are not allowed to copy this dataset and share it publicly in your own repository without our permission.
ika.alfina [at] cs.ui.ac.id