This repository contains the source code to replicate the creation of the transformer-based POS tagger model used in the paper Is Part-of-Speech Tagging a Solved Problem for Icelandic?.
The fine-tuned models that form the basis for the evaluation of the model configuration reported in the paper are also made available.
There are also scripts to train and evaluate new models; see the subsections below.
If you find this work useful in your research, please cite the paper:
@inproceedings{
karason2023is,
title={Is Part-of-Speech Tagging a Solved Problem for Icelandic?},
author={{\"O}rvar K{\'a}rason and Hrafn Loftsson},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
The fine-tuned models for four model configurations and 10 folds are provided. There are .zip
files named A, B, C, and D that match the models with the same designations. The results for model C were reported in the paper (Table 2), and model D was used for comparison in the table above (Table 3).
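The internal layout of the .zip files is not described here, but if each extracted fold is a standard Hugging Face checkpoint (saved with save_pretrained), it can be loaded roughly as in the sketch below. The directory path ./C/fold-1 and the token-classification head are assumptions, not part of the repository documentation.

```python
# Minimal sketch for loading one extracted fold with the transformers library,
# assuming the folder is a standard checkpoint with a token-classification head.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_dir = "./C/fold-1"  # hypothetical path to one extracted fold

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

# Tag one Icelandic sentence; tag names come from the checkpoint's config.
inputs = tokenizer("Hún las bókina í gær .", return_tensors="pt")
predicted_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
tags = [model.config.id2label[i] for i in predicted_ids]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), tags)))
```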
python ./train.py --name x --fold 1
Use --help to see all parameter options.
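Since the paper evaluates over 10 folds, training can be repeated once per fold. The sketch below simply shells out to train.py for each fold, using only the --name and --fold parameters shown above; the per-fold naming scheme and the fold range 1-10 are assumptions.

```python
# Sketch: train one model per fold by invoking train.py repeatedly.
# Only the --name and --fold parameters documented above are used.
import subprocess

for fold in range(1, 11):
    subprocess.run(
        ["python", "./train.py", "--name", f"x-fold{fold}", "--fold", str(fold)],
        check=True,
    )
```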
python ./evaluate.py --model x --fold 1
Use --help to see all parameter options.
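Evaluation can be looped over the folds in the same way; again, only the documented --model and --fold parameters are used, and the model names are assumed to match those chosen at training time.

```python
# Sketch: evaluate one trained model per fold by invoking evaluate.py repeatedly.
import subprocess

for fold in range(1, 11):
    subprocess.run(
        ["python", "./evaluate.py", "--model", f"x-fold{fold}", "--fold", str(fold)],
        check=True,
    )
```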
Version 21.05 of MIM-GOLD needs to be downloaded from CLARIN-IS. The scripts assume it is in the subfolder ./MIM-GOLD-SETS.21.05/sets, but the parameter --data can be used to point to a different folder containing the fold sets.
wget https://repository.clarin.is/repository/xmlui/bitstream/handle/20.500.12537/114/MIM-GOLD-SETS-21.05.zip
unzip -o "MIM-GOLD-SETS-21.05.zip"
The training script handles downloading the pre-trained jonfd/convbert-base-igc-is ConvBERT-base model from Hugging Face.
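No manual download of the pre-trained model is needed, but for reference the same checkpoint can be fetched directly with the transformers library; the snippet below only illustrates the download-and-cache step that the training script performs.

```python
# Illustration: fetching the pre-trained Icelandic ConvBERT checkpoint that
# train.py starts from. transformers caches it locally on first use.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jonfd/convbert-base-igc-is")
model = AutoModel.from_pretrained("jonfd/convbert-base-igc-is")
print(model.config.model_type)  # "convbert"
```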