This repository contains the source code to replicate the creation of the transformer-based POS tagger model used in the paper Is Part-of-Speech Tagging a Solved Problem for Icelandic?.
The fine-tuned models that form the basis for the evaluation of the model configuration reported in the paper are also made available.
There are also scripts to train and evaluate new models; see the subsections below.
If you find this work useful in your research, please cite the paper:
@inproceedings{
karason2023is,
title={Is Part-of-Speech Tagging a Solved Problem for Icelandic?},
author={{\"O}rvar K{\'a}rason and Hrafn Loftsson},
booktitle={The 24th Nordic Conference on Computational Linguistics},
year={2023}
}
The fine-tuned models for four model configurations and 10 folds are provided. There are .zip
files named A, B, C, and D that match the models with the same designations. The results for model C were reported in the paper (Table 2), and model D was used for comparison in the table above (Table 3).
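The internal layout of the .zip files is not described here, but if each extracted fold is a standard Hugging Face checkpoint (saved with save_pretrained), it can be loaded roughly as in the sketch below. The directory path ./C/fold-1 and the token-classification head are assumptions, not part of the repository documentation.

```python
# Minimal sketch for loading one extracted fold with the transformers library,
# assuming the folder is a standard checkpoint with a token-classification head.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_dir = "./C/fold-1"  # hypothetical path to one extracted fold

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForTokenClassification.from_pretrained(model_dir)

# Tag one Icelandic sentence; tag names come from the checkpoint's config.
inputs = tokenizer("Hún las bókina í gær .", return_tensors="pt")
predicted_ids = model(**inputs).logits.argmax(dim=-1)[0].tolist()
tags = [model.config.id2label[i] for i in predicted_ids]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), tags)))
```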
python ./train.py --name x --fold 1
Use --help to see all parameter options.
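Since the paper evaluates over 10 folds, training can be repeated once per fold. The sketch below simply shells out to train.py for each fold, using only the --name and --fold parameters shown above; the per-fold naming scheme and the fold range 1-10 are assumptions.

```python
# Sketch: train one model per fold by invoking train.py repeatedly.
# Only the --name and --fold parameters documented above are used.
import subprocess

for fold in range(1, 11):
    subprocess.run(
        ["python", "./train.py", "--name", f"x-fold{fold}", "--fold", str(fold)],
        check=True,
    )
```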
python ./evaluate.py --model x --fold 1
Use --help to see all parameter options.
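Evaluation can be looped over the folds in the same way; again, only the documented --model and --fold parameters are used, and the model names are assumed to match those chosen at training time.

```python
# Sketch: evaluate one trained model per fold by invoking evaluate.py repeatedly.
import subprocess

for fold in range(1, 11):
    subprocess.run(
        ["python", "./evaluate.py", "--model", f"x-fold{fold}", "--fold", str(fold)],
        check=True,
    )
```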
Version 21.05 of MIM-GOLD needs to be downloaded from CLARIN-IS. The scripts assume it is in the subfolder ./MIM-GOLD-SETS.21.05/sets, but the parameter --data can be used to point to a different folder containing the fold sets.
wget https://repository.clarin.is/repository/xmlui/bitstream/handle/20.500.12537/114/MIM-GOLD-SETS-21.05.zip
unzip -o "MIM-GOLD-SETS-21.05.zip"
The training script handles downloading the pre-trained jonfd/convbert-base-igc-is ConvBERT-base model from Hugging Face.
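No manual download of the pre-trained model is needed, but for reference the same checkpoint can be fetched directly with the transformers library; the snippet below only illustrates the download-and-cache step that the training script performs.

```python
# Illustration: fetching the pre-trained Icelandic ConvBERT checkpoint that
# train.py starts from. transformers caches it locally on first use.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("jonfd/convbert-base-igc-is")
model = AutoModel.from_pretrained("jonfd/convbert-base-igc-is")
print(model.config.model_type)  # "convbert"
```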