Universal Dependencies datasets preprocess and autodownloads.

arthurdjn/udpos

udpos

Universal Dependencies (UD) dataset converter, with auto-downloadable preprocessed sources.

About

This repository contains preprocessed UD datasets for Part-of-Speech (POS) tagging tasks. The datasets are converted from .conllu to .txt format. They can be downloaded through a subclass of torchtext.datasets.sequence_tagging.SequenceTaggingDataset: just change the urls to point to the dataset of your choice.

Example

The datasets follow the original template of the Universal Dependencies English Treebank. Create your custom UDPOS dataset for the language of your choice: replace the urls and the saving directory dirname, where the downloaded files are extracted. That's it!

from torchtext.datasets import SequenceTaggingDataset

class UDPOSFR(SequenceTaggingDataset):
    # Universal Dependencies French Web Treebank.
    # Download original at http://universaldependencies.org/
    # License: http://creativecommons.org/licenses/by-sa/4.0/
    urls = ['https://github.com/arthurdjn/udpos/raw/master/data/fr-gsd-ud-15032020.zip'] # change to the dataset of your choice
    dirname = 'fr-gsd-ud'  # don't forget to change me too !
    name = 'udpos'         # not obligatory to change here

    @classmethod
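    def splits(cls, fields, root=".data",
               # One file per split, assuming the archive follows the usual
               # UD naming scheme (train / dev / test); adjust if it differs.
               train="fr_gsd-ud-train.txt",
               validation="fr_gsd-ud-dev.txt",
               test="fr_gsd-ud-test.txt", **kwargs):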
        """Downloads and loads the Universal Dependencies Version 2 POS Tagged
        data.
        """

        return super(UDPOSFR, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)

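Then, load the splits:

# Note: the Field API only exists in older torchtext releases
# (it moved to torchtext.legacy in 0.9 and was removed in 0.12).
from torchtext import data
# UDPOSFR is the class defined above; if you saved it in a separate
# file, import it from that module instead.

# One Field per column of the .txt files
TEXT = data.Field(lower=True)
LEMMATIZED = data.Field(unk_token=None)
UD_TAGS = data.Field(unk_token=None)
fields = (("text", TEXT), ("lemmatized", LEMMATIZED), ("udtags", UD_TAGS))

# Download (if needed) and load the UD French dataset
train_data, eval_data, test_data = UDPOSFR.splits(fields)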
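
To make the link between the fields tuple and the file columns concrete, here is a minimal sketch of how such a tagging file can be read, one column per field. It does not use torchtext. read_tagged_sentences is a hypothetical helper, and the three-column, tab-separated layout with blank lines between sentences is an assumption inferred from the fields above, not something checked against the archives.

```python
def read_tagged_sentences(lines, separator="\t"):
    """Group token lines into sentences; a blank line ends a sentence.

    Each sentence is returned as a tuple of columns, e.g.
    (words, lemmas, tags) for a three-column file.
    """
    sentences = []
    rows = []  # token rows of the sentence currently being read
    for line in lines:
        line = line.strip()
        if not line:
            if rows:
                # Transpose rows -> columns, one column per field
                sentences.append(tuple(zip(*rows)))
                rows = []
            continue
        rows.append(line.split(separator))
    if rows:  # last sentence when the input has no trailing blank line
        sentences.append(tuple(zip(*rows)))
    return sentences


sample = ["Le\tle\tDET", "chat\tchat\tNOUN", "", "Il\til\tPRON"]
sentences = read_tagged_sentences(sample)
words, lemmas, tags = sentences[0]
```

Each column then lines up with one entry of fields, in order: text, lemmatized, udtags.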

Convert your own dataset

If you want to use a dataset that is not saved under the data/ folder, you can convert it yourself with the conllu2txt(filename) function, located in utils/. It preprocesses the original .conllu dataset into a .txt file ready to use with PyTorch.
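
For readers curious about what such a conversion involves, here is a rough sketch of the idea. It is not the repository's conllu2txt implementation: conllu_to_rows is a hypothetical name, and keeping only the FORM, LEMMA and UPOS columns is an assumption based on the three fields used in the example above. It relies on the standard CoNLL-U layout (ten tab-separated columns, comment lines starting with #, blank lines between sentences).

```python
def conllu_to_rows(conllu_text):
    """Turn CoNLL-U text into 'form<TAB>lemma<TAB>upos' rows.

    An empty string in the result marks a sentence boundary.
    """
    rows = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # Sentence boundary; avoid leading or duplicate blank rows
            if rows and rows[-1] != "":
                rows.append("")
            continue
        if line.startswith("#"):
            continue  # sentence-level comment (sent_id, text, ...)
        cols = line.split("\t")
        if len(cols) < 4:
            continue  # malformed line, skipped in this sketch
        token_id = cols[0]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1")
        if "-" in token_id or "." in token_id:
            continue
        form, lemma, upos = cols[1], cols[2], cols[3]
        rows.append("\t".join((form, lemma, upos)))
    return rows


sample = "\n".join([
    "# sent_id = 1",
    "1-2\tDu\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tDe\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tle\tle\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tchat\tchat\tNOUN\t_\t_\t0\troot\t_\t_",
    "",
    "1\tOui\toui\tINTJ\t_\t_\t0\troot\t_\t_",
])
rows = conllu_to_rows(sample)
```

Writing "\n".join(rows) to a .txt file gives one token per line with a blank line between sentences.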

Source

The original datasets are available in the Universal Dependencies repository and on the official Universal Dependencies website. All rights reserved to Universal Dependencies.
