
The preprocessing pipeline used for training #8

Open
alexeyev opened this issue Nov 11, 2022 · 4 comments
@alexeyev

Dear colleague,

thank you for your work!

May I ask: what is the right way to use the lemmatizer/PoS tagger? Which pie tokenizer and other preprocessing steps should be used for the best quality?

Here's my minimal working example. Is this exactly the same pipeline you used at the training and evaluation stages?

# coding: utf-8

from pie.tagger import Tagger, simple_tokenizer
from pie.utils import model_spec

device, batch_size, model_file = "cpu", 4, "../models/lasla-plus-lemma.tar"
data = "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. " \
       "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

tagger = Tagger(device=device, batch_size=batch_size)

# Load the model and register the tasks it was trained on (e.g. lemma, PoS).
for model, tasks in model_spec(model_file):
    tagger.add_model(model, *tasks)

# Tokenize into sentences of tokens; the tagger also needs sentence lengths.
sents, lengths = [], []

for sentence in simple_tokenizer(data):
    sents.append(sentence)
    lengths.append(len(sentence))

tagged, tasks = tagger.tag(sents=sents, lengths=lengths)

print("Tagged:", tagged)
print("Tasks:", tasks)
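For context on what the tokenization step produces: a rough approximation of a naive sentence/word tokenizer in the spirit of `simple_tokenizer` (this is an illustration only, not pie's actual implementation) could look like:

```python
import re

def naive_tokenizer(text: str):
    """Yield one list of tokens per sentence; an approximation, NOT pie's code."""
    # Split into sentences after terminal punctuation.
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sent:
            # Separate words from punctuation marks.
            yield re.findall(r"\w+|[^\w\s]", sent)

sents = list(naive_tokenizer("Duis aute irure dolor. Excepteur sint occaecat."))
print(sents)
# [['Duis', 'aute', 'irure', 'dolor', '.'], ['Excepteur', 'sint', 'occaecat', '.']]
```

The key point is the shape of the data handed to `tagger.tag`: a list of token lists plus their lengths, whatever tokenizer produces them.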

Thank you in advance.

Best regards,
Anton.

@PonteIneptique
Owner

PonteIneptique commented Nov 11, 2022 via email

@PonteIneptique
Owner

If you just wish to tag, your best bet is https://github.com/hipster-philology/nlp-pie-taggers, where I introduced all of the preprocessing AND the post-processing (specifically for enclitics such as -que, -ve, -ne).
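To illustrate the kind of enclitic post-processing mentioned here: tokens ending in -que, -ve, or -ne can be split into the host word plus the enclitic so each unit is lemmatized separately. The sketch below is a hypothetical illustration of the idea, not the actual nlp-pie-taggers code; the exception list is an assumption and far from complete.

```python
ENCLITICS = ("que", "ne", "ve")
# Words ending in "que" etc. that are NOT host word + enclitic
# (a tiny, illustrative exception list; the real one would be larger).
EXCEPTIONS = {"quoque", "itaque", "atque", "neque", "namque"}

def split_enclitics(token: str):
    """Split a Latin token into [host, enclitic] where applicable."""
    low = token.lower()
    if low in EXCEPTIONS:
        return [token]
    for enc in ENCLITICS:
        # Require a host of at least two characters before the enclitic.
        if low.endswith(enc) and len(low) > len(enc) + 1:
            return [token[:-len(enc)], "-" + enc]
    return [token]

print(split_enclitics("virumque"))  # ['virum', '-que']
print(split_enclitics("itaque"))    # ['itaque'] (exception, kept whole)
print(split_enclitics("arma"))      # ['arma']
```

A naive rule like this over-splits words that merely end in these letters (e.g. "paene"), which is exactly why a curated exception list and post-processing layer are needed.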

Preprocessing (off the top of my head):

@alexeyev
Author

Hi, thank you for the swift response!

I'm afraid I am going to abuse your kindness once again and ask a few more questions a bit later -- after I take a closer look at nlp-pie-taggers. Thanks!

@PonteIneptique
Owner

Sure, feel free!
