
The preprocessing pipeline used for training #8

Open
alexeyev opened this issue Nov 11, 2022 · 4 comments
@alexeyev

Dear colleague,

thank you for your work!

May I ask: what is the right way to use the lemmatizer/PoS tagger? Which pie tokenizer and other preprocessing steps should be used for the best quality?

Here's my minimal working example. Is this exactly the same pipeline you used at the training and evaluation stages?

# coding: utf-8

from pie.tagger import Tagger, simple_tokenizer
from pie.utils import model_spec

device, batch_size, model_file = "cpu", 4, "../models/lasla-plus-lemma.tar"
data = "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. " \
       "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

tagger = Tagger(device=device, batch_size=batch_size)

# Load the model and register the tasks it was trained on (e.g. lemma, PoS).
for model, tasks in model_spec(model_file):
    tagger.add_model(model, *tasks)

# Tokenize into sentences of tokens; the tagger also needs sentence lengths.
sents, lengths = [], []

for sentence in simple_tokenizer(data):
    sents.append(sentence)
    lengths.append(len(sentence))

tagged, tasks = tagger.tag(sents=sents, lengths=lengths)

print("Tagged:", tagged)
print("Tasks:", tasks)
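For context on what the tokenization step produces: a rough approximation of a naive sentence/word tokenizer in the spirit of `simple_tokenizer` (this is an illustration only, not pie's actual implementation) could look like:

```python
import re

def naive_tokenizer(text: str):
    """Yield one list of tokens per sentence; an approximation, NOT pie's code."""
    # Split into sentences after terminal punctuation.
    for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
        if sent:
            # Separate words from punctuation marks.
            yield re.findall(r"\w+|[^\w\s]", sent)

sents = list(naive_tokenizer("Duis aute irure dolor. Excepteur sint occaecat."))
print(sents)
# [['Duis', 'aute', 'irure', 'dolor', '.'], ['Excepteur', 'sint', 'occaecat', '.']]
```

The key point is the shape of the data handed to `tagger.tag`: a list of token lists plus their lengths, whatever tokenizer produces them.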

Thank you in advance.

Best regards,
Anton.

@PonteIneptique
Owner

PonteIneptique commented Nov 11, 2022 via email

@PonteIneptique
Owner

If you just wish to tag, your best bet is https://github.com/hipster-philology/nlp-pie-taggers, where I introduced all of the preprocessing AND the post-processing (specifically for enclitics such as -que, -ve, -ne).
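To illustrate the kind of enclitic post-processing mentioned here: tokens ending in -que, -ve, or -ne can be split into the host word plus the enclitic so each unit is lemmatized separately. The sketch below is a hypothetical illustration of the idea, not the actual nlp-pie-taggers code; the exception list is an assumption and far from complete.

```python
ENCLITICS = ("que", "ne", "ve")
# Words ending in "que" etc. that are NOT host word + enclitic
# (a tiny, illustrative exception list; the real one would be larger).
EXCEPTIONS = {"quoque", "itaque", "atque", "neque", "namque"}

def split_enclitics(token: str):
    """Split a Latin token into [host, enclitic] where applicable."""
    low = token.lower()
    if low in EXCEPTIONS:
        return [token]
    for enc in ENCLITICS:
        # Require a host of at least two characters before the enclitic.
        if low.endswith(enc) and len(low) > len(enc) + 1:
            return [token[:-len(enc)], "-" + enc]
    return [token]

print(split_enclitics("virumque"))  # ['virum', '-que']
print(split_enclitics("itaque"))    # ['itaque'] (exception, kept whole)
print(split_enclitics("arma"))      # ['arma']
```

A naive rule like this over-splits words that merely end in these letters (e.g. "paene"), which is exactly why a curated exception list and post-processing layer are needed.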

Preprocessing (off the top of my head):

@alexeyev
Author

Hi, thank you for the swift response!

I'm afraid I am going to abuse your kindness once again and ask a few more questions a bit later -- after I take a closer look at nlp-pie-taggers. Thanks!

@PonteIneptique
Owner

Sure, feel free!
