Treatment of punctuation #19

Open
Stormur opened this issue Feb 20, 2019 · 1 comment
@Stormur
Member

Stormur commented Feb 20, 2019

I have noticed that punctuation marks other than the hyphen "-" are not analyzed by LEMLAT at all: they do not even show up as unknown wordforms in the unk file (where "-" lands). However, when feeding LEMLAT a list of wordforms, one per line, it would be better to be able to retrieve every one of them in one of the two output files.

Also, LEMLAT automatically splits a string wherever a ' appears, creating two wordforms that are then analyzed separately, and the inline output message does not mention this. There should be an option to change this behaviour and make LEMLAT analyze each token as it is. Since the same happens with ".", this is very relevant for the treatment of abbreviations, which are very often tokenized as "T." or "F." to distinguish them from occurrences of the isolated letters "T" or "F".

@gersh0m
Member

gersh0m commented Jun 10, 2019

The provided application is NOT intended to be in ANY WAY a text analyzer; it is rather a batch processor of word-forms. A quick-and-dirty tokenization step (based on separating at any non-word character and '-') is in fact performed, but only for convenience; a full lemmatization pipeline is always advisable!
Still, you are right: it should be possible to analyze a string "as it is".
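
To make the difference concrete, here is a minimal Python sketch of the two behaviours discussed in this thread. It is not LEMLAT code: `quick_split` and `as_is` are hypothetical names, and the regular expression is only an assumed approximation of "separation on any non-word character and '-'". In particular, it does not try to model the report that "-" ends up in the unk file.

```python
import re

# Hypothetical approximation of the convenience tokenization described above:
# split on runs of non-word characters (the class \W also covers '-', "'" and '.').
# This is an assumption for illustration, not LEMLAT's actual rule.
NON_WORD = re.compile(r"\W+")


def quick_split(line: str) -> list[str]:
    """Split one input line on non-word characters and drop empty pieces."""
    return [piece for piece in NON_WORD.split(line) if piece]


def as_is(line: str) -> list[str]:
    """Treat the whole line as a single token, stripping only outer whitespace."""
    token = line.strip()
    return [token] if token else []


if __name__ == "__main__":
    # "ab'cd" is a placeholder string, not a real wordform.
    for raw in ["T.", "ab'cd", ","]:
        print(repr(raw), quick_split(raw), as_is(raw))
```

Under these assumptions, the abbreviation "T." loses its period when split and a bare "," yields no token at all, which would explain why such input never reaches either output file, while the as-is mode hands every line to the analyzer unchanged.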

@gersh0m gersh0m self-assigned this Jun 10, 2019