I have noticed that punctuation marks, apart from the hyphen "-", are not analyzed by LEMLAT, not even as unknown wordforms in the unk file (which is where "-" ends up). However, when feeding LEMLAT a list of wordforms, one per line, it would be better to be able to retrieve all of them in either of the two output files.
Also, LEMLAT automatically splits a string wherever a "'" appears, creating two wordforms that are then analyzed separately, without this being mentioned in the inline output message. There should be an option to change this behaviour and make LEMLAT analyze each token as it is. Since the same happens with ".", this is very relevant for the treatment of abbreviations, which are often tokenized as "T." or "F." to distinguish them from occurrences of the isolated letters "T" or "F".
The provided application is NOT intended to be a text analyzer in ANY way; it is rather a batch processor of word-forms. A quick-and-dirty tokenization step (splitting on any non-word character other than "-") is performed, but only for convenience; a full lemmatization pipeline is always advisable.
Still, you are right: it should be possible to analyze a string "as it is".
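For illustration, here is a minimal sketch of the splitting behaviour described above. This is an assumption about how the tokenization works (a regex split on anything that is neither a word character nor a hyphen), not LEMLAT's actual code:

```python
import re

def split_wordforms(line: str):
    # Hypothetical reconstruction of the described behaviour:
    # break the input on any run of characters that are neither
    # word characters nor "-", discarding the separators themselves.
    return [t for t in re.split(r"[^\w-]+", line) if t]

print(split_wordforms("T. Livius"))     # ['T', 'Livius']   -> the "." is lost
print(split_wordforms("foro'que"))      # ['foro', 'que']   -> split at "'"
print(split_wordforms("bene-dictus"))   # ['bene-dictus']   -> hyphen preserved
```

An "as it is" mode would simply skip this split and pass each input line to the analyzer unchanged, so abbreviations such as "T." would be looked up (or reported in the unk file) as whole tokens.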