- Added more OSSRH sync data to the Maven POM
- Minor code cleanups
- Some API documentation added
- Updated dependencies
- Minimum Java version raised to 17
- Fixed group id in pom.xml
- Removed compile dependency on Maven Surefire
- Build artifacts in src/main/jflex are now ignored by git
- java.io's ByteArrayOutputStream used instead of 3rd-party class
- Bug fix: a single quotation mark at the beginning of a word is no longer interpreted as the beginning of an omission, but as a quotation mark token.
- Dependencies updated
- "du." is no longer treated as an abbreviation.
- "Dir." and "dir." are no longer treated as abbreviations.
- Contractions and clitics marked by an apostrophe or hyphen in English (I've, isn't, Peter's, …) and French (j'ai, d'un, l'art, sont-elles, …) are now separated.
- GitHub CI test workflow added
- Dependencies updated
- `-Xss2m` added to Maven JVM config
- `--sentence-boundaries|-s` now prints sentence boundaries only if `--positions|-p` is also present
- Dependencies updated
- Tokenizer and sentence splitter for English (`-l en` option) added
- Tokenizer and sentence splitter for French (`-l fr` option) added
- Support for adding more languages
- `UTF-8` input encoding is now expected by default; different encodings can be set by the `--encoding <enc>` option
- By default, tokens are now printed to stdout (use options `--no-tokens --positions` to print character offsets instead)
- Abbreviated German street names like Kunststr. are now recognized as tokens
- Added heuristics for distinguishing between I. as an abbreviation vs. PPER / CARD
- URLs without a URI scheme are now recognized as single tokens if they start with `www.`
- Standard EOT/EOF character `\x04` is used instead of the magic escape `\n\x03\n`
- Quoted email names containing space characters, like "John Doe"@xx.com, are no longer interpreted as single tokens
- Sentence splitter functionality added (`--sentence-boundaries` option)
- First version published on https://korap.ids-mannheim.de/gerrit/plugins/gitiles/KorAP/KorAP-Tokenizer
- Extracted from KorAP-internal ingestion pipeline