Dicio's multilingual number parsing and formatting library, inspired by Mycroft's lingua-franca

Stypox/dicio-numbers

Number parser and formatter for Dicio assistant

This multilingual library implements methods to extract numbers, dates or durations from text and to format numbers, dates or durations into human-readable words. It is inspired by Mycroft's lingua-franca, from which it borrows some resource files. Currently only Italian (it-it) and English (en-us) are supported, and methods to extract dates or durations are still unavailable (formatting them is already possible).

This repository is part of the Dicio project. Also check out dicio-android, dicio-sentences-compiler and dicio-skill. Open to contributions :-D

Adding a language

You will need to translate some resource files, which contain words but also regular expressions, and then adapt some Java code, so be prepared for that.

First of all you need to obtain the language-country pair for the language you want to add. This is important so that users of the library can choose which language to use. For example, English is en-us and Italian is it-it. Let's call this LANGUAGE_COUNTRY from now on.

Resources

Copy the whole folder numbers/src/main/resources/config/en-us into a new folder numbers/src/main/resources/config/LANGUAGE_COUNTRY. All of the resource files in the new folder should be translated into the new language. DO NOT rename any file, just edit their contents.

*.word files

The files named like ENGLISH_WORD.word should contain only one line with the lowercase translation of the word ENGLISH_WORD in the new language. For example, the English second.word contains "second", the Italian one contains "secondo" and the new language one should contain the translation of "second" into that language. These files are also present in Mycroft's lingua-franca, so copy them from there to save time! Then just check if everything is fine.

date_time.json file

This file contains the data needed to properly format dates, and in part also times, though that's also handled elsewhere. This file is also present in Mycroft's lingua-franca, so copy it from there to save time! Then just check if everything is fine.

Each of the formats provided in this file can (and shall) contain references to other already-formatted strings to be substituted, in the form {FORMATTED_STRING_NAME}. For example, English uses "{formatted_date} at {formatted_time}" as the way to format together both the date and the time.
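The substitution described above can be sketched as follows. This is only an illustration and not the library's actual code: the class PlaceholderSketch and the method substitute are hypothetical names, and replacing unknown references with an empty string is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dicio-numbers.
class PlaceholderSketch {
    // Matches references such as {formatted_date}
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Replaces every {NAME} in the format with its entry in values.
    // Assumption of this sketch: names without an entry become "".
    static String substitute(String format, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(format);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement = values.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in the value literal
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("formatted_date", "tuesday, may third");
        values.put("formatted_time", "one twenty two p.m.");
        System.out.println(substitute("{formatted_date} at {formatted_time}", values));
    }
}
```

In this sketch each already-formatted string is stored under its FORMATTED_STRING_NAME, so a format only needs to name the pieces it wants to combine.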

"weekday", "date", "month", "number"

These JSON objects should contain:

  • "weekday": a numbered list of how the 7 days in the week are pronounced, where Monday is at position 0
  • "date": a numbered list of how the at-most-31 days in the month are pronounced, where the first day is at position 1
  • "month": a numbered list of how the 12 months are pronounced, where January is at position 1
  • "number": a list of "NUMBER": "PRONUNCIATION" pairs. It should contain the pronunciation of all numbers that you want to use in "decade_format", "hundreds_format" and "thousand_format".

"decade_format", "hundreds_format", "thousand_format"

These JSON objects should contain a numbered list (starting from 1, but it doesn't matter) of (regex, format) pairs, along with a "default" format for when none of the regexes match. Each (regex, format) pair should be a JSON object:

  • "match" contains the JSON-escaped pattern to match, starting with ^ and ending with $. The pattern will be matched against the part of the year corresponding to "decade_format" (only last two digits), "hundreds_format" (last three digits) or "thousand_format" (last four digits).
  • "format" (and the "default" mentioned before) contains the format to apply. Refer to the table below for which substitutions you can do.

In these JSON objects you should put partial year formatting results that will finally be used by "year_format". It doesn't matter if for some language, instead of just formatting e.g. the thousands digit in "thousand_format", you also sometimes format the hundreds digit (this is what happens for English). The important thing is that, in the end, "year_format" produces correct results.
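A minimal sketch of how a list of (regex, format) pairs with a "default" could be evaluated is shown here. It is not the library's implementation: the class FormatRuleSketch, its methods and the three example rules are hypothetical, and trying the rules in their numbered order and stopping at the first match is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of dicio-numbers.
class FormatRuleSketch {
    // Made-up rules: keys play the role of "match", values the role of "format".
    // LinkedHashMap keeps the insertion order, standing in for the numbering.
    static Map<String, String> exampleRules() {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("^\\d$", "{x}");
        rules.put("^1\\d$", "{xx}");
        rules.put("^\\d0$", "{x0}");
        return rules;
    }

    // Returns the format of the first rule whose pattern matches the digits,
    // or the default format when no rule matches.
    static String chooseFormat(Map<String, String> rules, String defaultFormat, String digits) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (Pattern.matches(rule.getKey(), digits)) {
                return rule.getValue();
            }
        }
        return defaultFormat;
    }

    public static void main(String[] args) {
        String fallback = "{x0} {x}"; // plays the role of "default"
        for (String digits : new String[] {"9", "19", "60", "65"}) {
            System.out.println(digits + " -> " + chooseFormat(exampleRules(), fallback, digits));
        }
    }
}
```

With these made-up rules, "65" matches none of the patterns and therefore falls back to the default format.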

This is the table of possible FORMATTED_STRING_NAMEs you can use in "decade_format", "hundreds_format" or "thousand_format" (see NiceYearSubstitutionTableBuilder.java). The examples are in English and refer to the years "2019" and "3865 b.c.".

| FORMATTED_STRING_NAME | Explanation | e.g. "2019" | e.g. "-3865" |
|---|---|---|---|
| x | the last digit of the year | "nine" | "five" |
| xx | the last two digits of the year, if there is a corresponding entry in the "number" list | "nineteen" | "" (65 is not special and thus has no entry in the "number" list) |
| x0 | the tens of the year | "ten" | "sixty" |
| x_in_x0 | the tens digit of the year | "one" | "six" |
| xxx | the last three digits of the year, if there is a corresponding entry in the "number" list | "nineteen" (there is no hundreds digit) | "" (865 is not special and thus has no entry in the "number" list) |
| x00 | the hundreds of the year, if there is a corresponding entry in the "number" list | "zero" (probably doesn't make much sense) | "" (English pronounces hundreds by putting "hundred" after the unit, so there is no entry for 800 in the "number" list) |
| x_in_x00 and x_in_0x00 | the hundreds digit of the year | "zero" | "eight" |
| xx00 | the year with its tens and its units digits set to 0, if there is a corresponding entry in the "number" list | "" (2000 is not special and thus has no entry in the "number" list) | "" (3800 is not special and thus has no entry in the "number" list) |
| xx_in_xx00 | the year divided by 100, if there is a corresponding entry in the "number" list | "twenty" | "" (38 is not special and thus has no entry in the "number" list) |
| x000 | the thousands of the year, if there is a corresponding entry in the "number" list | "" (English pronounces thousands by putting "thousand" after the unit, so there is no entry for 2000 in the "number" list) | "" (same reason, for 3000) |
| x_in_x000 | the thousands digit of the year | "two" | "three" |
| x0_in_x000 | the thousands digit of the year, multiplied by 10 | "twenty" | "thirty" |
| number | the non-formatted part of the year corresponding to "decade_format" (only last two digits), "hundreds_format" (last three digits) or "thousand_format" (last four digits), to be used for "default" as a fallback | "19", "19" or "2019" | "65", "865" or "3865" |
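To make the arithmetic behind these names concrete, the sketch below computes the numeric value each name refers to. It follows the explanations in the table, not the actual NiceYearSubstitutionTableBuilder.java; the class YearPartsSketch is a hypothetical name, and the sketch assumes years with at most four digits. Turning each value into words is then a lookup in the "number" list, which yields an empty string when the list has no entry for it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of dicio-numbers.
class YearPartsSketch {
    // Numeric value behind each FORMATTED_STRING_NAME, before any lookup
    // in the "number" list. Assumes |year| has at most four digits.
    static Map<String, Integer> parts(int year) {
        int y = Math.abs(year); // "b.c." is handled separately via "bc"
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("x", y % 10);                  // last digit
        p.put("xx", y % 100);                // last two digits
        p.put("x0", y % 100 / 10 * 10);      // tens
        p.put("x_in_x0", y % 100 / 10);      // tens digit
        p.put("xxx", y % 1000);              // last three digits
        p.put("x00", y % 1000 / 100 * 100);  // hundreds
        p.put("x_in_x00", y % 1000 / 100);   // hundreds digit
        p.put("xx00", y / 100 * 100);        // tens and units set to 0
        p.put("xx_in_xx00", y / 100);        // year divided by 100
        p.put("x000", y / 1000 * 1000);      // thousands
        p.put("x_in_x000", y / 1000);        // thousands digit
        p.put("x0_in_x000", y / 1000 * 10);  // thousands digit times 10
        return p;
    }

    public static void main(String[] args) {
        System.out.println(parts(2019));
        System.out.println(parts(-3865));
    }
}
```

For 2019, for instance, x0 is 10 ("ten") and xx_in_xx00 is 20 ("twenty"), matching the examples in the table.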

"year_format"

This JSON object follows the same structure as "decade_format", "hundreds_format", "thousand_format", but there is also a "bc" field that should contain the translation of the shortened "Before Christ" ("b.c."). In this JSON object you should put how to fully format a number as a year, using the formatted strings already calculated using "decade_format", "hundreds_format" and "thousand_format".

The formats have at their disposal the full table from above plus the following items (which are the ones that should actually be used).

| FORMATTED_STRING_NAME | Explanation | e.g. "2019" | e.g. "-3865" |
|---|---|---|---|
| formatted_decade | the decade formatted using "decade_format" | "nineteen" | "sixty five" |
| formatted_hundreds | the hundreds formatted using "hundreds_format" | "zero hundred" (yeah, it doesn't make sense) | "eight hundred" |
| formatted_thousand | the thousands formatted using "thousand_format" | "twenty" | "thirty eight" |
| bc | the translation of "b.c." if the year is before Christ, otherwise an empty string | "" | "b.c." |
| number | the non-formatted full-year, to be used for "default" as a fallback | "2019" | "3865" |

"date_format"

This JSON object should contain a format in these fields: "date_full", "date_full_no_year", "date_full_no_year_month"; and a translation of the field name in these fields: "today", "tomorrow", "yesterday".

The formats have at their disposal this limited table.

| FORMATTED_STRING_NAME | Explanation | e.g. "Tuesday, 2022/05/03" |
|---|---|---|
| day | the name of the day in the month | "third" |
| weekday | the name of the day in the week | "tuesday" |
| month | the name of the month | "may" |
| formatted_year | the year fully formatted using "year_format" | "twenty twenty two" |

"date_time_format"

This JSON object should contain a format in its only field: "date_time".

The format has at its disposal this limited table.

| FORMATTED_STRING_NAME | Explanation | e.g. "Tuesday, 2022/05/03 13:22" |
|---|---|---|
| formatted_date | the date fully formatted using "date_format" | "tuesday, may third, twenty twenty two" |
| formatted_time | the time formatted using the Java method niceTime | "one twenty two p.m." |

tokenizer.json

This JSON file contains the information the tokenizer uses to generate the token stream corresponding to an input string.

  • "spaces": a string containing all characters that should be considered spaces. This usually does not need to be translated.
  • "characters_as_word": a string containing all characters that should be considered as single-character words, and not as letters possibly part of a bigger word. An example for this is %, since the percent sign has a meaning on its own.
  • "compound_word_piece_category": the category that you will give in "number_mappings" to words that can be part of a compound word. A compound word is a word made of other words all connected together, and the tokenizer will split such big words using only words that can actually be part of compound words. While the value of this could be arbitrary, given that you then use the same value in "number_mappings", there is no reason to set this to something different than "compound_word_piece", which is what is used in Italian. If the language you want to add has no compound words, like English, just don't add this field.
  • "raw_number_categories": an array of all the categories that the tokenizer should give to raw numbers when it encounters them. A raw number is e.g. 2384, -392, ...
  • "plural_endings": an array of the endings that the tokenizer should try to trim off a word if it can't recognize it otherwise. For example, in English "tens" would match with "ten" because there is "s" in the "plural_endings" array.
  • "word_matches": an array of JSON objects of this form:
    • "categories": an array of the categories that should be assigned to the words that match any of the values
    • "values": an array of the words that should be assigned all of the categories
  • "number_mappings": an array of JSON objects of this form:
    • "categories": an array of the categories that should be assigned to the words that match any of the values
    • "values": a JSON object that pairs each word that should be assigned all of the categories with its corresponding (integer or decimal) numerical value
  • "duration_words": a JSON object used to pair words with their corresponding duration. The keys in the object should be the durations (formatted as "number UNIT", where number is an integer and UNIT (uppercase) is one of NANOS, MICROS, MILLIS, SECONDS, MINUTES, HOURS, HALF_DAYS, DAYS, WEEKS, MONTHS, YEARS, DECADES, CENTURIES, MILLENNIA, ERAS), and the values are JSON arrays of words.
  • "duration_restrict_after_number": a list of all of the words present in "duration_words" that should not be recognized as a duration if they don't come with a number before them in the input. For example, "hello ms" should not be interpreted as "hello (1 millisecond)", while "hello millisecond" and "hello 1 ms" should.
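The unit names accepted in the "duration_words" keys coincide with constants of java.time.temporal.ChronoUnit, so a key such as "1 MILLIS" could be interpreted as sketched below. Whether the library parses keys exactly this way is an assumption; the class DurationKeySketch and the method parseKey are hypothetical names.

```java
import java.time.Duration;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not part of dicio-numbers.
class DurationKeySketch {
    // Parses a key of the form "number UNIT", e.g. "1 MILLIS" or "2 HOURS".
    static Duration parseKey(String key) {
        String[] pieces = key.trim().split("\\s+");
        if (pieces.length != 2) {
            throw new IllegalArgumentException("Expected \"number UNIT\", got: " + key);
        }
        long amount = Long.parseLong(pieces[0]);
        // valueOf throws IllegalArgumentException for unknown unit names
        ChronoUnit unit = ChronoUnit.valueOf(pieces[1]);
        // For date-based units (MONTHS, YEARS, ...) getDuration() is an estimate
        return unit.getDuration().multipliedBy(amount);
    }

    public static void main(String[] args) {
        System.out.println(parseKey("1 MILLIS"));
        System.out.println(parseKey("2 HOURS"));
        System.out.println(parseKey("1 WEEKS"));
    }
}
```

Note that the unit must be uppercase, as the description of "duration_words" requires; in this sketch a lowercase unit would be rejected by ChronoUnit.valueOf.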

Test resources

Copy the whole folder numbers/src/test/resources/config/en-us into a new folder numbers/src/test/resources/config/LANGUAGE_COUNTRY. All of the resource files in the new folder are used for testing purposes and should be translated into the new language. DO NOT rename any file, just edit their contents.

date_time_test.json

This file contains some tests for the date_time.json file. This file is also present in Mycroft's lingua-franca, so copy it from there to save time! Then just check if everything is fine.

Each of the JSON objects described below contains a numbered list of tests to run.

"test_nice_year"

These tests are for "year_format". Each test has:

  • "datetime_param" the datetime in this form: year, month, day, hour, minute, second
  • "bc" whether the datetime is before Christ (True) or not (False)
  • "assertEqual" what the provided datetime should be formatted as
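As an illustration of the "datetime_param" form, the sketch below turns such a value into a java.time.LocalDateTime. It assumes the value is a single comma-separated string of six integers and ignores the "bc" field; the class DatetimeParamSketch and the method parse are hypothetical names, not the code the tests actually use.

```java
import java.time.LocalDateTime;

// Hypothetical helper, not part of dicio-numbers.
class DatetimeParamSketch {
    // Parses "year, month, day, hour, minute, second" into a LocalDateTime.
    static LocalDateTime parse(String datetimeParam) {
        String[] fields = datetimeParam.split(",");
        if (fields.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields, got: " + datetimeParam);
        }
        int[] v = new int[6];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Integer.parseInt(fields[i].trim());
        }
        return LocalDateTime.of(v[0], v[1], v[2], v[3], v[4], v[5]);
    }

    public static void main(String[] args) {
        LocalDateTime parsed = parse("2022, 5, 3, 13, 22, 0");
        System.out.println(parsed + " is a " + parsed.getDayOfWeek());
    }
}
```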

"test_nice_date"

These tests are for "date_format". Each test has:

  • "datetime_param" the datetime in this form: year, month, day, hour, minute, second
  • "now" the datetime corresponding to the "now" time against which relative dates are tested (e.g. if datetime_param equals now, then "today" is output instead of the full date), in the same form, or None if "now" is unknown or unwanted
  • "assertEqual" what the provided datetime should be formatted as

"test_nice_date_time"

These tests are for "date_time_format". Each test has:

  • "datetime_param" the datetime in this form: year, month, day, hour, minute, second
  • "now" the datetime corresponding to the "now" time against which relative dates are tested (e.g. if datetime_param equals now, then "today" is output instead of the full date), in the same form, or None if "now" is unknown or unwanted
  • "use_24hour" whether to use the 24-hour format (True) or the 12-hour one (False) (parameter passed to the Java method niceTime)
  • "use_ampm" whether to show AM/PM (True) or not (False) (parameter passed to the Java method niceTime)
  • "assertEqual" what the provided datetime should be formatted as