This script generates a Unicode character lookup table with linear access time. It creates a header and a source file and compiles them into a static library usable from C and C++. The source data comes from the files UnicodeData.txt and SpecialCasing.txt, which are available at https://www.unicode.org/Public/UNIDATA/. Unicode version 15.0.0 is currently used, but the files can be replaced with newer versions.
The only function is UTLookupGlyph, which looks up a single character by its Unicode value. It returns a pointer to a UTInfo struct containing the character information. It always returns a valid pointer, even for invalid characters; in that case, the field category has the value UT_CATEGORY_INVALID.
UTInfo contains the following fields:
typedef struct {
    uint32_t flags;    ///< Combination of UTFlag.
    uint32_t category; ///< One of UTCategory.
    int32_t cases[3];  ///< Distance to case variant. Indexable with UTCase.
    union {
        int64_t num;      ///< Number value if `flags & UT_FLAG_NUMBER`.
        char const* frac; ///< Fraction string if `flags & UT_FLAG_FRACTION`.
    };
} UTInfo;
category contains one of the character categories listed in UTCategory.
flags contains a combination of the flags listed in UTFlag.
cases contains values to be added to the character value in order to convert it to the desired case variant (UT_CASE_UPPER, UT_CASE_LOWER or UT_CASE_TITLE). The field is indexable with UTCase. If a cases entry is 0, that case variant either does not exist or is the character itself.
If one of the flags UT_FLAG_UPPER_EXPANDS, UT_FLAG_LOWER_EXPANDS or UT_FLAG_TITLE_EXPANDS is set in flags, the character expands to multiple characters when converted to that case. For example, the lowercase letter "ß" (0x00DF; LATIN SMALL LETTER SHARP S) expands to the 2 uppercase letters "SS" (0x0053 0x0053; LATIN CAPITAL LETTER S). The corresponding cases entry then contains an index into the array UTSpecialCases instead of a distance. The element at that index holds the number of characters in the expanded sequence, and the elements that follow hold the sequence's character values (see example below).
num and frac contain the numeric values of digits, number-like and fraction characters. For example, the Roman numeral "Ⅶ" (0x2166; ROMAN NUMERAL SEVEN) has the value 7 in num. Fractions are represented by strings that contain the numerator and denominator separated by / ("n/d"). For example, the fraction character "¼" (0x00BC; VULGAR FRACTION ONE QUARTER) has the value "1/4" in frac.
Looking up a character with its Unicode value:
// character `Đ` (0x0110; LATIN CAPITAL LETTER D WITH STROKE)
UTGlyph glyph = 0x0110;
UTInfo const* info = UTLookupGlyph(glyph);
// get lowercase variant `đ` (0x0111; LATIN SMALL LETTER D WITH STROKE)
UTGlyph lower = glyph + info->cases[UT_CASE_LOWER];
// prints "Lowercase variant of 0x0110: 0x0111"
printf("Lowercase variant of 0x%04X: 0x%04X\n", glyph, lower);
Getting the integer or fraction value that a character represents:
UTGlyph glyph;
UTInfo const* info;
// character `Ⅶ` (0x2166; ROMAN NUMERAL SEVEN)
glyph = 0x2166;
info = UTLookupGlyph(glyph);
// check if character is a number
if (info->flags & UT_FLAG_NUMBER) {
    // prints "Integer value of 2166: 7"
    // (cast because `num` is an int64_t, which is not always `long long`)
    printf("Integer value of %04X: %lld\n", glyph, (long long)info->num);
}
// character `¼` (0x00BC; VULGAR FRACTION ONE QUARTER)
glyph = 0x00BC;
info = UTLookupGlyph(glyph);
// check if character is a fraction
if (info->flags & UT_FLAG_FRACTION) {
    // prints "String representation of 0x00BC: 1/4"
    printf("String representation of 0x%04X: %s\n", glyph, info->frac);
}
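Since frac is a string in the form "n/d", callers that need a numeric value have to split it themselves. The following is a minimal sketch of one way to do that; parse_fraction and fraction_to_double are hypothetical helpers written for this illustration and are not part of the generated library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the library): splits a fraction string
 * such as "1/4" into numerator and denominator.
 * Returns 1 on success, 0 if the string is malformed or the denominator is 0. */
int parse_fraction(char const* frac, long* numerator, long* denominator)
{
    assert(numerator != NULL && denominator != NULL);
    if (frac == NULL) {
        return 0;
    }
    char trailing;
    /* Exactly two fields must match; a third match means trailing garbage. */
    if (sscanf(frac, "%ld/%ld%c", numerator, denominator, &trailing) != 2) {
        return 0;
    }
    return *denominator != 0;
}

/* Hypothetical helper: converts a fraction string to a double.
 * Returns 0.0 if the string cannot be parsed. */
double fraction_to_double(char const* frac)
{
    long n, d;
    return parse_fraction(frac, &n, &d) ? (double)n / (double)d : 0.0;
}
```

With the library linked in, such a helper would typically be called as fraction_to_double(info->frac) after checking that UT_FLAG_FRACTION is set in info->flags.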
Handling characters whose case conversion expands to multiple characters:
// character `ß` (0x00DF; LATIN SMALL LETTER SHARP S)
UTGlyph glyph = 0x00DF;
UTInfo const* info = UTLookupGlyph(glyph);
// check if expansion occurs to prevent invalid index
if (info->flags & UT_FLAG_UPPER_EXPANDS) {
    // sequence index for uppercase variant
    int idx = info->cases[UT_CASE_UPPER];
    int length = UTSpecialCases[idx];
    // character sequence
    UTGlyph const* sequence = &UTSpecialCases[idx + 1];
    // prints "0x00DF expands to 2 chars in uppercase"
    printf("0x%04X expands to %d chars in uppercase\n", glyph, length);
    // uppercase characters
    // prints:
    // "0: 0x0053"
    // "1: 0x0053"
    for (int i = 0; i < length; i++) {
        printf("%d: 0x%04X\n", i, sequence[i]);
    }
}
else {
    printf("Character %04X does not expand\n", glyph);
}
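The two ways of reading a cases entry (as a distance, or as an index into the expansion array) can be combined into a single conversion routine. The sketch below is self-contained so that it compiles without the generated header: special_cases is a stand-in for UTSpecialCases that assumes the layout described above (a length followed by that many character values), and convert_case is a hypothetical helper, not a library function. The index 0 used for "ß" is chosen for the illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the generated UTSpecialCases array (assumed layout:
 * a length, followed by that many character values). */
uint32_t const special_cases[] = { 2, 0x0053, 0x0053 };

/* Hypothetical helper: writes the case variant of `glyph` to `out`.
 * `delta`   is the value taken from info->cases[...],
 * `expands` is nonzero if the matching UT_FLAG_*_EXPANDS flag is set,
 * `table`   is the expansion array (only read if `expands` is nonzero).
 * Returns the number of characters written, or 0 if `cap` is too small. */
size_t convert_case(uint32_t glyph, int32_t delta, int expands,
                    uint32_t const* table, uint32_t* out, size_t cap)
{
    assert(out != NULL);
    if (!expands) {
        /* simple case: `delta` is a distance to the case variant */
        if (cap < 1) {
            return 0;
        }
        out[0] = (uint32_t)((int64_t)glyph + delta);
        return 1;
    }
    /* expanding case: `delta` is an index into the expansion array */
    assert(table != NULL && delta >= 0);
    size_t length = table[delta];
    if (length > cap) {
        return 0;
    }
    for (size_t i = 0; i < length; i++) {
        out[i] = table[(size_t)delta + 1 + i];
    }
    return length;
}
```

Checking the expansion flag before interpreting the value matters because an entry of 0 means "no variant" in the simple case, but could be a valid index in the expanding case.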
Run ./autogen.sh to generate the build system. Then run ./configure and make to build libunicodetable.a in the src directory:
./autogen.sh
./configure
make
configure accepts the following options:
--enable-symbol-prefix=bla changes the prefix for library symbols to bla. The default is UT.
--enable-snake-case, --disable-snake-case enables or disables snake-case symbol names (for example, bla_lookup_glyph instead of blaLookupGlyph). The default is disable.
--enable-categories=Lu,Ll,Lt,Lm,Lo,Mn,Mc,Me,Nd,Nl,No,Pc,Pd,Ps,Pe,Pi,Pf,Po,Sm,Sc,Sk,So,Zs,Zl,Zp,Cc,Cf,Cs,Co,Cn sets the Unicode character categories to be included in the table. This can reduce the table size. All other characters will have their category set to UT_CATEGORY_OTHER_NOT_ASSIGNED. If omitted, all categories are included.
--enable-include-info=flags,categories,casing,numbers sets the character information to be included in the table. If omitted, all available information is included.
--enable-strict-level=0 sets the strict level, which excludes certain characters considered unsafe, for example surrogates. The default is 0.
0: do not exclude any characters
1: define surrogates as invalid
./configure \
--enable-symbol-prefix=bla \
--enable-categories=Lu,Ll,Lt,Lm,Lo \
--enable-snake-case \
--enable-strict-level=1
This project is designed to be used as an Automake subproject. To match your project's namespace, export the following environment variables in the main project's configure.ac:
...
# put before `AC_CONFIG_SUBDIRS`
# use symbol prefix
# omit for default prefix
export UT_SYMBOL_PREFIX="myprefix"
# use snake case symbols
export UT_SNAKE_CASE=1
# define character categories to include
# omit to include all categories
export UT_CATEGORIES=Lu,Ll,Lt
# define character information to include
# omit to include all available information
export UT_INCLUDE_INFO=flags,categories
# exclude additional characters, for example, surrogates
# omit for using default (0)
# 0: do not exclude any characters
# 1: define surrogates as invalid
export UT_STRICT_LEVEL=0
AC_CONFIG_SUBDIRS([unicode-table])
...
The source templates unicode-table.h.in and unicode-table.c.in, located in src, contain the structure and data placeholders for the header and source file. They can be edited if needed.