A lightweight implementation of the Unicode Text Segmentation (UAX #29)
- Verified spec-compliance: It uses up-to-date Unicode data, passes the official Unicode test suites, and verifies full compliance with the Intl.Segmenter API via additional property-based testing, maintaining 100% coverage.
- Excellent compatibility: It works well on older browsers, edge runtimes, and React Native (Hermes).
- Zero-dependencies: It doesn't bloat node_modules or the network tab. Just a small, minimal snippet.
- Small bundle size: It effectively compresses Unicode data and provides a tree-shakeable format, eliminating unused code.
- Extremely efficient: It's carefully optimized for performance, making it the fastest one in the ecosystem, outperforming even the built-in Intl.Segmenter.
- TypeScript: It's fully type-checked, and provides type definitions and JSDoc.
- ESM-first: It natively supports ES Modules, and still supports CommonJS.
Note
unicode-segmenter is now an e18e recommendation!
Unicode® 16.0.0
Unicode® Standard Annex #29 - Revision 45 (2024-08-28)
There are several entry points for text segmentation.
- unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
- unicode-segmenter/intl-adapter: Intl.Segmenter adapter
- unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill
And extra utilities for combined use cases.
- unicode-segmenter/emoji: Matches single codepoint emojis
- unicode-segmenter/general: Matches single codepoint alphanumerics
- unicode-segmenter/utils: Handles UTF-16 codepoints
Utilities for text segmentation by extended grapheme cluster rules.
import { graphemeSegments } from 'unicode-segmenter/grapheme';
[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
import { splitGraphemes } from 'unicode-segmenter/grapheme';
[...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
// 0: #️⃣
// 1: *️⃣
// 2: 0️⃣
// 3: 1️⃣
// 4: 2️⃣
import { countGraphemes } from 'unicode-segmenter/grapheme';
'👋 안녕!'.length;
// => 6
countGraphemes('👋 안녕!');
// => 5
'a̐éö̲'.length;
// => 7
countGraphemes('a̐éö̲');
// => 3
Note
countGraphemes() is a small wrapper around graphemeSegments().
If you need the count more than once for the same input, consider memoization, or call graphemeSegments() or splitGraphemes() once and reuse the result instead.
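The memoization suggested above could look like the following sketch. memoizeCount is a hypothetical helper, not part of unicode-segmenter; it wraps any counting function and caches the result per input string. The built-in Intl.Segmenter stands in for the counter here only so that the snippet runs without installing anything; in practice you would presumably pass countGraphemes instead.

```javascript
// Hypothetical helper (not part of unicode-segmenter): caches counts per input string.
function memoizeCount(countFn) {
  const cache = new Map();
  return (input) => {
    let count = cache.get(input);
    if (count === undefined) {
      count = countFn(input);
      cache.set(input, count);
    }
    return count;
  };
}

// Stand-in counter built on the built-in Intl.Segmenter (an assumption for this sketch).
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
let calls = 0;
function countWithIntl(input) {
  calls += 1;
  let count = 0;
  for (const _ of segmenter.segment(input)) count += 1;
  return count;
}

const cachedCount = memoizeCount(countWithIntl);
const first = cachedCount('a\r\nb'); // segments the string
const second = cachedCount('a\r\nb'); // served from the cache
```

Note that the cache in this sketch grows without bound; a long-running process would likely want to cap its size or scope it to a single operation.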
graphemeSegments() also exposes some information gathered during segmentation to support additional use cases.
For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer which boundary rule was applied.
import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';
function* matchEmoji(str) {
  for (const { segment, _catBegin } of graphemeSegments(str)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}
[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍
Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)
import { Segmenter } from 'unicode-segmenter/intl-adapter';
// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)
// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';
const segmenter = new Intl.Segmenter();
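Because both the adapter and the polyfill follow the standard Intl.Segmenter interface, code written against that interface should work with either one. The sketch below uses the built-in Intl.Segmenter so that it runs without the library; the sample input and variable names are illustrative only.

```javascript
// Standard Intl.Segmenter usage with grapheme granularity.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// 'e' + combining acute accent, then CRLF, then '!'
const input = 'e\u0301\r\n!';

// segment() returns an iterable of { segment, index, input } records.
const segments = Array.from(segmenter.segment(input), ({ segment, index }) => ({
  segment,
  index,
}));

for (const { segment, index } of segments) {
  console.log(index, JSON.stringify(segment));
}
```

The combining sequence and the CRLF pair each come back as a single segment, so the three segments start at UTF-16 indexes 0, 2, and 4.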
Utilities for matching emoji-like characters.
import {
isEmojiPresentation, // match \p{Emoji_Presentation}
isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';
isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false
isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true
Utilities for matching alphanumeric characters.
import {
isLetter, // match \p{L}
isNumeric, // match \p{N}
isAlphabetic, // match \p{Alphabetic}
isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
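The comments above name the Unicode property each predicate corresponds to. As a way to cross-check results, the same properties can be tested with the regex property escapes built into JavaScript. This is only a sketch: the matches helper and the constant names are hypothetical, and results for some code points may differ depending on the Unicode version the runtime ships with.

```javascript
// Built-in Unicode property escapes (require the `u` flag).
const LETTER = /^\p{L}$/u;
const NUMERIC = /^\p{N}$/u;
const ALPHABETIC = /^\p{Alphabetic}$/u;

// Hypothetical helper: tests a single code point against a pattern.
const matches = (pattern, cp) => pattern.test(String.fromCodePoint(cp));

const cpA = 'A'.codePointAt(0);
const cp1 = '1'.codePointAt(0);
const cpRoman = 0x2163; // ROMAN NUMERAL FOUR, a letter-like number (category Nl)

const results = {
  letterA: matches(LETTER, cpA),
  numeric1: matches(NUMERIC, cp1),
  letter1: matches(LETTER, cp1),
  romanIsLetter: matches(LETTER, cpRoman),
  romanIsNumeric: matches(NUMERIC, cpRoman),
  romanIsAlphabetic: matches(ALPHABETIC, cpRoman),
};
console.log(results);
```

The Roman numeral illustrates why \p{L} and \p{Alphabetic} are listed separately: it is a number rather than a letter, yet it still carries the Alphabetic property.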
You can access some internal utilities to deal with JavaScript strings.
import {
isHighSurrogate,
isLowSurrogate,
surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';
const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);
if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
const codePoint = surrogatePairToCodePoint(hi, lo);
// => equivalent to u32.codePointAt(0)
}
import { isBMP } from 'unicode-segmenter/utils';
const char = '😍'; // .length = 2
const cp = char.codePointAt(0);
// Parentheses are required: `===` binds tighter than the conditional operator.
char.length === (isBMP(cp) ? 1 : 2);
// => true
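For reference, the arithmetic behind these helpers is defined by UTF-16 itself. The following is a minimal sketch of that arithmetic; the function names are hypothetical stand-ins, and this is not necessarily how unicode-segmenter implements its utilities.

```javascript
// Hypothetical stand-ins illustrating UTF-16 surrogate arithmetic.
const isHigh = (cu) => cu >= 0xd800 && cu <= 0xdbff; // high (lead) surrogate range
const isLow = (cu) => cu >= 0xdc00 && cu <= 0xdfff; // low (trail) surrogate range
const inBmp = (cp) => cp <= 0xffff; // Basic Multilingual Plane fits in one code unit

// Each surrogate carries 10 bits of the code point, offset by 0x10000.
const toCodePoint = (hi, lo) => ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;

const text = '\u{1F60D}'; // one code point outside the BMP, two UTF-16 code units
const hi = text.charCodeAt(0);
const lo = text.charCodeAt(1);
const codePoint = isHigh(hi) && isLow(lo) ? toCodePoint(hi, lo) : hi;

console.log(codePoint.toString(16), inBmp(codePoint));
```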
unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.
To ensure compatibility, the runtime should support:
If the runtime doesn't support these features, they can easily be polyfilled or transpiled with tools like Babel.
Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.
unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.
unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec-compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries.
- graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
- grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
- @formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
- WebAssembly build of unicode-segmentation@1.12.0 with minimum bindings
- Built-in Intl.Segmenter API
Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
---|---|---|---|---|---|---|
unicode-segmenter/grapheme | 16.0.0 | ✔️ | 17,313 | 12,783 | 5,285 | 3,946 |
graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
grapheme-splitter | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
@formatjs/intl-segmenter * | 15.0.0 | ✔️ | 491,043 | 318,721 | 54,248 | 34,380 |
unicode-segmentation * | 16.0.0 | ✔️ | 56,529 | 52,443 | 24,110 | 17,343 |
Intl.Segmenter * | - | - | 0 | 0 | 0 | 0 |
- @formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
- unicode-segmentation size contains only the minimum WASM binary and the bindings needed to run the benchmark. It will increase if more features are exposed.
- Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
- Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.
Name | Bytecode size | Bytecode size (gzip)* |
---|---|---|
unicode-segmenter/grapheme | 24,386 | 12,690 |
graphemer | 133,949 | 31,710 |
grapheme-splitter | 63,810 | 19,125 |
- It would be compressed when included as an app asset.
Here is a brief explanation, and you can see archived benchmark results.
Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.
- 6~15x faster than other JavaScript libraries
- 1.5~3x faster than the WASM binding of Rust's unicode-segmentation
- 1.5~3x faster than the built-in Intl.Segmenter
Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.
Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.
Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 3~8x faster than graphemer and 20~26x faster than grapheme-splitter, with the performance gap increasing with input size.
Performance in QuickJS: unicode-segmenter/grapheme is the only usable library in terms of performance.
Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build your own benchmark.
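If you build your own benchmark, a very small timing harness is often enough for a first impression. The sketch below is an assumption-laden starting point, not the project's benchmark suite: measure is a hypothetical helper, the sample input is arbitrary, and the built-in Intl.Segmenter is measured only so that the snippet runs without installing anything. You would add the library functions you want to compare as further candidates.

```javascript
// Hypothetical micro-benchmark helper: average milliseconds per call.
function measure(fn, input, iterations = 1000) {
  for (let i = 0; i < 100; i++) fn(input); // warm-up so the JIT can optimize
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  return (performance.now() - start) / iterations;
}

// Candidate under test: grapheme counting with the built-in Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const countWithIntl = (text) => {
  let n = 0;
  for (const _ of segmenter.segment(text)) n += 1;
  return n;
};

const sample = 'Hello, world! '.repeat(50); // arbitrary ASCII sample
const msPerCall = measure(countWithIntl, sample);
console.log(`Intl.Segmenter: ${msPerCall.toFixed(4)} ms per call`);
```

Results from a loop like this are noisy; representative inputs (emoji, combining marks, long text) and several repeated runs give a more trustworthy picture.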
- The Rust Unicode team (@unicode-rs): The initial implementation was ported manually from the unicode-segmentation library.
- Marijn Haverbeke (@marijnh): His library inspired a technique that can greatly compress the Unicode data table.