cometkim/unicode-segmenter


unicode-segmenter


A lightweight implementation of the Unicode Text Segmentation (UAX #29)

  • Verified spec compliance: Uses up-to-date Unicode data, passes the official Unicode test suites, and is checked against the Intl.Segmenter API with additional property-based tests, all while maintaining 100% coverage.

  • Excellent compatibility: It works well on older browsers, edge runtimes, and React Native (Hermes).

  • Zero dependencies: It doesn't bloat node_modules or the network tab. Just a small, minimal snippet.

  • Small bundle size: It compresses Unicode data effectively and ships in a tree-shakeable format, so unused code is eliminated.

  • Extremely efficient: It's carefully optimized for performance, making it the fastest in the ecosystem and outperforming even the built-in Intl.Segmenter.

  • TypeScript: It's fully type-checked, and provides type definitions and JSDoc.

  • ESM-first: It natively supports ES Modules, and still supports CommonJS.

Note

unicode-segmenter is now an e18e recommendation!

Unicode® Version

Unicode® 16.0.0

Unicode® Standard Annex #29 - Revision 45 (2024-08-28)

APIs

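There are several entry points for text segmentation:

  • unicode-segmenter/grapheme
  • unicode-segmenter/intl-adapter
  • unicode-segmenter/intl-polyfill

And extra utilities for combined use cases:

  • unicode-segmenter/emoji
  • unicode-segmenter/general
  • unicode-segmenter/utils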

Export unicode-segmenter/grapheme

Utilities for text segmentation by extended grapheme cluster rules.

Example: Get grapheme segments

import { graphemeSegments } from 'unicode-segmenter/grapheme';

[...graphemeSegments('a̐éö̲\r\n')];
// 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
// 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
// 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
// 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Split graphemes

import { splitGraphemes } from 'unicode-segmenter/grapheme';

[...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
// 0: #️⃣
// 1: *️⃣
// 2: 0️⃣
// 3: 1️⃣
// 4: 2️⃣

Example: Count graphemes

import { countGraphemes } from 'unicode-segmenter/grapheme';

'👋 안녕!'.length;
// => 6
countGraphemes('👋 안녕!');
// => 5

'a̐éö̲'.length;
// => 7
countGraphemes('a̐éö̲');
// => 3

Note

countGraphemes() is a small wrapper around graphemeSegments().

If you need the count more than once for the same input, consider memoizing it, or call graphemeSegments() or splitGraphemes() once and reuse the result.
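As a rough illustration of the memoization idea in the note above, here is a minimal sketch. `memoizeCounter` is a hypothetical helper, not part of unicode-segmenter, and the stand-in counter uses the built-in Intl.Segmenter only so that the snippet runs without installing anything; in real code you would pass `countGraphemes` instead.

```javascript
// Hypothetical helper (not part of the library): wraps any
// string -> number counter with a small Map-based cache.
function memoizeCounter(countFn, maxEntries = 1000) {
  const cache = new Map();
  return (text) => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;
    const count = countFn(text);
    // A Map iterates in insertion order, so the first key is the oldest entry.
    if (cache.size >= maxEntries) cache.delete(cache.keys().next().value);
    cache.set(text, count);
    return count;
  };
}

// Stand-in counter (assumption): replace with `countGraphemes`
// imported from 'unicode-segmenter/grapheme' in real code.
const standInSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const standInCount = (text) => [...standInSegmenter.segment(text)].length;

const cachedCount = memoizeCounter(standInCount);
cachedCount('👋 안녕!'); // segmented on the first call
cachedCount('👋 안녕!'); // served from the cache on the second call
```

The cache is bounded so that long-running processes with many distinct inputs do not grow without limit; whether caching pays off at all depends on how often the same strings recur.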

Example: Build an advanced grapheme matcher

graphemeSegments() also exposes some information gathered during segmentation, which enables a few advanced use cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

function* matchEmoji(input) {
  for (const { segment, _catBegin } of graphemeSegments(input)) {
    // `_catBegin` identified as Extended_Pictographic means the segment is emoji
    if (_catBegin === GraphemeCategory.Extended_Pictographic) {
      yield segment;
    }
  }
}

[...matchEmoji('1🌷2🎁3💩4😜5👍')]
// 0: 🌷
// 1: 🎁
// 2: 💩
// 3: 😜
// 4: 👍

Export unicode-segmenter/intl-adapter

Intl.Segmenter API adapter (only granularity: "grapheme" is supported so far)

import { Segmenter } from 'unicode-segmenter/intl-adapter';

// Same API as `Intl.Segmenter`
const segmenter = new Segmenter();
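Because the adapter follows the Intl.Segmenter interface, code written against that interface should work with either implementation. The sketch below shows one such use, grapheme-aware truncation. `truncateGraphemes` is a hypothetical helper, and the built-in Intl.Segmenter is used here only so the snippet runs as-is; passing an instance of the adapter's `Segmenter` instead is assumed to behave the same way.

```javascript
// Hypothetical helper: keeps at most `max` grapheme clusters of `text`.
// `segmenter` is any object with an Intl.Segmenter-compatible `segment()` method.
function truncateGraphemes(segmenter, text, max) {
  let end = 0;
  let count = 0;
  for (const { segment, index } of segmenter.segment(text)) {
    if (count === max) break;
    end = index + segment.length; // offset just past this cluster, in UTF-16 code units
    count += 1;
  }
  return text.slice(0, end);
}

const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// 'e' + U+0301, then 'o' + U+0308 + U+0332, then '!'
const sample = 'e\u0301o\u0308\u0332!';
truncateGraphemes(nativeSegmenter, sample, 2); // keeps the first two clusters intact
```

Slicing by `String.prototype.length` instead could cut a cluster in half and leave a dangling combining mark, which is the failure this approach avoids.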

Export unicode-segmenter/intl-polyfill

Intl.Segmenter API polyfill (only granularity: "grapheme" is supported so far)

// Apply polyfill to the `globalThis.Intl` object.
import 'unicode-segmenter/intl-polyfill';

const segmenter = new Intl.Segmenter();

Export unicode-segmenter/emoji

Utilities for matching emoji-like characters.

Example: Use Unicode emoji property matchers

import {
  isEmojiPresentation,    // match \p{Emoji_Presentation}
  isExtendedPictographic, // match \p{Extended_Pictographic}
} from 'unicode-segmenter/emoji';

isEmojiPresentation('😍'.codePointAt(0));
// => true
isEmojiPresentation('♡'.codePointAt(0));
// => false

isExtendedPictographic('😍'.codePointAt(0));
// => true
isExtendedPictographic('♡'.codePointAt(0));
// => true

Export unicode-segmenter/general

Utilities for matching alphanumeric characters.

Example: Use Unicode general property matchers

import {
  isLetter,       // match \p{L}
  isNumeric,      // match \p{N}
  isAlphabetic,   // match \p{Alphabetic}
  isAlphanumeric, // match [\p{N}\p{Alphabetic}]
} from 'unicode-segmenter/general';
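The comments above describe each matcher in terms of a Unicode property escape. To make those semantics concrete, here is a sketch of regex-based reference versions built on the standard `u`-flag property escapes. The `ref*` names are hypothetical, this is not how the library implements its matchers, and it assumes the library's matchers take a code point like the emoji matchers do.

```javascript
// Hypothetical reference matchers; each takes a code point (number).
const testCodePoint = (pattern) => (cp) => pattern.test(String.fromCodePoint(cp));

const refIsLetter = testCodePoint(/^\p{L}$/u);
const refIsNumeric = testCodePoint(/^\p{N}$/u);
const refIsAlphabetic = testCodePoint(/^\p{Alphabetic}$/u);
const refIsAlphanumeric = testCodePoint(/^[\p{N}\p{Alphabetic}]$/u);

// U+2167 ROMAN NUMERAL EIGHT is a letter-number (Nl): it is numeric and
// alphabetic, but it is not in the Letter (L) category.
const romanEight = 0x2167;
console.log(refIsLetter(romanEight), refIsNumeric(romanEight), refIsAlphabetic(romanEight));
```

The results of property escapes depend on the Unicode version shipped with the JavaScript engine, so edge cases may differ from a library that bundles its own Unicode data.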

Export unicode-segmenter/utils

You can access some internal utilities to deal with JavaScript strings.

Example: Handle UTF-16 surrogate pairs

import {
  isHighSurrogate,
  isLowSurrogate,
  surrogatePairToCodePoint,
} from 'unicode-segmenter/utils';

const u32 = '😍';
const hi = u32.charCodeAt(0);
const lo = u32.charCodeAt(1);

if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
  const codePoint = surrogatePairToCodePoint(hi, lo);
  // => equivalent to u32.codePointAt(0)
}
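For readers unfamiliar with UTF-16, the arithmetic behind a surrogate-pair conversion can be sketched as follows. These are hypothetical reference helpers written from the UTF-16 definition, not the library's implementation.

```javascript
// Hypothetical reference helpers based on the UTF-16 encoding rules.
// High surrogates are 0xD800..0xDBFF, low surrogates are 0xDC00..0xDFFF;
// each carries 10 bits of the code point offset from 0x10000.
function decodeSurrogatePair(hi, lo) {
  return (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
}

function encodeSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

const astral = '\u{1F60D}'; // a character outside the BMP
const decoded = decodeSurrogatePair(astral.charCodeAt(0), astral.charCodeAt(1));
console.log(decoded === astral.codePointAt(0));
```

Neither helper validates its input, so callers should first check that the two code units really are a high and a low surrogate, as the example above does with isHighSurrogate() and isLowSurrogate().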

Example: Determine the length of a character

import { isBMP } from 'unicode-segmenter/utils';

const char = '😍'; // .length = 2
const cp = char.codePointAt(0);

// Parentheses are required: `===` binds tighter than `?:`
char.length === (isBMP(cp) ? 1 : 2);
// => true

Runtime Compatibility

unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.

To ensure compatibility, the runtime should support:

If the runtime doesn't support these features, they can easily be supplied with tools like Babel.

React Native Support

Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.

unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.

Comparison

unicode-segmenter aims to be lighter and faster than the alternatives in the ecosystem while remaining fully spec-compliant. To that end, the benchmark tracks the performance, bundle size, and Unicode version compliance of several libraries.

unicode-segmenter/grapheme vs graphemer, grapheme-splitter, @formatjs/intl-segmenter, unicode-segmentation, and Intl.Segmenter

JS Bundle Stats

| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
|---|---|---|---:|---:|---:|---:|
| unicode-segmenter/grapheme | 16.0.0 | ✔️ | 17,313 | 12,783 | 5,285 | 3,946 |
| graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
| grapheme-splitter | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
| @formatjs/intl-segmenter* | 15.0.0 | ✖️ | 491,043 | 318,721 | 54,248 | 34,380 |
| unicode-segmentation* | 16.0.0 | ✔️ | 56,529 | 52,443 | 24,110 | 17,343 |
| Intl.Segmenter* | - | - | 0 | 0 | 0 | 0 |

  • @formatjs/intl-segmenter handles grapheme, word, and sentence, but it's not tree-shakable.
  • The unicode-segmentation size includes only the minimum WASM binary and bindings needed to run the benchmark. It will increase if more features are exposed.
  • Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
  • Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Hermes Bytecode Stats

| Name | Bytecode size | Bytecode size (gzip)* |
|---|---:|---:|
| unicode-segmenter/grapheme | 24,386 | 12,690 |
| graphemer | 133,949 | 31,710 |
| grapheme-splitter | 63,810 | 19,125 |

  • It would be compressed when included as an app asset.

Runtime Performance

Here is a brief summary; see the archived benchmark results for details.

Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.

Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.

Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.

Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 3~8x faster than graphemer and 20~26x faster than grapheme-splitter, with the performance gap increasing with input size.

Performance in QuickJS: unicode-segmenter/grapheme is the only usable library in terms of performance.

Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build your own benchmark.
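If you want to build your own benchmark, a minimal starting point might look like the sketch below. `measureOpsPerSec` is a hypothetical helper, and the single candidate shown uses the built-in Intl.Segmenter so the snippet runs without dependencies; you would add entries that call `countGraphemes` or the other libraries you want to compare. A dedicated benchmarking tool will give more reliable numbers than this simple loop.

```javascript
// Hypothetical micro-benchmark helper: returns approximate operations per second.
function measureOpsPerSec(fn, input, iterations = 1000) {
  // Warm up so the JIT has a chance to optimize before timing starts.
  for (let i = 0; i < 100; i++) fn(input);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  // Guard against a zero reading on very fast runs or coarse timers.
  const elapsedMs = Math.max(performance.now() - start, Number.EPSILON);
  return (iterations / elapsedMs) * 1000;
}

const builtin = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
const candidates = {
  // Consume the iterator fully so the segmentation work is actually done.
  'Intl.Segmenter': (text) => {
    let n = 0;
    for (const _ of builtin.segment(text)) n += 1;
    return n;
  },
  // Assumption: add e.g. `'unicode-segmenter': countGraphemes` here.
};

const benchInput = 'Hello, \u{1F44B} \uC548\uB155! '.repeat(100);
for (const [name, fn] of Object.entries(candidates)) {
  console.log(name, Math.round(measureOpsPerSec(fn, benchInput)), 'ops/sec');
}
```

Results vary with input size and content, so it is worth running with text that resembles your real workload.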

Acknowledgments

LICENSE

MIT