Data for the stroke input method (筆畫輸入法) in Chinese

stroke-input/stroke-input-data

Conway Stroke Data

A data set compiled manually by Conway (@yawnoc), used in the Android keyboard app Stroke Input Method (筆畫輸入法).

Stroke input method (generic, not the app)

The (generic) stroke input method is found on all dumbphones in HK and surrounds.

It is the simplest Chinese input method in existence. All strokes are classified into 5 types, entered via keypad:

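| # | Stroke | Type       | Comment                    |
|---|--------|------------|----------------------------|
| 1 | 橫     | Horizontal | Includes rises (提) etc.   |
| 2 | 豎     | Vertical   |                            |
| 3 | 撇     | Throw      |                            |
| 4 | 點     | Dot        | Includes presses (捺)      |
| 5 | 折     | Break      | Basically everything else  |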

Picture of a dumbphone with stroke input method on keys 1 to 5.
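
To make the classification concrete, here is a minimal sketch of turning a character's strokes into the digits typed on the keypad. The names `STROKE_KEYS` and `encode_strokes` are hypothetical and chosen only for illustration; they are not part of this repository.

```python
# Keypad digits for the five stroke types, as classified above.
STROKE_KEYS = {
    "橫": "1",  # horizontal, includes rises (提)
    "豎": "2",  # vertical
    "撇": "3",  # throw
    "點": "4",  # dot, includes presses (捺)
    "折": "5",  # break, basically everything else
}


def encode_strokes(strokes):
    """Convert an iterable of stroke-type names into a keypad digit string."""
    return "".join(STROKE_KEYS[stroke] for stroke in strokes)


# 木 is written 橫, 豎, 撇, 捺; the final press (捺) is classified as a dot (點).
print(encode_strokes(["橫", "豎", "撇", "點"]))  # prints 1234
```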

Contents of this repository

A. Manually compiled data

The following files contain data manually compiled by Conway (@yawnoc):

  • codepoint-character-sequence.txt
      ◦ Tab-separated (code point, character, stroke sequence regex) triplets.
      ◦ There are 28k+ entries. Because Conway (@yawnoc) is human, it is highly likely that there are some mistakes; please report these.
      ◦ Licensed under CC-BY-4.0.
  • phrases-*.txt
      ◦ Lists of common phrases.
      ◦ To be sorted by running sort.py.
      ◦ Released into the public domain.
  • Rankings of commonly used characters.
      ◦ Released into the public domain.
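
As a rough illustration of how the (code point, character, stroke sequence regex) triplets in codepoint-character-sequence.txt might be consumed, the sketch below parses one tab-separated line and checks a typed digit string against its regex. The exact line layout (for example a `U+` prefix on the code point) is an assumption, the sample lines are made up for illustration rather than copied from the data file, and `parse_entry` and `matches` are hypothetical helper names.

```python
import re


def parse_entry(line):
    """Split one tab-separated line into (code point, character, compiled regex)."""
    code_point, character, sequence_regex = line.rstrip("\n").split("\t")
    return code_point, character, re.compile(sequence_regex)


def matches(entry, typed_sequence):
    """Return True if the typed digits match the entry's regex in full."""
    _, _, pattern = entry
    return pattern.fullmatch(typed_sequence) is not None


# Made-up sample lines; the layout is assumed, not taken from the real file.
plain = parse_entry("U+6728\t木\t1234\n")
# Placeholder entry showing how a regex can allow alternative stroke orders.
alternation = parse_entry("U+0058\tX\t12(3|4)5\n")

print(matches(plain, "1234"))        # prints True
print(matches(plain, "123"))         # prints False
print(matches(alternation, "1245"))  # prints True
```

Storing a regex rather than a single literal sequence lets one entry accept more than one way of entering the same character.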

B. Automatically generated data

The following files contain data automatically generated by running generate.py, which parses codepoint-character-sequence.txt:

  • characters-*.txt
      ◦ Lists of traditional-only and simplified-only characters.
      ◦ Released into the public domain.
  • sequence-characters.txt
      ◦ Tab-separated (stroke sequence, characters) pairs.
      ◦ Licensed under CC-BY-4.0.
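
The step from per-character regexes to (stroke sequence, characters) pairs can be pictured as an inversion. The sketch below is not the actual generate.py; it assumes, purely for illustration, that each regex consists only of digits and flat, non-nested alternation groups such as `(3|4)`. The names `expand` and `invert` and the sample entries are hypothetical.

```python
import itertools
import re
from collections import defaultdict


def expand(sequence_regex):
    """Expand digits plus flat (a|b) groups into every literal sequence."""
    # Split into literal runs and groups, e.g. "12(3|4)5" -> "12", "(3|4)", "5".
    parts = re.findall(r"\([^()]*\)|[^()]+", sequence_regex)
    choices = [
        part[1:-1].split("|") if part.startswith("(") else [part]
        for part in parts
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


def invert(entries):
    """Map each literal stroke sequence to the characters that have it."""
    characters_by_sequence = defaultdict(str)
    for character, sequence_regex in entries:
        for sequence in expand(sequence_regex):
            characters_by_sequence[sequence] += character
    return dict(characters_by_sequence)


# Made-up entries: X and Y are placeholders, not real data.
entries = [("木", "1234"), ("X", "12(3|4)5"), ("Y", "1235")]
for sequence, characters in sorted(invert(entries).items()):
    print(f"{sequence}\t{characters}")
```

With a sequence-keyed table like this, looking up candidates for typed digits becomes a plain dictionary lookup instead of a regex scan over every entry.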

C. Scripts

  • Defines shell functions s (search), sp (search prefix), ss (search suffix).
  • generate.py
      ◦ Script used to generate sequence-characters.txt and characters-*.txt (by parsing codepoint-character-sequence.txt).
      ◦ Licensed under MIT-0.
  • sort.py
      ◦ Script used to sort certain sections of phrases-*.txt.
      ◦ Licensed under MIT-0.

D. Tests

  • Unit tests for generate.py.
      ◦ Licensed under MIT-0.
  • Unit tests for sort.py.
      ◦ Licensed under MIT-0.

Miscellanea for convenient reference (in comments)

Unicode strokes

CJK Strokes (Unicode block) (U+31C0 to U+31E3):

㇀㇁㇂㇃㇄㇅㇆㇇㇈㇉㇊㇋㇌㇍㇎㇏
㇐㇑㇒㇓㇔㇕㇖㇗㇘㇙㇚㇛㇜㇝㇞㇟
㇠㇡㇢㇣

Unicode composition

Ideographic Description Characters (Unicode block) (U+2FF0 to U+2FFB):

⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻
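
Both listings above can be regenerated from their code point ranges. This is a small convenience sketch; `block_characters` is a hypothetical helper name, and the ranges are the ones stated above.

```python
def block_characters(first, last):
    """Return every character from code point first to last, inclusive."""
    return "".join(chr(code_point) for code_point in range(first, last + 1))


cjk_strokes = block_characters(0x31C0, 0x31E3)
description_characters = block_characters(0x2FF0, 0x2FFB)

print(cjk_strokes)
print(description_characters)
```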