becky82/mteh

More than enough Hanzi (MteH)

A curated set of ~4,270 simplified Chinese characters for advanced language learners.

The MteH corpus is designed as an "endgame corpus" for advanced students. Basically, if you learn these characters, you're practically "done for life" studying simplified Chinese characters (congratulations!). Obviously, there are more simplified Chinese characters than this (in proper nouns, scientific terms, chengyu, Chinese history, online usernames, etc.), but at a certain point you've got to draw the line and say "this is my endgame".

Currently, MteH focuses entirely on simplified Chinese characters, especially those you’ll encounter in mainland China and in HSK exams.

There is also an Anki deck (here), which should already work but is still a work in progress. (On a computer, AnkiDraw allows you to handwrite.)


Summary

The MteH corpus is built to minimize "missing" characters; any characters not included are extremely rare or niche. The initial version (v0.1.1) merges the following corpora:

| # | Corpus | #chars | #used | Source / Reference |
|---|--------|--------|-------|--------------------|
| 1 | HSK 1.0 | 2,866 | 2,866 | pre-2010, 11 levels |
| 2 | HSK 2.0 | 2,663 | 2,663 | post-2010, 6 levels |
| 3 | HSK 3.0 | 3,000 | 3,000 | 2021 version, 9 levels |
| 4 | TOCFL | 3,027* | 2,998 | Taiwan's TOCFL 3100 + 33 traditional chars |
| 5 | K-5 | 1,817 | 1,816 | K-5 word frequency |
| 6 | 通用规范汉字表 | 3,500 | 3,500 | Ministry of Education (2013) |
| 7 | 现代汉语常用字表 | 3,500 | 3,498 | Ministry of Education (1988) |
| 8 | primary school | 2,468 | 2,467 | China primary schools (2016) |
| 9 | Singapore | 1,655 | 1,655 | Singapore primary schools (2015) |
| 10 | Heisig | 3,018 | 3,018 | Heisig & Richardson, Remembering Simplified Hanzi I–II |
| 11 | Hoenig | 2,177 | 2,151 | Learn & Remember 2,178 Characters and Their Meanings |
| 12 | Jun Da | 4,485* | 4,100 | modern Chinese corpus |
| 13 | SUBTLEX | 4,462* | 4,034 | film and TV subtitle corpus |
| 14 | Tsai | 4,329* | 3,872 | Usenet newsgroups (1993-1994) |
| 15 | Wikipedia | 3,476* | 3,196 | Chinese Wikipedia |
| 16 | classical | 1,968* | 1,840 | prior to the end of the Han dynasty |
| 17 | THUOCL | 3,421* | 3,156 | mostly Sogou webpages |
| 18 | Leeds | 4,230* | 3,984 | Internet corpus |
| 19 | BLCU | 4,445* | 4,013 | "balanced", written Chinese |
| 20 | LWC | 4,130* | 3,863 | Sina Weibo |
| 21 | food | 1,182 | 1,093 | food-related terms |
| 22 | species | 4,086 | 3,121 | species names |
| 23 | Chinese surnames | 1,745 | 1,539 | 1,807 Chinese surnames |
| 24 | Chinese names | 2,269 | 1,948 | 1,200,000 Chinese names |
| 25 | city-geo | 1,277 | 1,116 | mainland China city terms |
| 26 | company | 4,363* | 3,554 | company proper nouns |
| 27 | med-orgs | 4,826 | 3,633 | medical organizations |

Those marked * have extraction steps (documented in their respective readmes): selection of top-N words/characters, conversion from traditional to simplified.
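As a rough illustration of the extraction steps mentioned above (top-N selection and traditional-to-simplified conversion) followed by a merge, here is a minimal Python sketch. It is not the repository's actual pipeline: the function names `top_n_chars` and `merge_corpora`, the two-entry `TRAD_TO_SIMP` table, and the toy word counts are all hypothetical, and a real conversion would need a complete mapping.

```python
from collections import Counter

# Toy traditional-to-simplified table, for illustration only (assumption).
# A real conversion needs a complete mapping and must handle one-to-many cases.
TRAD_TO_SIMP = {"學": "学", "國": "国"}


def top_n_chars(word_freqs, n):
    """Return the n most frequent characters in a {word: count} mapping.

    Each character inherits the count of every word it appears in, after
    being mapped to its simplified form where the toy table knows one.
    """
    counts = Counter()
    for word, freq in word_freqs.items():
        for ch in word:
            counts[TRAD_TO_SIMP.get(ch, ch)] += freq
    return [ch for ch, _ in counts.most_common(n)]


def merge_corpora(corpora):
    """Union the character lists of several corpora into one set."""
    merged = set()
    for chars in corpora.values():
        merged.update(chars)
    return merged


# Hypothetical word-frequency data, not taken from any listed corpus.
word_freqs = {"學生": 10, "中國": 7, "学": 5}
top_two = top_n_chars(word_freqs, 2)

merged = merge_corpora({
    "corpus_a": ["学", "生"],
    "corpus_b": ["生", "国"],
})
```

Taking a union means a character only has to appear in one source to be a candidate, which fits the stated goal of minimizing "missing" characters. The per-corpus readmes remain the authority on how each starred corpus was actually processed.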

Characters are ordered in Unicode order (excluding variants), grouping visually or structurally related forms as much as possible.
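The Unicode ordering can be sketched in Python by sorting on code point. This is a minimal sketch under assumptions: the helper name `unicode_order` is hypothetical, the filter keeps only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), and the handling of variants and the grouping of related forms are not attempted here.

```python
def unicode_order(chars):
    """Deduplicate characters and sort them by Unicode code point.

    Only characters in the basic CJK Unified Ideographs block
    (U+4E00..U+9FFF) are kept; this range filter is an assumption
    made for the sketch, not a rule stated by the project.
    """
    unique = {ch for ch in chars if 0x4E00 <= ord(ch) <= 0x9FFF}
    return sorted(unique, key=ord)


# 一 U+4E00, 中 U+4E2D, 国 U+56FD, 学 U+5B66, 生 U+751F
ordered = unicode_order(["学", "生", "一", "国", "中", "学", "A"])
```

Sorting by code point is deterministic and needs no external data, and within this block it tends to keep characters that share a radical near each other.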

MteH also incorporates:

Statistics and debug reports: missing chars; corpus histogram; debug; modifications; syllables.


License

© 2025 Rebecca J. Stones
Licensed under CC BY-SA 4.0