A curated set of ~4,270 simplified Chinese characters for advanced language learners.
The MteH corpus is designed as an "endgame corpus" for advanced students. Basically, if you learn these characters, you're practically "done for life" studying simplified Chinese characters (congratulations!). Obviously, there are more simplified Chinese characters than this (in proper nouns, scientific terms, chengyu, Chinese history, online usernames, etc.), but at a certain point you've got to draw the line and say "this is my endgame".
Currently, MteH focuses entirely on simplified Chinese characters, especially those you’ll encounter in mainland China and in HSK exams.
- MteH corpus (v0.1.2) (plain text)
- Handwriting practice (PDFs to print out)
- Extra characters (good to know, but not part of MteH)
- Repeated-component characters
- Periodic table of the elements
- Province abbreviations
- Characters/words using or related to 虫 (insects; lower life forms)
- Characters/words using or related to 鸟 (birds)
- Characters/words using or related to 鱼 (fish)
- Characters/words using or related to 木 (trees; wood)
There is also an Anki deck, which should already work but is best treated as a work in progress. (On a computer, AnkiDraw allows you to handwrite.)
The MteH corpus is built to minimize "missing" characters; any characters not included are extremely rare or niche. The initial version (v0.1.1) merges the following corpora:
| # | Corpus | #chars | #used | Source / Reference |
|---|---|---|---|---|
| 1 | HSK 1.0 | 2,866 | 2,866 | pre-2010, 11 levels |
| 2 | HSK 2.0 | 2,663 | 2,663 | post-2010, 6 levels |
| 3 | HSK 3.0 | 3,000 | 3,000 | 2021 version, 9 levels |
| 4 | TOCFL | 3,027* | 2,998 | Taiwan's TOCFL 3100 + 33 traditional chars |
| 5 | K-5 | 1,817 | 1,816 | K-5 word frequency |
| 6 | 通用规范汉字表 | 3,500 | 3,500 | Ministry of Education (2013) |
| 7 | 现代汉语常用字表 | 3,500 | 3,498 | Ministry of Education (1988) |
| 8 | primary school | 2,468 | 2,467 | China primary schools (2016) |
| 9 | Singapore | 1,655 | 1,655 | Singapore primary schools (2015) |
| 10 | Heisig | 3,018 | 3,018 | Heisig & Richardson, Remembering Simplified Hanzi I–II |
| 11 | Hoenig | 2,177 | 2,151 | Learn & Remember 2,178 Characters and Their Meanings |
| 12 | Jun Da | 4,485* | 4,100 | modern Chinese corpus |
| 13 | SUBTLEX | 4,462* | 4,034 | film and TV subtitle corpus |
| 14 | Tsai | 4,329* | 3,872 | Usenet newsgroups (1993-1994) |
| 15 | Wikipedia | 3,476* | 3,196 | Chinese Wikipedia |
| 16 | classical | 1,968* | 1,840 | prior to the end of the Han dynasty |
| 17 | THUOCL | 3,421* | 3,156 | mostly Sogou webpages |
| 18 | Leeds | 4,230* | 3,984 | Internet corpus |
| 19 | BLCU | 4,445* | 4,013 | "balanced", written Chinese |
| 20 | LWC | 4,130* | 3,863 | Sina Weibo |
| 21 | food | 1,182 | 1,093 | food-related terms |
| 22 | species | 4,086 | 3,121 | species names |
| 23 | Chinese surnames | 1,745 | 1,539 | 1,807 Chinese surnames |
| 24 | Chinese names | 2,269 | 1,948 | 1,200,000 Chinese names |
| 25 | city-geo | 1,277 | 1,116 | mainland China city terms |
| 26 | company | 4,363* | 3,554 | company proper nouns |
| 27 | med-orgs | 4,826 | 3,633 | medical organizations |
Corpora whose character counts are marked * required extraction steps (documented in their respective readmes): selecting the top-N words or characters, and converting traditional characters to simplified.
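The extraction and merge steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the tiny `TRAD_TO_SIMP` table, and the CJK range check are all assumptions made for the example (a real conversion would need a full traditional-to-simplified mapping).

```python
from collections import Counter

# Hypothetical, deliberately tiny mapping for illustration only.
TRAD_TO_SIMP = {"鳥": "鸟", "魚": "鱼", "蟲": "虫"}


def top_n_chars(text: str, n: int) -> list[str]:
    """Return the n most frequent characters in the basic CJK block."""
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return [ch for ch, _ in counts.most_common(n)]


def to_simplified(chars: list[str]) -> list[str]:
    """Map traditional characters to simplified; leave others unchanged."""
    return [TRAD_TO_SIMP.get(ch, ch) for ch in chars]


def merge_corpora(corpora: dict[str, list[str]]) -> set[str]:
    """Union the character lists of all corpora into one set."""
    merged: set[str] = set()
    for chars in corpora.values():
        merged.update(chars)
    return merged


# Example with made-up inputs: one frequency-based corpus, one fixed list.
corpora = {
    "subtitles": to_simplified(top_n_chars("鳥鳥鳥魚魚木", 2)),
    "word_list": ["鱼", "木"],
}
merged = merge_corpora(corpora)
```

Because the merge is a plain set union, a character that appears in several corpora is counted once in the merged result.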
Characters are ordered in Unicode order (excluding variants), grouping visually or structurally related forms as much as possible.
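As a rough sketch of what Unicode ordering means in practice, the snippet below sorts a set of characters by code point. The helper name `unicode_order` is an assumption for this example, and the handling of variants mentioned above is not modelled here.

```python
def unicode_order(chars: str) -> list[str]:
    """Return the distinct characters sorted by Unicode code point."""
    return sorted(set(chars), key=ord)


# Print each character alongside its code point.
for ch in unicode_order("鱼木鸟虫鱼"):
    print(f"U+{ord(ch):04X} {ch}")
```

Python's default string comparison already follows code points, so `key=ord` is only there to make the intent explicit.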
MteH also incorporates:
- Character structure data and character drawings from Make Me a Hanzi and cjkvi-ids
- Frequency data from Jun Da’s modern corpus
- Images from Pexels, Wikimedia, etc.
Statistics and debug reports: missing chars; corpus histogram; debug; modifications; syllables.
© 2025 Rebecca J. Stones
Licensed under CC BY-SA 4.0