Unicode 17.0.0 #1006

burgerrg · 2026-01-14T19:48:40Z

Support Unicode 17.0.0.

The function char-indic-break-property was added to support correct grapheme cluster identification for Indic scripts.

The grapheme cluster break test was updated to use the test file from the Unicode Consortium.

Follow unicode/Readme to make future Unicode updates.

mflatt · 2026-01-15T02:53:49Z

Would the name char-indic-conjunct-break-property (with property at the end) for the new function better parallel char-grapheme-break-property?

Historical note: I originally implemented char-grapheme-step for Unicode 14.0. It looks like the Indic conjunct break refinement was added in Unicode 15.1. I like how the improved testing approach here will ensure that we don't miss those kinds of additions in the future.

burgerrg · 2026-01-15T15:08:21Z

I considered char-indic-conjunct-break-property, but the name in Unicode is Indic_Conjunct_Break.

The Unicode name for char-grapheme-break-property is Grapheme_Cluster_Break.

Maybe we should call it char-indic-break-property?

mflatt · 2026-01-15T17:28:43Z

The name char-indic-break-property would be fine with me.

The extra work to handle Indic conjuncts slows down char-grapheme-step by 10-30% in cases like (char-grapheme-step (integer->char #x0308) 0), (string-grapheme-count "\x0020;\x0308;\x000D;"), or (string-grapheme-count "\x3bb;\x3bb;"). About half of the slowdown comes from extra table lookups (the old implementation even tried to avoid calling char-extended-pictographic? if it could), and half comes from handling the extra information. Overall, since char-grapheme-step encodes a state machine over relatively small inputs, it could be turned into just a couple of table lookups.
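For readers less familiar with the API, here is a minimal sketch of how the state machine is driven from user code. It assumes the documented behavior of char-grapheme-step (two return values: whether a cluster has terminated, and the next state, with 0 as the initial state). The helper name count-graphemes is made up for illustration and is not part of the PR.

```scheme
;; Sketch only: count grapheme clusters by threading the state through
;; char-grapheme-step. count-graphemes is a hypothetical helper name.
(define (count-graphemes s)
  (let loop ([i 0] [state 0] [n 0])
    (if (fx= i (string-length s))
        ;; A nonzero state at the end means one last cluster is still pending.
        (if (fx= state 0) n (fx+ n 1))
        (let-values ([(boundary? new-state)
                      (char-grapheme-step (string-ref s i) state)])
          (loop (fx+ i 1) new-state (if boundary? (fx+ n 1) n))))))

;; Should agree with the built-in counter on the strings used for timing.
(count-graphemes "\x3bb;\x3bb;")
(string-grapheme-count "\x3bb;\x3bb;")
```

Since every character goes through one call to char-grapheme-step, the cost of that call dominates loops like this one.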

I pushed a commit at https://github.com/mflatt/ChezScheme/tree/grapheme-step-table for your consideration. It moves your improved grapheme-step implementation into "extract-char-cases.ss" and generates a table from encoded property information plus the current state (mapping to a new state). This makes char-grapheme-step about twice as fast as before the PR. The table is fairly small: it occupies about 16k on a 64-bit architecture. Meanwhile, merging the information in grapheme-cluster-break-table and indic-conjunct-break-table saves about 74k. (In the end, total growth relative to before the PR is about 38k, instead of about 95k.)
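To illustrate the general idea (not the actual code in the commit), here is a toy sketch of turning a branching step function into a precomputed table indexed by the current state combined with an encoded property. The four property codes, the four states, and all names below are invented for the example; the real machine has many more of each.

```scheme
;; Toy model only: made-up property codes and states, not the real tables.
(define prop-other 0)
(define prop-extend 1)
(define prop-cr 2)
(define prop-lf 3)
(define prop-bits 2)
(define prop-count 4)
;; States: 0 = nothing pending, 1 = cluster pending,
;;         2 = CR pending, 3 = LF pending.
(define state-count 4)

;; Reference step written as ordinary branching code.
;; Returns (values boundary? new-state).
(define (slow-step prop state)
  (cond
    [(fx= prop prop-other) (values (not (fx= state 0)) 1)]
    [(fx= prop prop-extend) (values (fx> state 1) 1)]
    [(fx= prop prop-cr) (values (not (fx= state 0)) 2)]
    [else ; prop-lf
     (if (or (fx= state 1) (fx= state 3))
         (values #t 3)      ; previous cluster ends, LF still pending
         (values #t 0))]))  ; lone LF or CR LF completes here

;; Precompute every (state, prop) pair once. Each entry packs the new
;; state shifted left by one, with the boundary flag in the low bit.
(define step-table
  (let ([tbl (make-vector (fxsll state-count prop-bits) 0)])
    (do ([state 0 (fx+ state 1)]) ((fx= state state-count) tbl)
      (do ([prop 0 (fx+ prop 1)]) ((fx= prop prop-count))
        (let-values ([(boundary? new-state) (slow-step prop state)])
          (vector-set! tbl
                       (fxlogor (fxsll state prop-bits) prop)
                       (fxlogor (fxsll new-state 1) (if boundary? 1 0))))))))

;; The fast path is a single vector reference plus bit extraction.
(define (fast-step prop state)
  (let ([entry (vector-ref step-table (fxlogor (fxsll state prop-bits) prop))])
    (values (fxlogtest entry 1) (fxsrl entry 1))))
```

Generating the table from the branching version keeps a readable specification while the runtime path does no branching on properties.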

burgerrg · 2026-01-15T20:21:58Z

Using a table for the step function is a great idea! I incorporated your code and moved the $char-extended-pictographic? function so that it uses the updated grapheme-break table.

I also renamed the new function to char-indic-break-property.

mflatt · 2026-01-15T22:18:24Z

@burgerrg All of your changes look good to me!

I noticed a bug in the code that I added: if char-grapheme-step is given a fixnum that is too large as the state, then it can lead to an invalid memory reference. For example, (char-grapheme-step #\a 49880000) probably throws "invalid memory reference". I think the solution is to guard or mask the state argument passed to char-grapheme-step-lookup to force it in range. Would you like me to work on a repair, or would you prefer to do it?

burgerrg · 2026-01-16T00:36:16Z

Thanks for finding that! I masked out the invalid bits of the state fixnum. Please double-check that I have the right mask. I also found a couple of places in 5_4.mo that report expected character counts and updated them.
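As a rough illustration of the masking approach (a sketch with made-up sizes and names, not the code in the commit), the state can be forced into range with fxlogand before it is combined into the table index. This assumes the number of states is a power of two, which may not match the real table layout.

```scheme
;; Sketch only: toy-table, state-bits, and prop-bits are hypothetical.
(define state-bits 2)
(define prop-bits 2)
(define toy-table (make-vector (fxsll 1 (fx+ state-bits prop-bits)) 0))
(define state-mask (fx- (fxsll 1 state-bits) 1))

;; prop is assumed to be a valid property code already; only the
;; caller-supplied state needs to be forced into range.
(define (masked-lookup prop state)
  (vector-ref toy-table
              (fxlogor (fxsll (fxlogand state state-mask) prop-bits) prop)))

;; Any fixnum state, including a huge or negative one, now maps to a
;; valid index instead of reading past the end of the table.
(masked-lookup 0 49880000)
(masked-lookup 3 -1)
```

An out-of-range state then behaves like some valid state rather than raising an error, which fits a function meant to produce a result for any fixnum state.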

burgerrg · 2026-01-16T14:38:54Z

@mflatt, thank you for your help with this!

burgerrg force-pushed the unicode-17 branch from 3e9b2de to 6dc0903 January 14, 2026 20:03

Unicode 17.0.0

fed7a5a

burgerrg force-pushed the unicode-17 branch from 6dc0903 to fed7a5a January 14, 2026 20:12

turn grapheme-char-step into a table lookup

bec7c8c

Compute the step table when "extract-char-cases.ss" is run.

burgerrg added 3 commits January 15, 2026 13:38

rename to char-indic-break-property

6b56385

move $char-extended-pictographic? to unicode-char-cases

8c01ed9

use a vector to represent the list of properties

8cd9b96

burgerrg added 3 commits January 15, 2026 19:12

mask out invalid state bits

65a8e00

use fxlogtest

0891477

more test updates

05227d5

mflatt approved these changes Jan 16, 2026

simplify transition graph building

207bd39

burgerrg force-pushed the unicode-17 branch from d317cb9 to 207bd39 January 16, 2026 14:35

burgerrg merged commit 111e225 into main Jan 16, 2026
0 of 32 checks passed

burgerrg deleted the unicode-17 branch January 16, 2026 14:37
