@burgerrg
Contributor

@burgerrg burgerrg commented Jan 14, 2026

Support Unicode 17.0.0.

The function char-indic-break-property was added to support correct grapheme cluster identification for Indic scripts.

The grapheme cluster break test was updated to use the test file from the Unicode Consortium.
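For readers unfamiliar with the Unicode Consortium's break test files, here is a minimal Python sketch of how one line of such a file can be read. It assumes the usual layout of those files, where `÷` marks an allowed break, `×` marks a forbidden break, code points are hexadecimal, and `#` starts a comment. The function name and the sample line are hypothetical, and this is not the code used by the Chez Scheme test suite.

```python
def parse_break_test_line(line):
    """Return the expected grapheme clusters (as strings) for one test line."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    clusters = []
    current = ""
    for token in data.split():
        if token == "\u00f7":              # DIVISION SIGN: break allowed here
            if current:
                clusters.append(current)
                current = ""
        elif token == "\u00d7":            # MULTIPLICATION SIGN: no break here
            continue
        else:                              # hexadecimal code point
            current += chr(int(token, 16))
    if current:
        clusters.append(current)
    return clusters


# Hand-written sample in the file's format: SPACE, COMBINING DIAERESIS, CR.
sample = "\u00f7 0020 \u00d7 0308 \u00f7 000D \u00f7\t# hand-written sample line"
clusters = parse_break_test_line(sample)
print(len(clusters))
```

A test driver could then compare `len(clusters)` and the cluster contents against whatever the implementation under test reports for the concatenated string.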

Follow unicode/Readme to make future Unicode updates.

@mflatt
Contributor

mflatt commented Jan 15, 2026

Would the name char-indic-conjunct-break-property (with property at the end) for the new function better parallel char-grapheme-break-property?

Historical note: I originally implemented char-grapheme-cluster-step for Unicode 14.0. It looks like the Indic conjunct break refinement was added in Unicode 15.0. I like how the improved testing approach here will ensure that we don't miss those kinds of additions in the future.

@burgerrg
Contributor Author

I considered char-indic-conjunct-break-property, but the name in Unicode is Indic_Conjunct_Break.

The Unicode name for char-grapheme-break-property is Grapheme_Cluster_Break.

Maybe we should call it char-indic-break-property?

Compute the step table when "extract-char-cases.ss" is run.
@mflatt
Contributor

mflatt commented Jan 15, 2026

The name char-indic-break-property would be fine with me.

The extra work to handle Indic conjuncts slows down char-grapheme-step by 10-30% in cases like (char-grapheme-step (integer->char #x0308) 0), (string-grapheme-count "\x0020;\x0308;\x000D;"), or (string-grapheme-count "\x3bb;\x3bb;"). About half of the increase comes from extra table lookups (the old implementation even tried to avoid char-extended-pictographic? when it could), and half from handling the extra information. Overall, since char-grapheme-step encodes a state machine over relatively small inputs, it could be turned into just a couple of table lookups.
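To make the benchmark inputs above easier to read, this small Python sketch lists the code points in the two test strings using the standard `unicodedata` module. It only decodes the inputs; it does not measure anything and is unrelated to the Chez Scheme implementation. The dictionary labels are hypothetical.

```python
import unicodedata

# The two string inputs from the timing examples, written with Python escapes.
inputs = {
    "space + diaeresis + CR": "\u0020\u0308\u000D",
    "two lambdas": "\u03bb\u03bb",
}

for label, text in inputs.items():
    print(label)
    for ch in text:
        # Control characters have no name, so supply a default.
        name = unicodedata.name(ch, "<control>")
        print(f"  U+{ord(ch):04X} {name} combining={unicodedata.combining(ch)}")
```

U+0308 is a combining mark, which is why it is a useful probe for the "no break before an extending character" path of the step function.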

I pushed a commit at https://github.com/mflatt/ChezScheme/tree/grapheme-step-table for your consideration. It moves your improved grapheme-step implementation into "extract-char-cases.ss" and generates a table from encoded property information plus the current state (mapping to a new state). This makes char-grapheme-step about twice as fast as before the PR. The table is fairly small: it occupies about 16k on a 64-bit architecture. Meanwhile, merging the information in grapheme-cluster-break-table and indic-conjunct-break-table saves about 74k. (In the end, total growth relative to before the PR is about 38k, instead of about 95k.)
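The idea of generating a table from (state, property) pairs can be illustrated with a toy Python sketch. It assumes a deliberately tiny rule set (CR LF stays together, break around CR and LF, no break before a combining mark) and approximates the Extend property with `unicodedata.combining`. All names, the state encoding, and the table layout are hypothetical; the real table in "extract-char-cases.ss" covers the full Unicode rules and is organized differently.

```python
import unicodedata

OTHER, CR, LF, EXTEND = range(4)   # toy property classes
NUM_PROPS = 4
START = 0                          # state 0: no previous character
NUM_STATES = NUM_PROPS + 1         # START plus "previous character had <prop>"


def slow_step(state, prop):
    """Rule-based step: returns (boundary_before_char, new_state)."""
    new_state = prop + 1
    if state == START:
        return True, new_state     # a cluster always starts at start of text
    prev = state - 1
    if prev == CR and prop == LF:
        return False, new_state    # keep CR LF together
    if prev in (CR, LF) or prop in (CR, LF):
        return True, new_state     # break around CR and LF
    if prop == EXTEND:
        return False, new_state    # no break before an extending mark
    return True, new_state


# Precompute every transition once; each entry packs the new state and the
# boundary flag into one small integer.
STEP_TABLE = []
for state in range(NUM_STATES):
    for prop in range(NUM_PROPS):
        boundary, new_state = slow_step(state, prop)
        STEP_TABLE.append((new_state << 1) | int(boundary))


def fast_step(state, prop):
    """Table-driven step: a single lookup replaces the rule cascade."""
    entry = STEP_TABLE[state * NUM_PROPS + prop]
    return bool(entry & 1), entry >> 1


def classify(ch):
    """Toy classifier; combining class is only an approximation of Extend."""
    if ch == "\r":
        return CR
    if ch == "\n":
        return LF
    if unicodedata.combining(ch) != 0:
        return EXTEND
    return OTHER


def count_clusters(text):
    state, count = START, 0
    for ch in text:
        boundary, state = fast_step(state, classify(ch))
        count += boundary          # True counts as 1
    return count
```

With this layout the per-character cost is one property lookup plus one table lookup, which is the shape of the speedup described above, although the toy says nothing about the actual timings.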

@burgerrg
Contributor Author

Using a table for the step function is a great idea! I incorporated your code and moved the $char-extended-pictographic? function so that it uses the updated grapheme-break table.

I also renamed the new function to char-indic-break-property.

@mflatt
Contributor

mflatt commented Jan 15, 2026

@burgerrg All of your changes look good to me!

I noticed a bug in the code that I added: if char-grapheme-step is given a fixnum that is too large as the state, then it can lead to an invalid memory reference. For example, (char-grapheme-step #\a 49880000) probably throws "invalid memory reference". I think the solution is to guard or mask the argument to char-grapheme-step-lookup to force it into range. Would you like me to work on a repair, or would you prefer to do it?

@burgerrg
Contributor Author

Thanks for finding that! I masked out the state fixnum. Please double-check that I have the right mask. I also found a couple places in 5_4.mo that report expected character counts and updated them.
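As a rough illustration of the masking approach, the Python sketch below forces a caller-supplied state into range before it is used as a table index. It assumes a table whose size is a power of two so that a bit mask is sufficient; the bit widths, names, and table contents are hypothetical and do not reflect the actual mask used in the commit. In Python an out-of-range index raises `IndexError` rather than touching invalid memory, so the sketch shows only the arithmetic.

```python
STATE_BITS = 3                               # hypothetical: room for 8 states
PROP_BITS = 2                                # hypothetical: room for 4 properties
STATE_MASK = (1 << STATE_BITS) - 1
PROP_MASK = (1 << PROP_BITS) - 1

# Table padded to a power-of-two size so every masked index is valid.
# Each slot holds its own index here, purely for demonstration.
TABLE = list(range(1 << (STATE_BITS + PROP_BITS)))


def lookup(state, prop):
    """Mask both inputs so the computed index can never leave the table."""
    index = ((state & STATE_MASK) << PROP_BITS) | (prop & PROP_MASK)
    return TABLE[index]


# An arbitrary large state no longer indexes past the end of the table.
print(lookup(49880000, 0))
```

The trade-off is that an invalid state silently aliases to a valid one instead of being reported; a range check that raises an error would be the "guard" alternative.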

@burgerrg burgerrg merged commit 111e225 into main Jan 16, 2026
0 of 32 checks passed
@burgerrg burgerrg deleted the unicode-17 branch January 16, 2026 14:37
@burgerrg
Contributor Author

@mflatt, thank you for your help with this!
