CaseInsensitive mapping generator tool #3

dilijev · 2017-01-12T19:16:16Z

Tool for taking UnicodeData.txt and CaseFolding.txt and generating a case-insensitive mapping table.

I've learned that the MappingSource designation is not actually meaningful in the semantic sense one would expect. In the table in CaseInsensitive.cpp, the real difference between MappingSource::UnicodeData and MappingSource::CaseFolding is that rows tagged with the former are mapped under /i even without /u, while rows tagged with the latter are mapped only when /u is also supplied. I therefore classified the rows in the table manually.
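To illustrate the distinction described above, here is a minimal TypeScript sketch. It is not the code in CaseInsensitive.cpp: the `Row` shape, the `rowApplies` helper, and the two sample rows are hypothetical, and the Cherokee range is used only as an example of a row that folds with /u alone.

```typescript
// Hypothetical names throughout; this mirrors the behavior described above,
// not the actual definitions in CaseInsensitive.cpp.
enum MappingSource {
    UnicodeData, // row applies under /i, with or without /u
    CaseFolding, // row applies under /i only when /u is also present
}

interface Row {
    source: MappingSource;
    lo: number; // first code point covered by the row
    hi: number; // last code point covered by the row (inclusive)
}

// Decide whether a table row takes part in case-insensitive matching.
function rowApplies(row: Row, ignoreCase: boolean, unicode: boolean): boolean {
    if (!ignoreCase) {
        return false; // the table is only consulted for /i
    }
    return row.source === MappingSource.UnicodeData || unicode;
}

// Illustrative rows (assumed values, not copied from the real table).
const latin: Row = { source: MappingSource.UnicodeData, lo: 0x0041, hi: 0x005a };
const cherokee: Row = { source: MappingSource.CaseFolding, lo: 0x13a0, hi: 0x13f5 };
```

Under this reading, the tag is effectively a "needs /u" flag rather than a record of which data file a row came from, which is why the rows had to be classified by hand.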

It appears there is still a bug in the transitive closure over equivalence classes, which misses a couple of cases. I was able to use the existing table and manually correct these issues where relevant.
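For readers unfamiliar with the idea, a transitive closure over case mappings can be sketched with a union-find structure. This is one possible approach and not necessarily how the tool does it; `closeEquivalences` is a hypothetical helper, and its input is assumed to be a list of code point pairs that map to each other.

```typescript
// Hypothetical helper: merges pairwise case mappings into equivalence classes
// using union-find, so that a~b and b~c put a, b and c in one class.
function closeEquivalences(pairs: Array<[number, number]>): number[][] {
    const parent = new Map<number, number>();

    const find = (x: number): number => {
        if (!parent.has(x)) {
            parent.set(x, x); // first time we see x: it is its own root
        }
        let root = x;
        while (parent.get(root) !== root) {
            root = parent.get(root)!;
        }
        // Path compression: point every node on the walk directly at the root.
        let cur = x;
        while (cur !== root) {
            const next = parent.get(cur)!;
            parent.set(cur, root);
            cur = next;
        }
        return root;
    };

    for (const [a, b] of pairs) {
        const ra = find(a);
        const rb = find(b);
        if (ra !== rb) {
            parent.set(ra, rb); // union the two classes
        }
    }

    // Group every code point under its root, then sort for stable output.
    const classes = new Map<number, number[]>();
    for (const cp of Array.from(parent.keys())) {
        const root = find(cp);
        const members = classes.get(root) ?? [];
        members.push(cp);
        classes.set(root, members);
    }
    return Array.from(classes.values())
        .map(c => c.sort((x, y) => x - y))
        .sort((x, y) => x[0] - y[0]);
}

// U+0345, U+03B9 and U+1FBE all uppercase to U+0399, so they share one class.
const classes = closeEquivalences([
    [0x0345, 0x0399],
    [0x03b9, 0x0399],
    [0x1fbe, 0x0399],
    [0x0041, 0x0061],
]);
```

A closure bug of the kind mentioned above would typically show up as a class like the Greek iota one being split into two or more smaller classes.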

See also the source code at the HEAD of this branch: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

Associated with: chakra-core#2356

/cc @tcare


…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs.

Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
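The normalization step mentioned in the overview above (fixing the casing and sorting the deltas on each line in ascending order) might look roughly like the following. This is only a sketch: `normalizeDeltas` and `formatCodePoint` are hypothetical helpers, and it assumes that "casing" refers to the letter case of hexadecimal literals, which the description does not state.

```typescript
// Hypothetical helper: return a copy of a row's deltas in ascending numeric order.
function normalizeDeltas(deltas: number[]): number[] {
    // slice() copies first so the caller's array is left untouched;
    // the comparator is needed because the default sort is lexicographic.
    return deltas.slice().sort((a, b) => a - b);
}

// Hypothetical helper: render a code point as a lowercase hex literal,
// left-padded to at least four digits (an assumed formatting convention).
function formatCodePoint(cp: number): string {
    const hex = cp.toString(16).toLowerCase();
    const padded = hex.length >= 4 ? hex : ("0000" + hex).slice(-4);
    return "0x" + padded;
}

const sortedDeltas = normalizeDeltas([32, 0, -32]);
const formatted = formatCodePoint(0x3b9);
```

Putting the existing table and the generated table into one canonical form first means that later diffs show only real mapping changes rather than formatting noise.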


mathiasbynens · 2017-01-14T09:16:58Z

tools/Unicode/CaseInsensitive/prototypes.ts

+        return ("0000" + this).slice(-4); // take the last four characters after left-padding
+    }
+
+    proto.toCodepoint = function (): number {


Nit: should be toCodePoint for consistency with codePoint elsewhere

dilijev and others added 30 commits January 1, 2017 21:57

initial code

e556a1f

don't use readline

5521cbb

convert to LF

9a2df79

Tried using Lazy module.

12771f9

Produce full list of equivalence classes; try to render to file.

7cb6a6e

Fix bug, render to file.

cfab014

mappings.ts (UnicodeData.txt to source table)

07b4326

Create mapping from UnicodeData.txt

baa700b

Arguments; initial CaseFoldingRecord

4f8c8f5

updated args parsing

e2aff0d

canonicalizeDeltas; createFromCaseFoldingRecord; various cleanup

8eb28f1

Cleanup

c6a5736

Sort and insertion order of Row objects.

95ac38a

comments cleanup

7e579d7

Refactored and moved files around to make this more maintainable. Output is binary equivalent to before the change.

583b69f

Tried to use modules properly, gave up and back to generating a single file.

ade76b6

Refactored tests and some cleanup.

4997ee9

Try to convert to modules and reuse code again.

4471d31

Update to use external modules properly.

7f09964

Extend prototypes in a more encapsulated way.

33ea95a

Restore TableToEquiv to working order.

8da62fb

Added EquivClass.ts

2d6c0fa

Update to use 'export default' and be more es6-modules-like.

1365a64

Use EquivClass to generate the table of UnicodeData mappings.

cebe669

createFromCaseFoldingEntry

4f52af1

Add package.json

19d7135

Folding

d7ffefb

Transitive closure finished, things looking mostly good. Now working on verification.

60dbdcd

format normalization

47c5b28

Row.expandRows

60054ea

dilijev added 6 commits January 11, 2017 11:18

regression-suite

c7a2722

Update regression suite generation and notes.

48e1524

Add UCD 7.0

ffe0005

update regression generation

8eb0e71

Add UCD 6.2 and 6.3

ea73457

Generate non-unicode-flag tests and check which ones should not fold without /u

197bf9f

dilijev mentioned this pull request Jan 12, 2017

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 chakra-core/ChakraCore#2356

Merged

mathiasbynens reviewed Jan 14, 2017

dilijev added 4 commits March 27, 2017 14:39

Update build settings.

51597f7

Add missing newlines at EOF

6e35ad1

Fix deprecation warning

5ae0ed7

WIP

2368d72

dilijev mentioned this pull request Nov 7, 2017

Code quality: in CaseInsensitive.cpp, add braces around entries and remove defunct gawk scripts chakra-core/ChakraCore#4152

Merged
