Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 #2356

dilijev · 2017-01-11T02:02:46Z

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

CaseInsensitive mapping generator tool

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

Test Coverage

Regex test run successful! Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures
Internal and slow tests pass.
Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per [Slow Tests, Legacy] UnifiedRegex failures in {x64,x86}_{debug,test} (branches: release/1.4, master) #2316 -- but running them locally on this branch, the tests pass.

Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.

kunalspathak · 2017-01-11T15:23:00Z

What was the scenario that you are trying to fix? Is there need to add some test cases?

dilijev · 2017-01-11T19:46:33Z

@kunalspathak - The initial commit I pushed was just a table normalization change that I just wanted to put up for preview so I could show to others. ~~Coming soon (before merging this) will be additional commits which fix #517 and add tests for the same. (I'll push the additional commits and amend the title and message of this PR at that time.)~~

Edit: the full PR is now up.

… Unicode 8.0 [1]. Fixed by breaking range 0x01B8, 0x01BD of pairs into the individual mappings, following the usual convention in this table of unrolling pairs rather than using an entry showing a range of pairs. Correct behavior under Under case-insensitive /i is: Shows 01B8 == 01B9 under /i. Shows 01BA != 01BB under /i. Shows 01BC == 01BD under /i. Relevant lines of the UnicodeData.txt for Unicode 8.0 [1]: 01B8;LATIN CAPITAL LETTER EZH REVERSED;Lu;0;L;;;;;N;LATIN CAPITAL LETTER REVERSED YOGH;;;01B9; 01B9;LATIN SMALL LETTER EZH REVERSED;Ll;0;L;;;;;N;LATIN SMALL LETTER REVERSED YOGH;;01B8;;01B8 01BA;LATIN SMALL LETTER EZH WITH TAIL;Ll;0;L;;;;;N;LATIN SMALL LETTER YOGH WITH TAIL;;;; 01BB;LATIN LETTER TWO WITH STROKE;Lo;0;L;;;;;N;LATIN LETTER TWO BAR;;;; 01BC;LATIN CAPITAL LETTER TONE FIVE;Lu;0;L;;;;;N;;;;01BD; 01BD;LATIN SMALL LETTER TONE FIVE;Ll;0;L;;;;;N;;;01BC;;01BCC -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.

…) in case-insensitive equivalence table which caused matches to be missed. Correct behavior under Under case-insensitive /i is: 01C4 == 01C5 == 01C6 : DZ WITH CARON (uppercase, titlecase, lowercase) 01C7 == 01C8 == 01C9 : LJ (uppercase, titlecase, lowercase) 01CA == 01CB == 01CC : NJ (uppercase, titlecase, lowercase) 01F1 == 01F2 == 01F3 : DZ (uppercase, titlecase, lowercase) Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: 01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;<compat> 0044 017D;;;;N;LATIN CAPITAL LETTER D Z HACEK;;;01C6;01C5 01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5 01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5 01C7;LATIN CAPITAL LETTER LJ;Lu;0;L;<compat> 004C 004A;;;;N;LATIN CAPITAL LETTER L J;;;01C9;01C8 01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;01C8 01C9;LATIN SMALL LETTER LJ;Ll;0;L;<compat> 006C 006A;;;;N;LATIN SMALL LETTER L J;;01C7;;01C8 01CA;LATIN CAPITAL LETTER NJ;Lu;0;L;<compat> 004E 004A;;;;N;LATIN CAPITAL LETTER N J;;;01CC;01CB 01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;01CB 01CC;LATIN SMALL LETTER NJ;Ll;0;L;<compat> 006E 006A;;;;N;LATIN SMALL LETTER N J;;01CA;;01CB 01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2 01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2 01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2 -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.

… added regression tests. Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: 01DE;LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON;Lu;0;L;00C4 0304;;;;N;LATIN CAPITAL LETTER A DIAERESIS MACRON;;;01DF; 01DF;LATIN SMALL LETTER A WITH DIAERESIS AND MACRON;Ll;0;L;00E4 0304;;;;N;LATIN SMALL LETTER A DIAERESIS MACRON;;01DE;;01DE 01E0;LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON;Lu;0;L;0226 0304;;;;N;LATIN CAPITAL LETTER A DOT MACRON;;;01E1; 01E1;LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON;Ll;0;L;0227 0304;;;;N;LATIN SMALL LETTER A DOT MACRON;;01E0;;01E0 01E2;LATIN CAPITAL LETTER AE WITH MACRON;Lu;0;L;00C6 0304;;;;N;LATIN CAPITAL LETTER A E MACRON;;;01E3; 01E3;LATIN SMALL LETTER AE WITH MACRON;Ll;0;L;00E6 0304;;;;N;LATIN SMALL LETTER A E MACRON;;01E2;;01E2 01E4;LATIN CAPITAL LETTER G WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER G BAR;;;01E5; 01E5;LATIN SMALL LETTER G WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER G BAR;;01E4;;01E4 01E6;LATIN CAPITAL LETTER G WITH CARON;Lu;0;L;0047 030C;;;;N;LATIN CAPITAL LETTER G HACEK;;;01E7; 01E7;LATIN SMALL LETTER G WITH CARON;Ll;0;L;0067 030C;;;;N;LATIN SMALL LETTER G HACEK;;01E6;;01E6 01E8;LATIN CAPITAL LETTER K WITH CARON;Lu;0;L;004B 030C;;;;N;LATIN CAPITAL LETTER K HACEK;;;01E9; 01E9;LATIN SMALL LETTER K WITH CARON;Ll;0;L;006B 030C;;;;N;LATIN SMALL LETTER K HACEK;;01E8;;01E8 01EA;LATIN CAPITAL LETTER O WITH OGONEK;Lu;0;L;004F 0328;;;;N;LATIN CAPITAL LETTER O OGONEK;;;01EB; 01EB;LATIN SMALL LETTER O WITH OGONEK;Ll;0;L;006F 0328;;;;N;LATIN SMALL LETTER O OGONEK;;01EA;;01EA 01EC;LATIN CAPITAL LETTER O WITH OGONEK AND MACRON;Lu;0;L;01EA 0304;;;;N;LATIN CAPITAL LETTER O OGONEK MACRON;;;01ED; 01ED;LATIN SMALL LETTER O WITH OGONEK AND MACRON;Ll;0;L;01EB 0304;;;;N;LATIN SMALL LETTER O OGONEK MACRON;;01EC;;01EC 01EE;LATIN CAPITAL LETTER EZH WITH CARON;Lu;0;L;01B7 030C;;;;N;LATIN CAPITAL LETTER YOGH HACEK;;;01EF; 01EF;LATIN SMALL LETTER EZH WITH CARON;Ll;0;L;0292 030C;;;;N;LATIN SMALL LETTER YOGH HACEK;;01EE;;01EE 01F0;LATIN SMALL LETTER J WITH CARON;Ll;0;L;006A 030C;;;;N;LATIN SMALL LETTER J HACEK;;;; <NO MAPPING> 01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2 DZ (uppercase) 01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2 DZ (titlecase) 01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2 DZ (lowercase) 01F4;LATIN CAPITAL LETTER G WITH ACUTE;Lu;0;L;0047 0301;;;;N;;;;01F5; [3] 01F5;LATIN SMALL LETTER G WITH ACUTE;Ll;0;L;0067 0301;;;;N;;;01F4;;01F4 [3] -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0. [3] Already included in the table as a pair mapping.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: 03F3;GREEK LETTER YOT;Ll;0;L;;;;;N;;;037F;;037F 037F;GREEK CAPITAL LETTER YOT;Lu;0;L;;;;;N;;;;03F3; -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: 0528;CYRILLIC CAPITAL LETTER EN WITH LEFT HOOK;Lu;0;L;;;;;N;;;;0529; 0529;CYRILLIC SMALL LETTER EN WITH LEFT HOOK;Ll;0;L;;;;;N;;;0528;;0528 052A;CYRILLIC CAPITAL LETTER DZZHE;Lu;0;L;;;;;N;;;;052B; 052B;CYRILLIC SMALL LETTER DZZHE;Ll;0;L;;;;;N;;;052A;;052A 052C;CYRILLIC CAPITAL LETTER DCHE;Lu;0;L;;;;;N;;;;052D; 052D;CYRILLIC SMALL LETTER DCHE;Ll;0;L;;;;;N;;;052C;;052C 052E;CYRILLIC CAPITAL LETTER EL WITH DESCENDER;Lu;0;L;;;;;N;;;;052F; 052F;CYRILLIC SMALL LETTER EL WITH DESCENDER;Ll;0;L;;;;;N;;;052E;;052E A698;CYRILLIC CAPITAL LETTER DOUBLE O;Lu;0;L;;;;;N;;;;A699; A699;CYRILLIC SMALL LETTER DOUBLE O;Ll;0;L;;;;;N;;;A698;;A698 A69A;CYRILLIC CAPITAL LETTER CROSSED O;Lu;0;L;;;;;N;;;;A69B; A69B;CYRILLIC SMALL LETTER CROSSED O;Ll;0;L;;;;;N;;;A69A;;A69A -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: Uppercase: 13A0-13EF Lowercase: AB70-ABBF Extra letters in case-mapped pairs: 13F0-13FD -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.

…code flag.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]: A796;LATIN CAPITAL LETTER B WITH FLOURISH;Lu;0;L;;;;;N;;;;A797; A797;LATIN SMALL LETTER B WITH FLOURISH;Ll;0;L;;;;;N;;;A796;;A796 A798;LATIN CAPITAL LETTER F WITH STROKE;Lu;0;L;;;;;N;;;;A799; A799;LATIN SMALL LETTER F WITH STROKE;Ll;0;L;;;;;N;;;A798;;A798 A79A;LATIN CAPITAL LETTER VOLAPUK AE;Lu;0;L;;;;;N;;;;A79B; A79B;LATIN SMALL LETTER VOLAPUK AE;Ll;0;L;;;;;N;;;A79A;;A79A A79C;LATIN CAPITAL LETTER VOLAPUK OE;Lu;0;L;;;;;N;;;;A79D; A79D;LATIN SMALL LETTER VOLAPUK OE;Ll;0;L;;;;;N;;;A79C;;A79C A79E;LATIN CAPITAL LETTER VOLAPUK UE;Lu;0;L;;;;;N;;;;A79F; A79F;LATIN SMALL LETTER VOLAPUK UE;Ll;0;L;;;;;N;;;A79E;;A79E A7AA;LATIN CAPITAL LETTER H WITH HOOK;Lu;0;L;;;;;N;;;;0266; A7AB;LATIN CAPITAL LETTER REVERSED OPEN E;Lu;0;L;;;;;N;;;;025C; A7AC;LATIN CAPITAL LETTER SCRIPT G;Lu;0;L;;;;;N;;;;0261; A7AD;LATIN CAPITAL LETTER L WITH BELT;Lu;0;L;;;;;N;;;;026C; A7B0;LATIN CAPITAL LETTER TURNED K;Lu;0;L;;;;;N;;;;029E; A7B1;LATIN CAPITAL LETTER TURNED T;Lu;0;L;;;;;N;;;;0287; A7B2;LATIN CAPITAL LETTER J WITH CROSSED-TAIL;Lu;0;L;;;;;N;;;;029D; A7B3;LATIN CAPITAL LETTER CHI;Lu;0;L;;;;;N;;;;AB53; A7B4;LATIN CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;A7B5; A7B5;LATIN SMALL LETTER BETA;Ll;0;L;;;;;N;;;A7B4;;A7B4 A7B6;LATIN CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;A7B7; A7B7;LATIN SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;A7B6;;A7B6 AB53;LATIN SMALL LETTER CHI;Ll;0;L;;;;;N;;;A7B3;;A7B3 -- [1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later. [2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.

… when not using unicode flag.

dilijev · 2017-01-12T07:32:59Z

/cc @bterlson @mathiasbynens - Care to take a look?

Also, any chance you could point me to where in the spec it mentions which case-insensitive matches should not be done without /u flag? For this PR I focused on compat with v8 for those codepoints.

mathiasbynens · 2017-01-12T11:12:49Z

Also, any chance you could point me to where in the spec it mentions which case-insensitive matches should not be done without /u flag?

See step 2 of https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch (“If Unicode is true, then … else …”).

I’ll take a closer look at this patch as soon as possible!

tcare · 2017-01-12T18:35:04Z

For me it seems like reviewing the tool is just as important as the change in this PR, so I would encourage the other reviewers to do the same. Perhaps we need a separate PR for that?

dilijev · 2017-01-12T19:17:41Z

@tcare I can submit a PR for that but since I don't plan on merging that in, at least not to release/1.4, maybe a PR against my fork would be better -- just to keep things organized. Also I'd say it's not high-enough quality to merge at this time.

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

dilijev · 2017-01-12T19:24:52Z

@mathiasbynens Thanks! Basically it seems the difference is without /u mapping under toUpperCase, and with /u, relying on CaseFolding.txt. In a lot of cases, the CaseFolding.txt file provides exactly the same mapping as UnicodeData.txt, but I guess there might be some differences versus what is implemented in toUpperCase?

In the case of our table, we use the MappingSource::UnicodeData line to mark when we do the case mapping irrespective of /u flag (the simple toUpperCase case mappings), and MappingSource::CaseFolding to handle the cases where the CaseFolding mapping provided another alternative that should be considered equivalent.

Basically under /iu we use CaseFolding entries, then fall back on UnicodeData entries if an entry was not found for the code point in question marked as MappingSource::CaseFolding. Under /i we only use the MappingSource::UnicodeData entries.

I'm still surprised about the Cherokee letters, because the Cherokee letters are now split into uppercase and lowercase ranges since Unicode 8.0. It seems like the lowercase/uppercase pairs should still match considering there are a lot of multilingual plane code points with upper/lower pairs for which there is a match without /u.

dilijev · 2017-01-12T19:39:19Z

I actually generated much larger test files which test all of the equivalences captured in this table. I'll add those tests to this PR.

dilijev · 2017-01-13T01:29:02Z

I might look into reducing the size of these test files to save space. Minification!

@curtisman Thinks that the tool which generates the table, once reviewed, could probably be checked in to ChakraCore in master (under a brand-new tools directory) after it is cleaned up a bit.

dilijev · 2017-01-13T02:01:42Z

Didn't save many lines by minifying the tests, but saved nearly 100 KB on final code size. The extra diff in the original test file will still bloat the repo somewhat. Oh well.

… done with /u. Add generated tests. The generated test cases exhaustively test all case-insensitive single-character matches under Unicode 8.0. Could probably have written these tests as a mirror of the table from the source and some logic from the table generator, but this is more straightforward. Includes tests for which cases should not case-fold without /u flag. (Focused on v8 compat.) Minified the function names, and removed spaces and semicolons to save file size at a slight readability cost, but since they're not very readable without an encyclopediac knowledge of unicode code points, readability seems like a non-goal.

tcare · 2017-01-13T21:44:54Z

Looks good, I also had a quick look at the tool and a discussion with Doug offline. We have v8 compat achieved with the tests on this change so I am happy.

dilijev · 2017-01-13T22:00:25Z

Thanks @tcare and @bterlson for the offline discussion and for your help as I worked on this fix!

dilijev · 2017-01-13T22:02:22Z

@mathiasbynens I'm planning to merge this today since I'm going on vacation for 2 weeks and don't want to let this sit here without additional internal coverage. However, I would still greatly appreciate your feedback!

boingoing · 2017-01-13T23:21:54Z

LGTM

dilijev · 2017-01-13T23:43:08Z

Thanks @Cellule and @boingoing for your help reviewing!

@tcare

…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Merge pull request #2356 from dilijev:unicase Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item. The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table. ## CaseInsensitive mapping generator tool PR: dilijev#3 Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR. # Overview of Changes I have staged the changes to hopefully make this easier to review. Here's an overview. NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable. First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order. 3d0f37f Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly. abb5d91 4894d24 25049de Added new codepoints: f197902 - GREEK LETTER YOT af2d083 - Cyrillic cba5439 - Cherokee 6c25a51 - Latin extensions Other tests and fixes: cb736ab - Add test cases from #517 to ensure those issues are fixed. fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag. dc3e750 - Case-insensitive matching for Cherokee only with /u. [1] d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests. [1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55). # Test Coverage * **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures` * Internal and slow tests pass. * Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass. # Reviewers @tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.

@tcare

… hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Merge pull request #2356 from dilijev:unicase Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item. The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table. ## CaseInsensitive mapping generator tool PR: dilijev#3 Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR. # Overview of Changes I have staged the changes to hopefully make this easier to review. Here's an overview. NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable. First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order. 3d0f37f Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly. abb5d91 4894d24 25049de Added new codepoints: f197902 - GREEK LETTER YOT af2d083 - Cyrillic cba5439 - Cherokee 6c25a51 - Latin extensions Other tests and fixes: cb736ab - Add test cases from #517 to ensure those issues are fixed. fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag. dc3e750 - Case-insensitive matching for Cherokee only with /u. [1] d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests. [1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55). # Test Coverage * **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures` * Internal and slow tests pass. * Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass. # Reviewers @tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.

mathiasbynens · 2017-01-14T09:17:51Z

Belated LGTM, although I skipped the generated files in favor of the code that generates them, i.e. dilijev#3. 👍 Glad to see this merged!

msftclas added the cla-not-required label Jan 11, 2017

dilijev added 11 commits January 11, 2017 14:11

Normalize case and sort order of CaseInsensitive table.

3d0f37f

Test case format update.

e27454f

Add note: tests should pass even when the regex does not have the Uni…

98b352c

…code flag.

Add provided test case from chakra-core#517.

cb736ab

dilijev force-pushed the unicase branch from ba959c7 to cb736ab Compare January 12, 2017 04:01

dilijev added 4 commits January 11, 2017 20:27

Fix test.

bf1ece5

Reduced set of tests and fixed table to enable 0x0345 set equivalence…

fbfb953

… when not using unicode flag.

Only case-fold Cherokee under unicode flag (matches v8's behavior).

dc3e750

Remove self-checks from tests.

16e8307

dilijev changed the title ~~Normalize case and sort order of CaseInsensitive table.~~ Update CaseInsensitive table from hybridized Unicode 6.3-8.0 to full Unicode 8.0. Fixes #517 Jan 12, 2017

dilijev mentioned this pull request Jan 12, 2017

Update CaseInsensitive table to Unicode 9.0 #2367

Open

dilijev requested review from curtisman, ianwjhalliday and tcare January 12, 2017 07:31

dilijev requested a review from boingoing January 12, 2017 07:33

dilijev changed the title ~~Update CaseInsensitive table from hybridized Unicode 6.3-8.0 to full Unicode 8.0. Fixes #517~~ Update CaseInsensitive table from hybrid of Unicode 6.3 and later to full Unicode 8.0 support. Fixes #517 Jan 12, 2017

dilijev changed the title ~~Update CaseInsensitive table from hybrid of Unicode 6.3 and later to full Unicode 8.0 support. Fixes #517~~ Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Jan 12, 2017

dilijev mentioned this pull request Jan 12, 2017

CaseInsensitive mapping generator tool dilijev/ChakraCore#3

Open

Updated slow test baseline for improved case equivalence matching.

64f2bb1

dilijev force-pushed the unicase branch from fb34d24 to 992f572 Compare January 13, 2017 01:56

dilijev force-pushed the unicase branch from 992f572 to d96eed5 Compare January 13, 2017 05:37

Cellule approved these changes Jan 13, 2017

View reviewed changes

chakrabot merged commit d96eed5 into chakra-core:release/1.4 Jan 13, 2017

dilijev deleted the unicase branch January 14, 2017 00:12

dilijev mentioned this pull request Apr 6, 2017

[Question] Format for Unicode property escape test suite? tc39/test262#950

Closed

dilijev mentioned this pull request Nov 7, 2017

Code quality: in CaseInsensitive.cpp, add braces around entries and remove defunct gawk scripts #4152

Merged

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 #2356

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 #2356

Uh oh!

Conversation

dilijev commented Jan 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CaseInsensitive mapping generator tool

Overview of Changes

Test Coverage

Reviewers

Uh oh!

kunalspathak commented Jan 11, 2017

Uh oh!

dilijev commented Jan 11, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dilijev commented Jan 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mathiasbynens commented Jan 12, 2017

Uh oh!

tcare commented Jan 12, 2017

Uh oh!

dilijev commented Jan 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dilijev commented Jan 12, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dilijev commented Jan 12, 2017

Uh oh!

dilijev commented Jan 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dilijev commented Jan 13, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tcare commented Jan 13, 2017

Uh oh!

dilijev commented Jan 13, 2017

Uh oh!

dilijev commented Jan 13, 2017

Uh oh!

boingoing commented Jan 13, 2017

Uh oh!

dilijev commented Jan 13, 2017

Uh oh!

mathiasbynens commented Jan 14, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

dilijev commented Jan 11, 2017 •

edited

Loading

dilijev commented Jan 11, 2017 •

edited

Loading

dilijev commented Jan 12, 2017 •

edited

Loading

dilijev commented Jan 12, 2017 •

edited

Loading

dilijev commented Jan 12, 2017 •

edited

Loading

dilijev commented Jan 13, 2017 •

edited

Loading

dilijev commented Jan 13, 2017 •

edited

Loading