Skip to content

Conversation

@dilijev
Copy link
Contributor

@dilijev dilijev commented Jan 11, 2017

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

CaseInsensitive mapping generator tool

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

Test Coverage

Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.

@kunalspathak
Copy link
Contributor

What was the scenario that you are trying to fix? Is there need to add some test cases?

@dilijev
Copy link
Contributor Author

dilijev commented Jan 11, 2017

@kunalspathak - The initial commit I pushed was just a table normalization change that I just wanted to put up for preview so I could show to others. Coming soon (before merging this) will be additional commits which fix #517 and add tests for the same. (I'll push the additional commits and amend the title and message of this PR at that time.)

Edit: the full PR is now up.

… Unicode 8.0 [1].

Fixed by breaking range 0x01B8, 0x01BD of pairs into the individual mappings, following the usual convention in this table of unrolling pairs rather than using an entry showing a range of pairs.

Correct behavior under Under case-insensitive /i is:

Shows 01B8 == 01B9 under /i.
Shows 01BA != 01BB under /i.
Shows 01BC == 01BD under /i.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1]:

01B8;LATIN CAPITAL LETTER EZH REVERSED;Lu;0;L;;;;;N;LATIN CAPITAL LETTER REVERSED YOGH;;;01B9;
01B9;LATIN SMALL LETTER EZH REVERSED;Ll;0;L;;;;;N;LATIN SMALL LETTER REVERSED YOGH;;01B8;;01B8
01BA;LATIN SMALL LETTER EZH WITH TAIL;Ll;0;L;;;;;N;LATIN SMALL LETTER YOGH WITH TAIL;;;;
01BB;LATIN LETTER TWO WITH STROKE;Lo;0;L;;;;;N;LATIN LETTER TWO BAR;;;;
01BC;LATIN CAPITAL LETTER TONE FIVE;Lu;0;L;;;;;N;;;;01BD;
01BD;LATIN SMALL LETTER TONE FIVE;Ll;0;L;;;;;N;;;01BC;;01BCC

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
…) in case-insensitive equivalence table which caused matches to be missed.

Correct behavior under Under case-insensitive /i is:

01C4 == 01C5 == 01C6 : DZ WITH CARON (uppercase, titlecase, lowercase)
01C7 == 01C8 == 01C9 : LJ            (uppercase, titlecase, lowercase)
01CA == 01CB == 01CC : NJ            (uppercase, titlecase, lowercase)
01F1 == 01F2 == 01F3 : DZ            (uppercase, titlecase, lowercase)

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

01C4;LATIN CAPITAL LETTER DZ WITH CARON;Lu;0;L;<compat> 0044 017D;;;;N;LATIN CAPITAL LETTER D Z HACEK;;;01C6;01C5
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5
01C7;LATIN CAPITAL LETTER LJ;Lu;0;L;<compat> 004C 004A;;;;N;LATIN CAPITAL LETTER L J;;;01C9;01C8
01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;01C8
01C9;LATIN SMALL LETTER LJ;Ll;0;L;<compat> 006C 006A;;;;N;LATIN SMALL LETTER L J;;01C7;;01C8
01CA;LATIN CAPITAL LETTER NJ;Lu;0;L;<compat> 004E 004A;;;;N;LATIN CAPITAL LETTER N J;;;01CC;01CB
01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;01CB
01CC;LATIN SMALL LETTER NJ;Ll;0;L;<compat> 006E 006A;;;;N;LATIN SMALL LETTER N J;;01CA;;01CB

01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2
01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
… added regression tests.

Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

01DE;LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON;Lu;0;L;00C4 0304;;;;N;LATIN CAPITAL LETTER A DIAERESIS MACRON;;;01DF;
01DF;LATIN SMALL LETTER A WITH DIAERESIS AND MACRON;Ll;0;L;00E4 0304;;;;N;LATIN SMALL LETTER A DIAERESIS MACRON;;01DE;;01DE
01E0;LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON;Lu;0;L;0226 0304;;;;N;LATIN CAPITAL LETTER A DOT MACRON;;;01E1;
01E1;LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON;Ll;0;L;0227 0304;;;;N;LATIN SMALL LETTER A DOT MACRON;;01E0;;01E0
01E2;LATIN CAPITAL LETTER AE WITH MACRON;Lu;0;L;00C6 0304;;;;N;LATIN CAPITAL LETTER A E MACRON;;;01E3;
01E3;LATIN SMALL LETTER AE WITH MACRON;Ll;0;L;00E6 0304;;;;N;LATIN SMALL LETTER A E MACRON;;01E2;;01E2
01E4;LATIN CAPITAL LETTER G WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER G BAR;;;01E5;
01E5;LATIN SMALL LETTER G WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER G BAR;;01E4;;01E4
01E6;LATIN CAPITAL LETTER G WITH CARON;Lu;0;L;0047 030C;;;;N;LATIN CAPITAL LETTER G HACEK;;;01E7;
01E7;LATIN SMALL LETTER G WITH CARON;Ll;0;L;0067 030C;;;;N;LATIN SMALL LETTER G HACEK;;01E6;;01E6
01E8;LATIN CAPITAL LETTER K WITH CARON;Lu;0;L;004B 030C;;;;N;LATIN CAPITAL LETTER K HACEK;;;01E9;
01E9;LATIN SMALL LETTER K WITH CARON;Ll;0;L;006B 030C;;;;N;LATIN SMALL LETTER K HACEK;;01E8;;01E8
01EA;LATIN CAPITAL LETTER O WITH OGONEK;Lu;0;L;004F 0328;;;;N;LATIN CAPITAL LETTER O OGONEK;;;01EB;
01EB;LATIN SMALL LETTER O WITH OGONEK;Ll;0;L;006F 0328;;;;N;LATIN SMALL LETTER O OGONEK;;01EA;;01EA
01EC;LATIN CAPITAL LETTER O WITH OGONEK AND MACRON;Lu;0;L;01EA 0304;;;;N;LATIN CAPITAL LETTER O OGONEK MACRON;;;01ED;
01ED;LATIN SMALL LETTER O WITH OGONEK AND MACRON;Ll;0;L;01EB 0304;;;;N;LATIN SMALL LETTER O OGONEK MACRON;;01EC;;01EC
01EE;LATIN CAPITAL LETTER EZH WITH CARON;Lu;0;L;01B7 030C;;;;N;LATIN CAPITAL LETTER YOGH HACEK;;;01EF;
01EF;LATIN SMALL LETTER EZH WITH CARON;Ll;0;L;0292 030C;;;;N;LATIN SMALL LETTER YOGH HACEK;;01EE;;01EE

01F0;LATIN SMALL LETTER J WITH CARON;Ll;0;L;006A 030C;;;;N;LATIN SMALL LETTER J HACEK;;;;               <NO MAPPING>

01F1;LATIN CAPITAL LETTER DZ;Lu;0;L;<compat> 0044 005A;;;;N;;;;01F3;01F2                                DZ (uppercase)
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2         DZ (titlecase)
01F3;LATIN SMALL LETTER DZ;Ll;0;L;<compat> 0064 007A;;;;N;;;01F1;;01F2                                  DZ (lowercase)

01F4;LATIN CAPITAL LETTER G WITH ACUTE;Lu;0;L;0047 0301;;;;N;;;;01F5;                                   [3]
01F5;LATIN SMALL LETTER G WITH ACUTE;Ll;0;L;0067 0301;;;;N;;;01F4;;01F4                                 [3]

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
[3] Already included in the table as a pair mapping.
Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

03F3;GREEK LETTER YOT;Ll;0;L;;;;;N;;;037F;;037F
037F;GREEK CAPITAL LETTER YOT;Lu;0;L;;;;;N;;;;03F3;

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

0528;CYRILLIC CAPITAL LETTER EN WITH LEFT HOOK;Lu;0;L;;;;;N;;;;0529;
0529;CYRILLIC SMALL LETTER EN WITH LEFT HOOK;Ll;0;L;;;;;N;;;0528;;0528
052A;CYRILLIC CAPITAL LETTER DZZHE;Lu;0;L;;;;;N;;;;052B;
052B;CYRILLIC SMALL LETTER DZZHE;Ll;0;L;;;;;N;;;052A;;052A
052C;CYRILLIC CAPITAL LETTER DCHE;Lu;0;L;;;;;N;;;;052D;
052D;CYRILLIC SMALL LETTER DCHE;Ll;0;L;;;;;N;;;052C;;052C
052E;CYRILLIC CAPITAL LETTER EL WITH DESCENDER;Lu;0;L;;;;;N;;;;052F;
052F;CYRILLIC SMALL LETTER EL WITH DESCENDER;Ll;0;L;;;;;N;;;052E;;052E

A698;CYRILLIC CAPITAL LETTER DOUBLE O;Lu;0;L;;;;;N;;;;A699;
A699;CYRILLIC SMALL LETTER DOUBLE O;Ll;0;L;;;;;N;;;A698;;A698
A69A;CYRILLIC CAPITAL LETTER CROSSED O;Lu;0;L;;;;;N;;;;A69B;
A69B;CYRILLIC SMALL LETTER CROSSED O;Ll;0;L;;;;;N;;;A69A;;A69A

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

Uppercase: 13A0-13EF
Lowercase: AB70-ABBF
Extra letters in case-mapped pairs: 13F0-13FD

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
Relevant lines of the UnicodeData.txt for Unicode 8.0 [1][2]:

A796;LATIN CAPITAL LETTER B WITH FLOURISH;Lu;0;L;;;;;N;;;;A797;
A797;LATIN SMALL LETTER B WITH FLOURISH;Ll;0;L;;;;;N;;;A796;;A796
A798;LATIN CAPITAL LETTER F WITH STROKE;Lu;0;L;;;;;N;;;;A799;
A799;LATIN SMALL LETTER F WITH STROKE;Ll;0;L;;;;;N;;;A798;;A798
A79A;LATIN CAPITAL LETTER VOLAPUK AE;Lu;0;L;;;;;N;;;;A79B;
A79B;LATIN SMALL LETTER VOLAPUK AE;Ll;0;L;;;;;N;;;A79A;;A79A
A79C;LATIN CAPITAL LETTER VOLAPUK OE;Lu;0;L;;;;;N;;;;A79D;
A79D;LATIN SMALL LETTER VOLAPUK OE;Ll;0;L;;;;;N;;;A79C;;A79C
A79E;LATIN CAPITAL LETTER VOLAPUK UE;Lu;0;L;;;;;N;;;;A79F;
A79F;LATIN SMALL LETTER VOLAPUK UE;Ll;0;L;;;;;N;;;A79E;;A79E

A7AA;LATIN CAPITAL LETTER H WITH HOOK;Lu;0;L;;;;;N;;;;0266;
A7AB;LATIN CAPITAL LETTER REVERSED OPEN E;Lu;0;L;;;;;N;;;;025C;
A7AC;LATIN CAPITAL LETTER SCRIPT G;Lu;0;L;;;;;N;;;;0261;
A7AD;LATIN CAPITAL LETTER L WITH BELT;Lu;0;L;;;;;N;;;;026C;
A7B0;LATIN CAPITAL LETTER TURNED K;Lu;0;L;;;;;N;;;;029E;
A7B1;LATIN CAPITAL LETTER TURNED T;Lu;0;L;;;;;N;;;;0287;
A7B2;LATIN CAPITAL LETTER J WITH CROSSED-TAIL;Lu;0;L;;;;;N;;;;029D;

A7B3;LATIN CAPITAL LETTER CHI;Lu;0;L;;;;;N;;;;AB53;

A7B4;LATIN CAPITAL LETTER BETA;Lu;0;L;;;;;N;;;;A7B5;
A7B5;LATIN SMALL LETTER BETA;Ll;0;L;;;;;N;;;A7B4;;A7B4
A7B6;LATIN CAPITAL LETTER OMEGA;Lu;0;L;;;;;N;;;;A7B7;
A7B7;LATIN SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;A7B6;;A7B6

AB53;LATIN SMALL LETTER CHI;Ll;0;L;;;;;N;;;A7B3;;A7B3

--

[1] Currently fixing bugs in Unicode 8.0 because the source code claims compliance with Unicode 8.0 at the moment. Will update to Unicode 9.0 later.
[2] These lines in Unicode 8.0 are equivalent to the lines in Unicode 9.0.
@dilijev dilijev changed the title Normalize case and sort order of CaseInsensitive table. Update CaseInsensitive table from hybridized Unicode 6.3-8.0 to full Unicode 8.0. Fixes #517 Jan 12, 2017
@dilijev
Copy link
Contributor Author

dilijev commented Jan 12, 2017

/cc @bterlson @mathiasbynens - Care to take a look?

Also, any chance you could point me to where in the spec it mentions which case-insensitive matches should not be done without /u flag? For this PR I focused on compat with v8 for those codepoints.

@dilijev dilijev requested a review from boingoing January 12, 2017 07:33
@dilijev dilijev changed the title Update CaseInsensitive table from hybridized Unicode 6.3-8.0 to full Unicode 8.0. Fixes #517 Update CaseInsensitive table from hybrid of Unicode 6.3 and later to full Unicode 8.0 support. Fixes #517 Jan 12, 2017
@dilijev dilijev changed the title Update CaseInsensitive table from hybrid of Unicode 6.3 and later to full Unicode 8.0 support. Fixes #517 Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Jan 12, 2017
@mathiasbynens
Copy link
Contributor

Also, any chance you could point me to where in the spec it mentions which case-insensitive matches should not be done without /u flag?

See step 2 of https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch (“If Unicode is true, then … else …”).

I’ll take a closer look at this patch as soon as possible!

@tcare
Copy link
Contributor

tcare commented Jan 12, 2017

For me it seems like reviewing the tool is just as important as the change in this PR, so I would encourage the other reviewers to do the same. Perhaps we need a separate PR for that?

@dilijev
Copy link
Contributor Author

dilijev commented Jan 12, 2017

@tcare I can submit a PR for that but since I don't plan on merging that in, at least not to release/1.4, maybe a PR against my fork would be better -- just to keep things organized. Also I'd say it's not high-enough quality to merge at this time.

PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

@dilijev
Copy link
Contributor Author

dilijev commented Jan 12, 2017

@mathiasbynens Thanks! Basically it seems the difference is without /u mapping under toUpperCase, and with /u, relying on CaseFolding.txt. In a lot of cases, the CaseFolding.txt file provides exactly the same mapping as UnicodeData.txt, but I guess there might be some differences versus what is implemented in toUpperCase?

In the case of our table, we use the MappingSource::UnicodeData line to mark when we do the case mapping irrespective of /u flag (the simple toUpperCase case mappings), and MappingSource::CaseFolding to handle the cases where the CaseFolding mapping provided another alternative that should be considered equivalent.

Basically under /iu we use CaseFolding entries, then fall back on UnicodeData entries if an entry was not found for the code point in question marked as MappingSource::CaseFolding. Under /i we only use the MappingSource::UnicodeData entries.

I'm still surprised about the Cherokee letters, because the Cherokee letters are now split into uppercase and lowercase ranges since Unicode 8.0. It seems like the lowercase/uppercase pairs should still match considering there are a lot of multilingual plane code points with upper/lower pairs for which there is a match without /u.

@dilijev
Copy link
Contributor Author

dilijev commented Jan 12, 2017

I actually generated much larger test files which test all of the equivalences captured in this table. I'll add those tests to this PR.

@dilijev
Copy link
Contributor Author

dilijev commented Jan 13, 2017

I might look into reducing the size of these test files to save space. Minification!

@curtisman Thinks that the tool which generates the table, once reviewed, could probably be checked in to ChakraCore in master (under a brand-new tools directory) after it is cleaned up a bit.

@dilijev
Copy link
Contributor Author

dilijev commented Jan 13, 2017

Didn't save many lines by minifying the tests, but saved nearly 100 KB on final code size. The extra diff in the original test file will still bloat the repo somewhat. Oh well.

… done with /u. Add generated tests.

The generated test cases exhaustively test all case-insensitive single-character matches under Unicode 8.0. Could probably have written these tests as a mirror of the table from the source and some logic from the table generator, but this is more straightforward.

Includes tests for which cases should not case-fold without /u flag. (Focused on v8 compat.)

Minified the function names, and removed spaces and semicolons to save file size at a slight readability cost, but since they're not very readable without an encyclopediac knowledge of unicode code points, readability seems like a non-goal.
@tcare
Copy link
Contributor

tcare commented Jan 13, 2017

Looks good, I also had a quick look at the tool and a discussion with Doug offline. We have v8 compat achieved with the tests on this change so I am happy.

@dilijev
Copy link
Contributor Author

dilijev commented Jan 13, 2017

Thanks @tcare and @bterlson for the offline discussion and for your help as I worked on this fix!

@dilijev
Copy link
Contributor Author

dilijev commented Jan 13, 2017

@mathiasbynens I'm planning to merge this today since I'm going on vacation for 2 weeks and don't want to let this sit here without additional internal coverage. However, I would still greatly appreciate your feedback!

@boingoing
Copy link
Contributor

LGTM

@dilijev
Copy link
Contributor Author

dilijev commented Jan 13, 2017

Thanks @Cellule and @boingoing for your help reviewing!

@chakrabot chakrabot merged commit d96eed5 into chakra-core:release/1.4 Jan 13, 2017
chakrabot pushed a commit that referenced this pull request Jan 13, 2017
…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool
PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
chakrabot pushed a commit that referenced this pull request Jan 14, 2017
… hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool
PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
@dilijev dilijev deleted the unicase branch January 14, 2017 00:12
@mathiasbynens
Copy link
Contributor

Belated LGTM, although I skipped the generated files in favor of the code that generates them, i.e. dilijev#3. 👍 Glad to see this merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants