forked from chakra-core/ChakraCore
-
Notifications
You must be signed in to change notification settings - Fork 1
CaseInsensitive mapping generator tool #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
dilijev
wants to merge
40
commits into
master
Choose a base branch
from
CaseInsensitive
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…put is binary equivalent to before the change.
chakrabot
pushed a commit
to chakra-core/ChakraCore
that referenced
this pull request
Jan 13, 2017
…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Merge pull request #2356 from dilijev:unicase Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item. The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table. ## CaseInsensitive mapping generator tool PR: dilijev#3 Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR. # Overview of Changes I have staged the changes to hopefully make this easier to review. Here's an overview. NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable. First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order. 3d0f37f Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly. abb5d91 4894d24 25049de Added new codepoints: f197902 - GREEK LETTER YOT af2d083 - Cyrillic cba5439 - Cherokee 6c25a51 - Latin extensions Other tests and fixes: cb736ab - Add test cases from #517 to ensure those issues are fixed. fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag. dc3e750 - Case-insensitive matching for Cherokee only with /u. [1] d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests. [1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55). # Test Coverage * **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures` * Internal and slow tests pass. * Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass. # Reviewers @tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
chakrabot
pushed a commit
to chakra-core/ChakraCore
that referenced
this pull request
Jan 14, 2017
… hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Merge pull request #2356 from dilijev:unicase Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517 Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item. The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table. ## CaseInsensitive mapping generator tool PR: dilijev#3 Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR. # Overview of Changes I have staged the changes to hopefully make this easier to review. Here's an overview. NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable. First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order. 3d0f37f Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly. abb5d91 4894d24 25049de Added new codepoints: f197902 - GREEK LETTER YOT af2d083 - Cyrillic cba5439 - Cherokee 6c25a51 - Latin extensions Other tests and fixes: cb736ab - Add test cases from #517 to ensure those issues are fixed. fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag. dc3e750 - Case-insensitive matching for Cherokee only with /u. [1] d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests. [1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55). # Test Coverage * **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures` * Internal and slow tests pass. * Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass. # Reviewers @tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
| return ("0000" + this).slice(-4); // take the last four characters after left-padding | ||
| } | ||
|
|
||
| proto.toCodepoint = function (): number { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: should be toCodePoint for consistency with codePoint elsewhere
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Tool for taking UnicodeData.txt and CaseFolding.txt and generating a case-insensitive mapping table.
I've learned that designation of MappingSource is actually not useful in the expected semantic sense. In the table in CaseInsensitive.cpp, the difference between
MappingSource::UnicodeDataandMappingSource::CaseFoldingis actually that the former will do mappings with /i and not /u, and the latter will only map if the /u is supplied. Therefore, I made the classification of rows in the table manually.It appears there is still a bug with the transitive closure over equivalence classes with misses a couple of cases. I was able to use the existing table and manually correct these issues when relevant.
See also the source code at the HEAD of this branch: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive
Associated with: chakra-core#2356
/cc @tcare