Commit 80174a0

Merge pull request #45 from namehash/Carbon225/fix-emoji-matching
update spec to 1.9.4, fix emoji matching, empty name
2 parents 2c33b2d + a41e7db commit 80174a0

9 files changed: +74 -77 lines changed

README.md

Lines changed: 16 additions & 14 deletions
@@ -21,13 +21,13 @@
 * **name** - a series of any number of labels (including 0) separated by label separators, e.g. `abc.eth`.

 **Names**
-* **normalized name** - a name that is in normalized form according to the ENS Normalization Standard. This means `name == ens_normalize(name)`. A normalized name always contains at least 1 label. All labels in a normalized name always contain a sequence of at least 1 character.
+* **normalized name** - a name that is in normalized form according to the ENS Normalization Standard. This means `name == ens_normalize(name)`. A normalized name contains 0 or more labels. All labels in a normalized name always contain a sequence of at least 1 valid character. An empty string contains 0 labels and is a normalized name.
 * **normalizable name** - a name that is normalized or that can be converted into a normalized name using `ens_normalize`.
 * **beautiful name** - a name that is normalizable and is equal to itself when using `ens_beautify`. This means `name == ens_beautify(name)`. For all normalizable names `ens_normalize(ens_beautify(name)) == ens_normalize(name)`.
 * **disallowed name** - a name that is not normalizable. This means `ens_normalize(name)` raises a `DisallowedSequence`.
 * **curable name** - a name that is normalizable, or a name in the subset of disallowed names that can still be converted into a normalized name using `ens_cure`.
-* **empty name** - a name that is the empty string. An empty name is disallowed and not curable.
-* **namehash ready name** - a name that is ready for for use with the ENS `namehash` function. Only normalized and empty names are namehash ready. Empty names represent the ENS namespace root for use with the ENS `namehash` function. Using the ENS `namehash` function on any name that is not namehash ready will return a node that is unreachable by ENS client applications that use a proper implementation of `ens_normalize`.
+* **empty name** - a name that is the empty string. An empty string is a name with 0 labels. It is a *normalized name*.
+* **namehash ready name** - a name that is ready for for use with the ENS `namehash` function. Only normalized names are namehash ready. Empty names represent the ENS namespace root for use with the ENS `namehash` function. Using the ENS `namehash` function on any name that is not namehash ready will return a node that is unreachable by ENS client applications that use a proper implementation of `ens_normalize`.

 **Sequences**
 * **unnormalized sequence** - a sequence from a name that is not in normalized form according to the ENS Normalization Standard.
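
The label-count wording in the updated glossary can be illustrated with a small standalone sketch. `split_labels` is a hypothetical helper written for this note, not a function of `ens_normalize`; it only assumes that `.` is the label separator, as in the `abc.eth` example.

```python
from typing import List


def split_labels(name: str) -> List[str]:
    """Split a name into labels (hypothetical helper, not part of ens_normalize)."""
    # str.split('.') would return [''] for the empty string, i.e. one empty
    # label, so the empty name is handled explicitly: it has 0 labels.
    return name.split('.') if name else []


print(split_labels(''))         # []
print(split_labels('abc.eth'))  # ['abc', 'eth']
print(split_labels('a..b'))     # ['a', '', 'b'] - contains an empty label
```

Under this reading, the empty string satisfies "all labels contain at least 1 valid character" vacuously, which is consistent with the glossary now treating it as a normalized name.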
@@ -119,14 +119,17 @@ ens_cure('Ni‍ck?.ETH')
 # 'nick.eth'
 # ZWJ and '?' are removed, no error is raised

-# note: might still raise DisallowedSequence for certain names, which cannot be cured, e.g.
+# note: might remove all characters from the input, which would result in an empty name
 ens_cure('?')
-# DisallowedSequence: No valid characters in name
-# reason: '?' would have to be removed which would result in an empty name
+# '' (empty string)
+# reason: '?' is disallowed and no replacement can be suggested

-ens_cure('0χх0.eth')
+# note: might still raise DisallowedSequence for certain names, which cannot be cured, e.g.
+ens_cure('0х0.eth')
 # DisallowedSequence: Contains visually confusing characters from Cyrillic and Latin scripts
-# reason: it is not clear which character should be removed ('χ' or 'х')
+# reason: The "х" is actually a Cyrillic character that is visually confusing with the Latin "x".
+# However, the "0"s are standard Latin digits and it is not clear which characters should be removed.
+# They conflict with each other because it is not known if the user intended to use Cyrillic or Latin.
 ```

 Get a beautiful name that is optimized for display:
@@ -275,12 +278,11 @@ Curable errors contain additional information about the disallowed sequence and

 Disallowed name errors are not considered curable because it may be challenging to suggest a specific normalization suggestion that might resolve the problem.

-| `DisallowedSequenceType` | General info |
-| ------------------------- | ------------ |
-| `EMPTY_NAME` | No valid characters in name |
-| `NSM_REPEATED` | Contains a repeated non-spacing mark |
-| `NSM_TOO_MANY` | Contains too many consecutive non-spacing marks |
-| `CONF_WHOLE` | Contains visually confusing characters from {script1} and {script2} scripts |
+| `DisallowedSequenceType` | General info | Explanation |
+| ------------------------- | ------------ | ------------------------ |
+| `NSM_REPEATED` | Contains a repeated non-spacing mark | Non-spacing marks can be encoded as one codepoint with the preceding character, which makes it difficult to suggest a normalization suggestion |
+| `NSM_TOO_MANY` | Contains too many consecutive non-spacing marks | Non-spacing marks can be encoded as one codepoint with the preceding character, which makes it difficult to suggest a normalization suggestion |
+| `CONF_WHOLE` | Contains visually confusing characters from {script1} and {script2} scripts | Both characters are equally likely to be the correct character to use and a normalization suggestion cannot be provided |

 ## Development
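
The Explanation column in the table above can be made concrete with the standard-library `unicodedata` module. The sketch below is not taken from `ens_normalize`, and the sample characters are chosen purely for illustration. It shows why a non-spacing mark is hard to point at (NFC may fold it into the preceding character) and why the Cyrillic "х" from the README example is confusable with the Latin "x".

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (a non-spacing mark)
decomposed = 'e\u0301'
# NFC composes the pair into the single codepoint U+00E9
composed = unicodedata.normalize('NFC', decomposed)

print(unicodedata.category('\u0301'))  # Mn (non-spacing mark)
print(len(decomposed), len(composed))  # 2 1

# Two characters that render almost identically but belong to different scripts
cyrillic_ha = '\u0445'
latin_x = 'x'
print(unicodedata.name(cyrillic_ha))  # CYRILLIC SMALL LETTER HA
print(unicodedata.name(latin_x))      # LATIN SMALL LETTER X
print(cyrillic_ha == latin_x)         # False
```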

ens_normalize/normalization.py

Lines changed: 19 additions & 41 deletions
@@ -58,10 +58,6 @@ class DisallowedSequenceType(DisallowedSequenceTypeBase):
     See README: Glossary -> Sequences.
     """

-    # GENERIC ----------------
-
-    EMPTY_NAME = "No valid characters in name"
-
     # NSM --------------------

     NSM_REPEATED = "Contains a repeated non-spacing mark"
@@ -322,44 +318,18 @@ def filter_fe0f(text: str) -> str:
     return text.replace('\uFE0F', '')


-def add_all_fe0f(emojis: List[str]):
-    """
-    Find all emoji sequence prefixes that can be followed by FE0F.
-    Then, append FE0F to all prefixes that can but do not have it already.
-    This emulates adraffy's trie building algorithm, which does not add FE0F nodes,
-    but sets a "can be followed by FE0F" flag on the previous node.
-    """
-    cps_with_fe0f = set()
-    for cps in emojis:
-        for i in range(1, len(cps)):
-            if cps[i] == '\uFE0F':
-                # remember the entire prefix to simulate trie behavior
-                cps_with_fe0f.add(cps[:i])
-
-    emojis_out = []
-
-    for cps_in in emojis:
-        cps_out = ''
-        # for all prefixes
-        for i in range(len(cps_in)):
-            cps_out += cps_in[i]
-            # check if the prefix can be followed by FE0F
-            if cps_in[:i+1] in cps_with_fe0f and (i == len(cps_in) - 1 or cps_in[i + 1] != '\uFE0F'):
-                cps_out += '\uFE0F'
-        emojis_out.append(cps_out)
-
-    return emojis_out
-
-
 def create_emoji_regex_pattern(emojis: List[str]) -> str:
-    # add all optional fe0f so that we can match emojis with or without it
-    emojis = add_all_fe0f(emojis)
     fe0f = re.escape('\uFE0F')
     def make_emoji(emoji: str) -> str:
         # make FE0F optional
         return re.escape(emoji).replace(fe0f, f'{fe0f}?')
     # sort to match the longest first
-    return '|'.join(make_emoji(emoji) for emoji in sorted(emojis, key=len, reverse=True))
+    def order(emoji: str) -> int:
+        # emojis with FE0F need to be pushed back because the FE0F would trap the regex matching
+        # re.search(r'AF?|AB', '_AB_')
+        # >>> <re.Match object; span=(1, 2), match='A'>
+        return len(filter_fe0f(emoji))
+    return '|'.join(make_emoji(emoji) for emoji in sorted(emojis, key=order, reverse=True))


 def create_emoji_fe0f_lookup(emojis: List[str]) -> Dict[str, str]:
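
The comment in `order` relies on the fact that Python's `re` alternation takes the first branch that matches, not the longest one. A minimal standalone sketch of the effect follows, using the placeholder letters from that comment instead of real emoji. `build_pattern` is a hypothetical stand-in for `create_emoji_regex_pattern` that takes the sort key as a parameter so both orderings can be compared; the two-element input list is an assumption chosen to trigger the problem.

```python
import re
from typing import Callable, List

FE0F = '\uFE0F'


def filter_fe0f(text: str) -> str:
    return text.replace(FE0F, '')


def build_pattern(emojis: List[str], key: Callable[[str], int]) -> str:
    """Hypothetical variant of create_emoji_regex_pattern with a pluggable sort key."""
    fe0f = re.escape(FE0F)

    def make_emoji(emoji: str) -> str:
        # make FE0F optional
        return re.escape(emoji).replace(fe0f, f'{fe0f}?')

    return '|'.join(make_emoji(e) for e in sorted(emojis, key=key, reverse=True))


# Both entries have len() == 2, so sorting by raw length keeps 'A' + FE0F first.
emojis = ['A' + FE0F, 'AB']

by_len = build_pattern(emojis, key=len)
by_order = build_pattern(emojis, key=lambda e: len(filter_fe0f(e)))

# With the raw-length ordering, the optional FE0F branch matches the bare 'A'
# and the longer 'AB' branch is never tried.
print(re.search(by_len, '_AB_').group())    # A
print(re.search(by_order, '_AB_').group())  # AB
```

Sorting by the length with FE0F removed places `AB` ahead of the `A` + optional-FE0F branch, so the longer sequence is tried first.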
@@ -549,9 +519,15 @@ def normalize_tokens(tokens: List[Token]) -> List[Token]:
     return collapse_valid_tokens(tokens)


-def post_check_empty(name: str) -> Optional[Union[DisallowedSequence, CurableSequence]]:
+def post_check_empty(name: str, input: str) -> Optional[CurableSequence]:
     if len(name) == 0:
-        return DisallowedSequence(DisallowedSequenceType.EMPTY_NAME)
+        # fully ignorable name
+        return CurableSequence(
+            CurableSequenceType.EMPTY_LABEL,
+            index=0,
+            sequence=input,
+            suggested='',
+        )
     if name[0] == '.':
         return CurableSequence(
             CurableSequenceType.EMPTY_LABEL,
@@ -786,9 +762,11 @@ def post_check_whole(group, cps: Iterable[int]) -> Optional[DisallowedSequence]:
     )


-def post_check(name: str, label_is_greek: List[bool]) -> Optional[Union[DisallowedSequence, CurableSequence]]:
+def post_check(name: str, label_is_greek: List[bool], input: str) -> Optional[Union[DisallowedSequence, CurableSequence]]:
     # name has emojis replaced with a single FE0F
-    e = post_check_empty(name)
+    if len(input) == 0:
+        return None
+    e = post_check_empty(name, input)
     if e is not None:
         return e
     label_offset = 0
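
The combined effect of the two hunks above on empty and fully ignorable input can be summarized with a simplified, self-contained sketch. `check_empty` and the `Cure` tuple are hypothetical stand-ins for `post_check`/`post_check_empty` and `CurableSequence`; the real functions also handle leading, trailing and repeated label separators, which are omitted here.

```python
from typing import NamedTuple, Optional


class Cure(NamedTuple):
    """Hypothetical stand-in for CurableSequence."""
    code: str
    index: int
    sequence: str
    suggested: str


def check_empty(normalized: str, original: str) -> Optional[Cure]:
    # An empty input is the empty name, which is now valid: no error.
    if len(original) == 0:
        return None
    # A non-empty input that normalizes to nothing is "fully ignorable":
    # report the whole input as a curable EMPTY_LABEL with '' as the cure.
    if len(normalized) == 0:
        return Cure('EMPTY_LABEL', 0, original, '')
    return None


print(check_empty('', ''))              # None
print(check_empty('', '\ufe0f\ufe0f'))  # Cure(code='EMPTY_LABEL', index=0, ...)
print(check_empty('abc', 'abc'))        # None
```

This mirrors what `test_ignorable_name` in this commit asserts for `ens_process('')` and `ens_process('\ufe0f\ufe0f')`.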
@@ -1005,7 +983,7 @@ def ens_process(input: str,
     # true for each label that is greek
     # will be set by post_check()
     label_is_greek = []
-    error = post_check(emojis_as_fe0f, label_is_greek)
+    error = post_check(emojis_as_fe0f, label_is_greek, input)
     if isinstance(error, CurableSequence):  # or NormalizableSequence because of inheritance
         offset_err_start(error, tokens)

ens_normalize/spec.pickle

-7.79 KB
Binary file not shown.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "ens-normalize"
-version = "3.0.3"
+version = "3.0.4"
 description = "Ethereum Name Service (ENS) Name Normalizer"
 license = "MIT"
 authors = ["Jakub Karbowski <jakub@namehash.io>"]

tests/ens-normalize-tests.json

Lines changed: 1 addition & 1 deletion
Large diffs are not rendered by default.

tests/test_normalization.py

Lines changed: 28 additions & 11 deletions
@@ -111,8 +111,9 @@ def test_ens_tokenize_full():
     # --
     ('aa--a', CurableSequenceType.HYPHEN, 2, '--', ''),
     # empty
-    ("", DisallowedSequenceType.EMPTY_NAME, None, None, None),
     ("a..b", CurableSequenceType.EMPTY_LABEL, 1, '..', '.'),
+    (".ab", CurableSequenceType.EMPTY_LABEL, 0, '.', ''),
+    ("ab.", CurableSequenceType.EMPTY_LABEL, 2, '.', ''),

     # combining mark at the beginning
     ('\u0327a', CurableSequenceType.CM_START, 0, '\u0327', ''),
@@ -297,7 +298,7 @@ def test_ens_is_normalized():
     assert is_ens_normalized('a')
     assert not is_ens_normalized('a_b')
     assert not is_ens_normalized('Abc')
-    assert not is_ens_normalized('')
+    assert is_ens_normalized('')


 def test_normalization_error_object():
@@ -315,18 +316,18 @@ def test_normalization_error_object():
         assert str(e) == e.general_info
         assert repr(e) == 'CurableSequence(code="UNDERSCORE", index=1, sequence="_", suggested="")'
     try:
-        ens_normalize('')
+        ens_normalize('0х0')
     except DisallowedSequence as e:
-        assert e.type == DisallowedSequenceType.EMPTY_NAME
-        assert e.code == DisallowedSequenceType.EMPTY_NAME.code
-        assert e.general_info == DisallowedSequenceType.EMPTY_NAME.general_info
+        assert e.type == DisallowedSequenceType.CONF_WHOLE
+        assert e.code == DisallowedSequenceType.CONF_WHOLE.code
+        assert e.general_info == DisallowedSequenceType.CONF_WHOLE.general_info.format(script1='Cyrillic', script2='Latin')
         assert str(e) == e.general_info
-        assert repr(e) == 'DisallowedSequence(code="EMPTY_NAME")'
+        assert repr(e) == 'DisallowedSequence(code="CONF_WHOLE")'


 def test_error_is_exception():
     with pytest.raises(Exception):
-        ens_normalize('')
+        ens_normalize('0х0')


 def test_str_repr():
@@ -344,9 +345,7 @@ def test_ens_cure():
     with pytest.raises(DisallowedSequence) as e:
         ens_cure('0x.0χ.0х')
     assert e.value.type == DisallowedSequenceType.CONF_WHOLE
-    with pytest.raises(DisallowedSequence) as e:
-        ens_cure('?')
-    assert e.value.type == DisallowedSequenceType.EMPTY_NAME
+    assert ens_cure('?') == ''
     assert ens_cure('abc.?') == 'abc'
     assert ens_cure('abc.?.xyz') == 'abc.xyz'
     assert ens_cure('?.xyz') == 'xyz'
@@ -358,6 +357,9 @@ def test_ens_process_cure():
     assert ret.cured == 'a.b'
     assert [e.code for e in ret.cures] == ['EMPTY_LABEL', 'UNDERSCORE']
     ret = ens_process('', do_cure=True)
+    assert ret.cured == ''
+    assert ret.cures == []
+    ret = ens_process('0х0', do_cure=True)
     assert ret.cured is None
     assert ret.cures is None

@@ -399,3 +401,18 @@ def test_data_creation():
     with open(ens_normalize_module.normalization.SPEC_PICKLE_PATH, 'rb') as f:
         buf2 = f.read()
     assert buf1 == buf2
+
+
+def test_empty_name():
+    assert ens_normalize('') == ''
+    assert ens_beautify('') == ''
+    assert ens_tokenize('') == []
+    assert ens_cure('') == ''
+
+
+def test_ignorable_name():
+    assert ens_process('').error is None
+    e = ens_process('\ufe0f\ufe0f').error
+    assert e.type == CurableSequenceType.EMPTY_LABEL
+    assert e.index == 0
+    assert e.sequence == '\ufe0f\ufe0f'

tools/updater/package-lock.json

Lines changed: 7 additions & 7 deletions
Some generated files are not rendered by default.

tools/updater/package.json

Lines changed: 1 addition & 1 deletion
@@ -4,6 +4,6 @@
     "start": "python update_ens.py"
   },
   "dependencies": {
-    "@adraffy/ens-normalize": "github:adraffy/ens-normalize.js#4873fbe6393e970e186ab57860cc59cbbb1fa162"
+    "@adraffy/ens-normalize": "github:adraffy/ens-normalize.js#0383b198462f594ae639ad7d46dcfbaff9b276fe"
   }
 }

tools/updater/update_ens.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
     '@adraffy',
     'ens-normalize',
     'dist',
-    'index.js',
+    'index.mjs',
 )

