Version 1.01
TAN-regex is an XSLT library that extends regular expressions used by XPath functions matches(), replace(), tokenize(), and analyze-string(). The parallel TAN-regex functions, rgx:matches(), rgx:replace(), rgx:tokenize(), and rgx:analyze-string(), behave exactly like the standard XPath functions, but permit the escape character \u{}, which takes four types of constructions.
- hexadecimal codepoints, e.g.,
\u{3f-4a 1faa}. - Unicode name words, e.g.,
\u{.omega!greek}(any Unicode character whose name includes the word "OMEGA" but not the word "GREEK"). - Unicode composites, e.g.,
\u{+b}(any Unicode character that can be decomposed to a "b", e.g., bᵇḃḅḇ. - Unicode simple decompositions, e.g.,
\u{-ǡḃčď}(converts the character class to 'abcd').
If a particular version of Unicode is desired, use the $flags parameter, e.g., rgx:matches($input-text, '\u{.bottle}', '11.0').
A construction may take multiple items, space delimited, e.g., \u{+* .pizza} (any character with a plus as a component and any character with "pizza" in the name).
Functions are in the TAN namespace, tan:textalign.net,2015:ns. The prefix rgx is recommended.
Other useful functions:
rgx:string-to-components(). Takes each character in an input string and returns a concatenation of its decomposed components.rgx:string-to-composites(). Takes each character in an input string and returns a concatenation of characters that can decompose to that character.rgx:string-base(). Changes in an input string any characters that can decompose to a single base character.- Key
get-chars-by-name, e.g.,key('get-chars-by-name', ('parenthesis'), $default-ucd-names-db). Returns a tree fragment with Unicode characters with matching words in their names.
See the subdirectory tests for examples and
the XSLT function library for all the functions, with documentation.
TAN-regex has been developed in service to the Text Alignment Network (http://textalign.net), but can be used independent of TAN. The functions are encapsulated, so can be incorporated by any XSLT stylesheet via <include> or <import>.
For more on TAN-regex see Joel Kalvesmaki, “A New \u: Extending XPath Regular Expressions for Unicode.” Proceedings of Balisage: The Markup Conference 2020. Balisage Series on Markup Technologies, vol. 25 (2020). https://doi.org/10.4242/BalisageVol25.Kalvesmaki01.
I thank the following individuals for suggestions that significantly improved TAN-regex: C. M. Sperberg-McQueen, David Birnbaum.
1.01: 2021-07-06: set visibility on all functions; demoted numeric functions (better supported in the TAN function library).