-
Notifications
You must be signed in to change notification settings - Fork 449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UnicodeSetsMode support (v
flag mode, \q
)
#1142
Comments
I confess that I am somewhat interested in supporting this. In particular, I've found that the existing support for class set operations (union, complement, intersection, difference) to be enormously useful when it comes to dealing with Unicode character classes. I'm really glad the I have two general threads of concern here. Firstly, based on the reading I've done (see the end of this comment for roughly what I read), it seems clear to me that the driving reason behind supporting "properties of strings" is Emoji. That is, I'm sure it's no coincidence that every single string property required by UTS#18 S2.7 is related to emoji. Therefore, I want to make sure that we can hit the Emoji use case. I haven't actually read UTS#51 in full detail yet, but there are aspects I don't yet understand. One is whether matching emoji correctly requires something other than a "simple" regex. For example, UTS#51 specifically provides a regex for matching possible emoji. That regex is already supported today by the
But...
So clearly the "possible" part here is rather relevant, since the regex matches ASCII digits. Because I haven't read UTS#51 in full yet, I don't know what extra validity checking steps are needed here. And in general, I would want to know this before we start on adding support for this because I want to make sure that matching emoji is actually feasible to do. That is, if we had properties of strings, how would I'd also like to point out how long it takes to build the regex above. It's already at 1.5ms. I'm sure aspects of that could be further optimized, but I don't see it getting much better. In particular, large Unicode classes add a lot of overhead to regex compilation. Will properties of strings be even worse? If so, how much worse? I feel like big Unicode classes are already pushing the limits of what is acceptable and I worry that big Unicode sets of strings might be even worse. The second general concern I have is implementation. While you astutely point out that the
There's a reason why this restriction exists, because mixing up what it means to be a "character" within a single character class is tricky and would require significant surgery to how character classes in the HIR are represented. Evolving character classes to arbitrary finite sets of strings will thus require significant surgery as well. The place where this really comes up is in the implementation of Unicode character class set operations. You don't need to read it in depth, but the key fact to understand is that it is tightly coupled to the concept of Unicode codepoint ranges. (That is, a My supsicion is that all of this would need to be completely gutted and rethought to account for arbitrary sets of finite strings. The pain doesn't stop there unfortunately. The lowering from an HIR to an NFA (where UTF-8 is made manifest) also exploits the fact that classes are just ranges of characters. This is also where a lot of the time spent compiling large Unicode classes is spent. I'm nearly certain that the existing technique used, Daciuk's algorithm, probably cannot be easily adapted to finite sets of strings? Daciuk's algorithm requires that the elements in the set are sorted lexicographically, but UTS#18 requires that
Logically, yes. But that won't work as an implementation because it would require explicitly representing every codepoint in a class as a distinct element in a Additional reading:
Note: I found it interesting that
|
I think this should be possible to adapt fairly straightforwardly (though I very easily could have overlooked something) by building from reverse lexicographic order instead, which will naturally process longer strings before shorter ones. The simple set approach should probably be sufficient for sets manually authored with |
v
flag mode, \q
)v
flag mode, \q
)
Yes, I wouldn't expect the browser implementations to be of much help. They can use different techniques due to the use of backtracking and interpreter structure. Here, we really have to pound everything down into an FSM. Still, it might be worth looking at them to get a sense of what they do. I don't know for sure that it won't be helpful. But it is usually the case that backtracking and FSM engines use very different approaches. I'm skeptical of Daciuk's algorithm being able to work in reverse, but I can't point out a flaw with it off the top of my head.
If it's just |
I looked into UTS#51 a bit out of curiosity. As far as I can tell, there's no way to exactly match the UTS#51§1.4.6 emoji sets other than direct validity tests against the set inclusion list, potentially accelerated by the possible match EBNF as a scanner. (I haven't looked at UTS#18, but I did scan ECMA262 specifically to find the list of supported binary properties of strings.) Even if we don't support string-sets, I do think recognizing Even in the absence of "string class" set operations, string properties would still be interesting (and less effort) to support — it'd be sufficient for implementing UAX#31's emoji profile, for example. I might throw together a script to build these as regexes to see how regex (and logos and tree-sitter) handle them. A little bit of cleverness can build a more compact match than pure alternation, e.g. it currently (these properties are not stable between Unicode versions) holds that
A final aside: the main problem with the standard possible emoji regex is matching strings that are definitely not emoji. It may be possible to construct a slightly stronger regex which doesn't fully constrain itself to RGI strings without greatly increasing match complexity. Disclaimer: I have not yet attempted to build and compare such a "likely emoji" test against the "possible emoji" test, and UTS#51 does acknowledge the concept as "many times" more complicated, while (by design) still requiring validity tests2. However, I posit that eliminating "obvious" false positives may be sufficiently useful for that cost even without further verification in sufficiently fuzzy applications. (Although it may enable wrongly bypassing said verification, since it works well enough to make the need nonobvious.) Footnotes
|
Nerd sniped myself. Regexes and builder script, getting semi-reasonable regex sizes (except RGI_Emoji_ZWJ_Sequence which I haven't attempted to golf yet) https://gist.github.com/CAD97/9f14e8f691c59cb773d9458793c3cbb2 # this is nushell 0.90.1, working with data from Unicode Emoji, Version 15.1
~> let rgi_emoji = (emoji regex) | values | reverse | each { $"\(?:($in))" } | str join '|'
~> $rgi_emoji | str length
66131
~> regex-cli find match meta -p $rgi_emoji -y ⚽
parse time: 2.497ms
translate time: 3.613ms
build meta time: 8.388ms
search time: 1.3928ms
total matches: 1
0:0:3:⚽
~> regex-cli find match meta -p $rgi_emoji -y ⚕️
parse time: 3.1441ms
translate time: 3.8318ms
build meta time: 10.1181ms
search time: 2.4802ms
total matches: 1
0:3:9:⚕️
~> let rgi_emoji_lite = (emoji regex) | reject RGI_Emoji_ZWJ_Sequence | values | reverse | each { $"\(?:($in))" } | str join '|'
~> $rgi_emoji_lite | str length
36950
~> regex-cli find match meta -p $rgi_emoji_lite -y ⚽
parse time: 1.4788ms
translate time: 2.8648ms
build meta time: 4.0901ms
search time: 782.4µs
total matches: 1
0:0:3:⚽
~> regex-cli find match meta -p $rgi_emoji_lite -y ⚕️
parse time: 1.4439ms
translate time: 2.9059ms
build meta time: 3.9654ms
search time: 970.4µs
total matches: 1
0:3:9:⚕️
~> (emoji regex).Basic_Emoji | str length
2529
~> regex-cli find match meta -p (emoji regex).Basic_Emoji -y ⚽
parse time: 126µs
translate time: 198.2µs
build meta time: 577.9µs
search time: 55.6µs
total matches: 1
0:0:3:⚽
# comparison
~> regex-cli find match meta -p '(?x)\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?(\x{200D}(\p{RI}\p{RI}|\p{Emoji}(\p{EMod}|\x{FE0F}\x{20E3}?|[\x{E0020}-\x{E007E}]+\x{E007F})?))*' -y ⚽
parse time: 81.6µs
translate time: 75.4µs
build meta time: 674.1µs
search time: 69.4µs
total matches: 1
0:0:3:⚽
# apparently my CPU enjoys this task more than yours does 😅 This should at least give an estimate as to processing time for the "handle in the automata" approach. I'm not super hopeful on it being possible to do order-of-magnitude better without handling these string sets at a different layer than this. |
Can you try with 10ms doesn't excite me, but it is perhaps on the boundary of reasonableness for stuff like this. |
~> regex-cli debug thompson -q $rgi_emoji
parse time: 2.7808ms
translate time: 3.811ms
compile nfa time: 3.9529ms
memory: 1367332
states: 56156
pattern len: 1
capture len: 1
has empty?: false
is utf8?: true
is reverse?: false
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
~> regex-cli debug thompson -q $rgi_emoji_lite
parse time: 1.4414ms
translate time: 2.0979ms
compile nfa time: 2.1962ms
memory: 696444
states: 28447
pattern len: 1
capture len: 1
has empty?: false
is utf8?: true
is reverse?: false
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
|
It's a bit of an implementation detail, but I hint at shrinking for NFAs being only applicable for reverse NFAs in the docs. (Pretty much all of the CLI options are designed to be in one-to-one correspondence with the config knobs in the library.) For the forward direction, Daciuk's algorithm is used to build nearly minimal DFAs. It's probably worth looking at the reverse NFAs too, since they will need to be built in most scenarios. They are typically bigger. Otherwise, the sizes for the forward NFAs look like they're about what I would expect. |
Reverse nfa stats: ~> regex-cli debug thompson --captures none -rq $rgi_emoji
parse time: 2.4019ms
translate time: 3.5685ms
compile nfa time: 3.2018ms
memory: 1381396
states: 56914
pattern len: 1
capture len: 0
has empty?: false
is utf8?: true
is reverse?: true
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
~> regex-cli debug thompson --captures none -rq $rgi_emoji --shrink
parse time: 2.5419ms
translate time: 3.9212ms
compile nfa time: 4.3885ms
memory: 1386980
states: 56292
pattern len: 1
capture len: 0
has empty?: false
is utf8?: true
is reverse?: true
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
~> regex-cli debug thompson --captures none -rq $rgi_emoji_lite
parse time: 1.4259ms
translate time: 2.0568ms
compile nfa time: 1.659ms
memory: 710580
states: 29208
pattern len: 1
capture len: 0
has empty?: false
is utf8?: true
is reverse?: true
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
~> regex-cli debug thompson --captures none -rq $rgi_emoji_lite --shrink
parse time: 1.4384ms
translate time: 2.0845ms
compile nfa time: 2.98ms
memory: 716092
states: 28583
pattern len: 1
capture len: 0
has empty?: false
is utf8?: true
is reverse?: true
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅ so roughly the same size as forward, with a bit more overhead for the cases I've encoded like Perhaps worth reiterating, it should be possible to pre-build the Daciuk trie for these, assuming it can handle longest-first semantics. I don't know how difficult it'd be to actually remember the prebuilt trie, to splice a prebuilt NFA subgraph into the meta engine, nor to symbolically thread that set through, but that would cut down regex compilation time down for the "uses predefined sets" case. |
ECMAScript RegExp supports an interesting feature: v-mode enables
\p
to match string properties (e.g.\p{RGI_Emoji_Flag_Sequence}
, roughly\p{Regional_Indicator}{2}
), allows[]
character classes to match finite-length strings, and adds the\q{regexp}
escape for writing "string literals" in character classes (a nested regex expression where the only valid syntax is literals, escaped literals, and|
disjunction).This support is quite interesting because it allows doing set operations on (finite) sets of (finite) strings (e.g. grapheme clusters / emoji), which can enable matching patterns that might otherwise be achieved with lookahead, e.g.
^[\p{RGI_Emoji_Flag_Sequence}--\q{🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷}]$
versus^(?!🇺🇸|🇨🇳|🇷🇺|🇬🇧|🇫🇷)\p{Regional_Indicator}{2}$
(example from MDN). This functionality is required for implementing things such as the UAX#31 Emoji Profile for identifiers, which "requires the use of character sequences, rather than individual code points, in the sets Start and Continue defined by UAX31-D1."I don't think the regex crate particularly needs to be carrying around additional Unicode tables1 for extended
\p
, but\q
seems somewhat reasonably straightforward to support without impacting users who don't use it (modulo some code size / compile time) and makes good chunks of the problem space at least possible to address, if still annoying2. Because\q
is so restricted, it would even be sufficient to not permit alternation within the\q
and require writing multiple\q
instead, e.g. making the prior regex^[\p{RGI_Emoji_Flag_Sequence}--[\q{🇺🇸}\q{🇨🇳}\q{🇷🇺}\q{🇬🇧}\q{🇫🇷}]]$
, at no loss of expressiveness, just convenience.Is it possible that the regex crate could someday support
\q
in character classes? I do understand it could be a significant chunk of extra work to allow classes to match longer strings. Although, on the other hand, since sets already need to handle UTF-8's variable 1..=4 byte length codepoint encoding, generalizing that to 1.. bytes might not be completely different. And while that kind of processing could be done, it doesn't need to be for what is, ultimately, a quite niche feature, and is ultimately (after resolving set operations and longest-first ordering) no different from regular top-level alternation.Concrete changes
ast::ClassSetItem
variant for\q
a. consisting of an
Alternation
ofConcat
ofLiteral
; orb. which is a
LiteralString
structured likeLiteral
, and\q{one|two}
is actually two of them, e.g. roughly[LiteralString { span: "\q{one", s: "one" }, LiteralString { span: "|two}", s: "two" }]
.\q{
introduces a string set withmacro_rules!
-ish syntax\q{ $( $($lit:Literal)* )|* }
. Encountering any meta character except a literal escape is an error.\q
inside a negated bracketed class is an error.hir::Class
variant for\q
, i.e. aClassStrings
. It can't be negated, but it can beunion
ed,intersect
ed,difference
d, andsymmetric_difference
d. It's essentially a fancy wrapper aroundHashSet<String>
.\q
ends up beingClassStrings
.If the general idea seems sound, I'm willing to try my hand at implementing
\q
(to be transparent to HIR, thus invisible to regex-automata).References
RegExp.prototype.unicodeSets
\q
documentation)Footnotes
At least not until such a day as icu4x allows it to be plausible for other crates to share the same underlying tables and/or for the binary package to have some control over what tables are built into the executable. Along that line,
\p
would be quite interesting — probably something like afn(&self, &str) -> Option<impl ExactSizeIterator<Item=&str>>
callback, to handle v-mode string sets, with a fast path for CodePointInversionLists — but that's drifting far off topic. ↩Listing out each string in the relevant sets in
|
-delimited lists, oh my. ↩The text was updated successfully, but these errors were encountered: