This is a command line tool for interacting with the regex, regex-automata and regex-syntax crates. It enables one to print debug representations of various values, run searches, generate DFAs and deserialization code and perform various regex development tasks such as generating tests.
Simply use cargo
to install from crates.io.
$ cargo install regex-cli
The regex-cli
command provides a way to print the debug output for most
of the principle types in the regex-automata
crate. This can be useful for
debugging purposes when working on the regex
project, or even if you just
want a better look at a regex object's internal representation. For example,
the following two commands compare and contrast the differences in the NFA for
.
and (?-u:.)
:
$ regex-cli debug thompson '.' --no-table
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 10
000003: \x80-\xBF => 11
000004: \xA0-\xBF => 3
000005: \x80-\xBF => 3
000006: \x80-\x9F => 3
000007: \x90-\xBF => 5
000008: \x80-\xBF => 5
000009: \x80-\x8F => 5
000010: sparse(\x00-\t => 11, \x0B-\x7F => 11, \xC2-\xDF => 3, \xE0 => 4, \xE1-\xEC => 5, \xED => 6, \xEE-\xEF => 5, \xF0 => 7, \xF1-\xF3 => 8, \xF4 => 9)
000011: capture(pid=0, group=0, slot=1) => 12
000012: MATCH(0)
transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\x7F], 3 => [\x80-\x8F], 4 => [\x90-\x9F], 5 => [\xA0-\xBF], 6 => [\xC0-\xC1], 7 => [\xC2-\xDF], 8 => [\xE0], 9 => [\xE1-\xEC], 10 => [\xED], 11 => [\xEE-\xEF], 12 => [\xF0], 13 => [\xF1-\xF3], 14 => [\xF4], 15 => [\xF5-\xFF], 16 => [EOI])
)
And now for (?-u:.)
:
$ regex-cli debug thompson -b '(?-u:.)' --no-table
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 3
000003: sparse(\x00-\t => 4, \x0B-\xFF => 4)
000004: capture(pid=0, group=0, slot=1) => 5
000005: MATCH(0)
transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\xFF], 3 => [EOI])
)
To make things a bit more concise, we use --no-table
to omit some extra
metadata about the size of the NFA and the time required to build it.
In the second example, we also pass the -b/--no-utf8-syntax
flag. Without
it, the command returns an error because patterns are compiled with default
settings. The default setting is to forbid any pattern that can possibly match
invalid UTF-8. Since (?-u:.)
matches any byte except for \n
, it can match
invalid UTF-8. Thus, you have to say, "I am explicitly okay with matching
invalid UTF-8."
This command shows how to run a search with multiple patterns with each containing capture groups. The output shows the value of each matching group.
$ regex-cli find capture meta -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' -y 'best_album="Blow Your Face Out"'
parse time: 81.541µs
translate time: 52.035µs
build meta time: 805.696µs
search time: 426.391µs
total matches: 1
0:{ 0: 0..31/best_album="Blow\x20Your\x20Face\x20Out", 1/key: 0..10/best_album, 2/val: 12..30/Blow\x20Your\x20Face\x20Out }
In this case, meta
refers to the regex engine. It can be a number of other
things, including lite
for testing the regex-lite
crate. Also, capture
refers to the kind of search. You can also just ask for the match
which will
print the overall match and not the capture groups:
$ regex-cli find match meta -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' -y 'best_album="Blow Your Face Out"'
parse time: 67.067µs
translate time: 40.005µs
build meta time: 586.163µs
search time: 291.633µs
total matches: 1
0:0:31:best_album="Blow\x20Your\x20Face\x20Out"
Since not all regex engines support capture groups, using match
will open up
the ability to test other regex engines such as hybrid
.
Finally, the -p/--pattern
flag specifies a pattern and the -y/--haystack
flag provides a haystack to search as a command line argument. One can also omit
the -y/--haystack
flag and provide a file path to search instead:
$ echo 'best_album="Blow Your Face Out"' > haystack
$ regex-cli find match hybrid -p '(?m)^(?<key>[[:word:]]+)="(?<val>[^"]+)"$' -p $'(?m)^(?<key>[[:word:]]+)=\'(?<val>[^\']+)\'$' haystack
parse time: 60.278µs
translate time: 43.832µs
compile forward nfa time: 462.148µs
compile reverse nfa time: 56.275µs
build forward hybrid time: 6.532µs
build reverse hybrid time: 4.089µs
build regex time: 4.899µs
cache creation time: 18.59µs
search time: 54.653µs
total matches: 1
0:0:31:best_album="Blow\x20Your\x20Face\x20Out"
One particularly useful command in regex-cli
is regex-cli generate serialize
. It takes care of generating and writing a fully compiled DFA to
a file, and then producing Rust code that deserializes it. The command line
provides oodles of options, including all options found in the regex-automata
crate for building the DFA in code.
Let's walk through a complete end-to-end example. We assume regex-cli
is
already installed per instructions above. Let's start with an empty binary
Rust project:
$ mkdir regex-dfa-test
$ cd regex-dfa-test
$ cargo init --bin
Now add a dependency on regex-automata
. Technically, the only feature that
needs to be enabled for this example is dfa-search
, but we include std
as
well to get some conveniences like std::error::Error
implementations and also
optimizations. But you can drop std
and just use alloc
or even drop alloc
too altogether if necessary.
$ cargo add regex-automata --features std,dfa-search
Now we can generate a DFA with regex-cli
. This will create three files: the
little endian binary serialization of the DFA, the big endian version and a
simple Rust source file for lazily deserializing the DFA via a static into a
regex_automata::util::lazy::Lazy
:
regex-cli generate serialize sparse dfa \
--minimize \
--shrink \
--start-kind anchored \
--rustfmt \
--safe \
SIMPLE_WORD_FWD \
./src/ \
"\w"
We pass a number of flags here. There are even more available, and generally speaking, there is at least one flag for each configuration knob available in the library. This means that it should be possible to configure the DFA in any way you might expect to be able to in the code. We can briefly explain the flags we use here though:
--minimize
applies a DFA minimization algorithm to try and shrink the size of the DFA as much as possible. In some cases it can make a big difference, but not all. Minimization can also be extremely expensive, but given that this is an offline process and presumably done rarely, it's usually a good trade off to make.--shrink
uses heuristics to make the size of the NFA smaller in some cases. This doesn't impact the size of the DFA, but it can make determinization (the process of converting an NFA into a DFA) faster at the cost of making NFA construction slower. This can make overall DFA generation time faster.--start-kind anchored
says to build a DFA that only supports anchored searches. (That is, every match must have a start offset equivalent to the start of the search.) Without this, DFAs support both anchored and unanchored searches, and that in turn can make them much bigger than they need to be if you only need one or the other.--rustfmt
will runrustfmt
on the generated Rust code.--safe
will use only safe code for deserializing the DFA. This may be slower, but it is a one time cost. If you find that deserializing the DFA is too slow, then dropping this option will use alternative APIs that may result in undefined behavior if the given DFA is not valid. (Every DFA generated byregex-cli
is intended to be valid. So not using--safe
should always be correct, but it's up to you whether it's worth doing.)
The final three positional arguments are as follows:
SIMPLE_WORD_FWD
is the name of the variable in the Rust source code for the DFA, and it is also used in generating the names of the files produced by this command../src/
is the directory to write the files.\w
is the regex pattern to build the DFA for. More than one may be given!
Once the DFA is generated, you should see three new files in ./src/
:
$ ls -l src/
total 32
-rw-rw-r-- 1 andrew users 45 May 28 22:04 main.rs
-rw-rw-r-- 1 andrew users 11095 May 30 10:24 simple_word_fwd.bigendian.dfa
-rw-rw-r-- 1 andrew users 11095 May 30 10:24 simple_word_fwd.littleendian.dfa
-rw-rw-r-- 1 andrew users 711 May 30 10:24 simple_word_fwd.rs
At this point, you just need to add the appropriate mod
definition in
main.rs
and use the DFA:
use regex_automata::{dfa::Automaton, Anchored, Input};
use crate::simple_word_fwd::SIMPLE_WORD_FWD as DFA;
mod simple_word_fwd;
fn main() {
let input = Input::new("ω").anchored(Anchored::Yes);
println!("is a word: {:?}", DFA.try_search_fwd(&input));
let input = Input::new("☃").anchored(Anchored::Yes);
println!("not a word: {:?}", DFA.try_search_fwd(&input));
}
And now run the program:
$ cargo run
Compiling regex-dfa-test v0.1.0 (/home/andrew/tmp/regex-dfa-test)
Finished dev [unoptimized + debuginfo] target(s) in 0.17s
Running `target/debug/regex-dfa-test`
is a word: Ok(Some(HalfMatch { pattern: PatternID(0), offset: 2 }))
not a word: Ok(None)
There are a few other things worth mentioning:
- The above generates a "sparse" DFA. This sacrifices search performance in favor of (potentially much) smaller DFAs. One can also generate a "dense" DFA to get faster searches but larger DFAs.
- Above, we generated a "dfa," but one can also generate a "regex." The
difference is that a DFA can only find the end of a match (or start of a match
if the DFA is reversed), where as a regex will generate two DFAs: one for
finding the end of a match and then another for finding the start. One can
generate two DFAs manually and stitch them together in the code, but generating
a
regex
will take care of this for you.