Introduction
The RegExp Library is designed for flexibility; it allows mixing and matching of different front-end syntax with back-end engines, as well as supporting arbitrary input sources. This flexibility, however, comes at the cost of making some of the simple applications a bit less obvious. This tutorial shows how the RegExp Library can be used in a variety of common applications.
Assembling an Regular Expression Matcher
Before we can do anything else, we must assemble a regular-expression
matcher. For the purposes of this tutorial, we use a combination
of the AwkSyntax front-end and the BackTrackEngine back-end.
structure RE = RegExpFn(
structure P = AwkSyntax
structure E = BackTrackEngine)
Match trees
Regular expressions may contain grouping
operators. When a pattern matches a string, these groups induce a nested
tree structure on the matched string.
The MatchTree structure defines
a polymorphic representation of this structure, along with a
number of utility functions for extracting information from
a match.
structure MT = MatchTree
Example: scanning tokens
The match function in the REGEXP signature allows one to distinguish
between a set of possible regular expression matches. One application of
this mechanism is a simple scanner. Let us define a datatype for tokens,
which can be white space, numbers, or identifiers.
datatype tok
= WS | NUM of IntInf.int | ID of string
We can then define the scanner as follows:
fun scanner getc gets = let
fun getMatch cons (MT.Match({pos, len}, _)) = cons (gets (pos, len))
in
RE.match [
("[ \t\n]+", getMatch (fn _ => WS)),
("[0-9]+", getMatch (fn s => NUM(valOf(IntInf.fromString s)))),
("[a-zA-Z][a-zA-Z0-9]*", getMatch ID)
] getc
end
Here the getc parameter is the standard character reader; we have also included
the gets parameter, which is a function of type
'strm * int -> string
for getting a string from a stream. For many input sources, the gets function
has an efficient and direct implementation, but it can also be implemented in
terms of the getc function as follows:
fun gets getc (strm, n) = let
fun getChrs (0, _, chrs) = String.implodeRev chrs
| getChrs (n, strm, chrs) = (case getc strm
of NONE => raise Fail "empty stream"
| SOME(c, strm) => getChrs (n-1, strm, c::chrs)
(* end case *))
in
getChrs (n, strm, [])
end;
Because this function is only called after the scanner function has matched
a sequence of n characters from strm, the "empty stream" case will not
occur for well behaving input streams.
Here is an example of using the scanner to tokenize strings, where we use the Basis Library substring type to implement the stream type:
fun tokens s = let
fun gets (ss, n) = Substring.string(Substring.slice (ss, 0, SOME n))
val scan = scanner Substring.getc gets
fun lp (ss, toks) = (case scan ss
of SOME(tok, ss') => lp (ss', tok::toks)
| NONE => List.rev toks
(* end case *))
in
lp (Substring.full s, [])
end;