The RegExpSyntax structure provides an abstract-syntax-tree
representation of regular expressions. Its main purpose is to serve
as a common interchange format between different front-ends (which
implement different RE specification languages) and different
back-ends (which implement different compilation/searching algorithms).
It is also possible, however, to use it as a way to directly
specify a regular expression for a back-end engine.
Synopsis
signature REGEXP_SYNTAX
structure RegExpSyntax : REGEXP_SYNTAX
Interface
exception CannotCompile
structure CharSet : ORD_SET where type Key.ord_key = char
datatype syntax
= Group of syntax
| Alt of syntax list
| Concat of syntax list
| Interval of (syntax * int * int option)
| MatchSet of CharSet.set
| NonmatchSet of CharSet.set
| Char of char
| Begin
| End
val optional : syntax -> syntax
val closure : syntax -> syntax
val posClosure : syntax -> syntax
val fromRange : char * char -> CharSet.set
val addRange : CharSet.set * char * char -> CharSet.set
val allChars : CharSet.set
val alnum : CharSet.set
val alpha : CharSet.set
val ascii : CharSet.set
val blank : CharSet.set
val cntl : CharSet.set
val digit : CharSet.set
val graph : CharSet.set
val lower : CharSet.set
val print : CharSet.set
val punct : CharSet.set
val space : CharSet.set
val upper : CharSet.set
val word : CharSet.set
val xdigit : CharSet.set
Description
exception CannotCompile
This exception is meant to be raised by back-ends when they encounter a feature that they cannot handle.
structure CharSet : ORD_SET where type Key.ord_key = char
This substructure implements sets of 8-bit characters. Currently it is implemented using sorted lists (i.e., using the ListSetFn functor), but that may change in the future.
datatype syntax
This datatype defines the abstract syntax of regular expressions that is supported by the library. The constructors are defined as follows:
- Group re: defines a match group (i.e., it produces a corresponding match-tree node for the input matched by re).
- Alt [re1, re2, …, ren]: matches any of re1, re2, …, ren. If the list is empty, then it matches nothing.
- Concat [re1, re2, …, ren]: matches the concatenation of re1, re2, …, ren. If the list is empty, then it matches the empty string.
- Interval (re, n, NONE): matches re repeated at least n times.
- Interval (re, n, SOME m): matches re repeated from n to m times.
- MatchSet cs: matches a single character that is in the set cs.
- NonmatchSet cs: matches a single character that is not in the set cs.
- Char c: matches the single character c.
- Begin: matches the beginning of the input stream.
- End: matches the end of the input stream.
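As an illustration of how the constructors compose, the following sketch builds the syntax tree for a pattern along the lines of ^(ab|cd){2,3}$. The helper literal is hypothetical (it is not part of the library), and the CM.make line in the comment is an assumption about how the library is loaded in the SML/NJ REPL.

```sml
(* Assumption: in the SML/NJ REPL the library is loaded first with
 *   CM.make "$/regexp-lib.cm";
 *)
structure S = RegExpSyntax

(* Hypothetical helper (not part of the library): turn a literal
 * string into a Concat of Char nodes, one per character. *)
fun literal (s : string) : S.syntax =
      S.Concat (List.map S.Char (String.explode s))

(* Roughly ^(ab|cd){2,3}$ : the Group node asks the back-end to
 * record a match-tree node for each alternative that is matched. *)
val re : S.syntax =
      S.Concat [
          S.Begin,
          S.Interval (S.Group (S.Alt [literal "ab", literal "cd"]), 2, SOME 3),
          S.End
        ]
```

A value such as re is what a front-end would hand to a back-end for compilation; a back-end that does not support one of the constructors used here is expected to raise CannotCompile.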
val optional : syntax -> syntax
optional re is equivalent to Interval(re, 0, SOME 1).
val closure : syntax -> syntax
closure re is equivalent to Interval(re, 0, NONE).
val posClosure : syntax -> syntax
posClosure re is equivalent to Interval(re, 1, NONE).
val fromRange : char * char -> CharSet.set
fromRange (c1, c2) returns the set containing the characters in the range from c1 to c2 (inclusive). This expression raises the Size exception if c2 < c1.
val addRange : CharSet.set * char * char -> CharSet.set
addRange (cs, c1, c2) adds the set of characters in the range from c1 to c2 (inclusive) to cs. This expression raises the Size exception if c2 < c1.
val allChars : CharSet.set
is the set of all 8-bit characters.
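The repetition helpers and the range functions can be combined as in the following minimal sketch, which describes a lower-case hexadecimal literal with an optional suffix. The names hexLower, hexLiteral, and reversed are illustrative only, and the CM.make line in the comment is an assumption about how the library is loaded.

```sml
(* Assumption: in the SML/NJ REPL the library is loaded first with
 *   CM.make "$/regexp-lib.cm";
 *)
structure S = RegExpSyntax

(* The characters 0-9 plus a-f, built from two inclusive ranges. *)
val hexLower = S.addRange (S.fromRange (#"0", #"9"), #"a", #"f")

(* "0x", then one or more hex digits, then an optional "L" suffix. *)
val hexLiteral : S.syntax =
      S.Concat [
          S.Char #"0", S.Char #"x",
          S.posClosure (S.MatchSet hexLower),
          S.optional (S.Char #"L")
        ]

(* Reversed bounds are rejected with the Size exception. *)
val reversed = (S.fromRange (#"z", #"a"); false) handle Size => true
```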
POSIX Character Classes
The RegExpSyntax structure pre-defines the following character sets,
which are part of the POSIX regular-expression standard (plus a couple
of extras):
val alnum : CharSet.set
is the set of letters and digits.
val alpha : CharSet.set
is the set of letters.
val ascii : CharSet.set
is the set of characters c such that 0 <= ord c <= 127.
val blank : CharSet.set
is the set of #"\t" and space.
val cntl : CharSet.set
is the set of non-printable characters.
val digit : CharSet.set
is the set of decimal digits.
val graph : CharSet.set
is the set of visible characters (does not include space).
val lower : CharSet.set
is the set of lower-case letters.
val print : CharSet.set
is the set of printable characters (includes space).
val punct : CharSet.set
is the set of visible characters other than letters and digits.
val space : CharSet.set
is the set of #"\t", #"\r", #"\n", #"\v", #"\f", and space.
val upper : CharSet.set
is the set of upper-case letters.
val word : CharSet.set
is the set of letters, digits, and #"_".
val xdigit : CharSet.set
is the set of hexadecimal digits.
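The predefined sets are ordinary CharSet.set values, so they can be passed to MatchSet and NonmatchSet or extended with the usual ORD_SET operations. The sketch below is one possible way to use them; the names identStart, identifier, token, digits, and number are illustrative, and the CM.make line in the comment is an assumption about how the library is loaded.

```sml
(* Assumption: in the SML/NJ REPL the library is loaded first with
 *   CM.make "$/regexp-lib.cm";
 *)
structure S = RegExpSyntax
structure CS = S.CharSet

(* A letter or an underscore, roughly [[:alpha:]_]. *)
val identStart = CS.add (S.alpha, #"_")

(* One start character followed by any number of word characters. *)
val identifier : S.syntax =
      S.Concat [S.MatchSet identStart, S.closure (S.MatchSet S.word)]

(* One or more non-whitespace characters, roughly [^[:space:]]+. *)
val token : S.syntax = S.posClosure (S.NonmatchSet S.space)

(* A simple decimal number: digits with an optional fractional part. *)
val digits = S.posClosure (S.MatchSet S.digit)
val number : S.syntax =
      S.Concat [digits, S.optional (S.Concat [S.Char #".", digits])]
```

The sketch refers to the sets through the qualified name S rather than opening the structure, because RegExpSyntax.print is a character set and opening the structure would shadow the top-level print function.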