The `UTF8` structure provides support for working with UTF-8 encoded strings. UTF-8 is a way to represent Unicode
code points in an 8-bit character type while being backward
compatible with the ASCII encoding for 7-bit characters.
The encoding scheme uses one to four bytes as follows:
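| Wide Character Bits | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
|---|---|---|---|---|
| `00000000 00000000 0xxxxxxx` | `0xxxxxxx` | | | |
| `00000000 00000yyy yyxxxxxx` | `110yyyyy` | `10xxxxxx` | | |
| `00000000 zzzzyyyy yyxxxxxx` | `1110zzzz` | `10yyyyyy` | `10xxxxxx` | |
| `000wwwzz zzzzyyyy yyxxxxxx` | `11110www` | `10zzzzzz` | `10yyyyyy` | `10xxxxxx` |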
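The one-to-four byte scheme can be observed with `encode`. The following is a minimal sketch, assuming SML/NJ with the library that provides `UTF8` already loaded; `hexBytes` is a hypothetical helper written for this illustration and is not part of the `UTF8` interface.

```sml
(* Assumption: the UTF8 structure is in scope, e.g., after loading the
 * SML/NJ utility library with CM.make "$/smlnj-lib.cm". *)

(* hexBytes is a hypothetical helper (not part of UTF8): it renders a
 * string as space-separated, two-digit hexadecimal byte values. *)
fun hexBytes (s : string) : string =
      String.concatWith " "
        (List.map
          (fn c => StringCvt.padLeft #"0" 2 (Int.fmt StringCvt.HEX (Char.ord c)))
          (String.explode s))

val oneByte    = hexBytes (UTF8.encode 0wx41)     (* U+0041  : "41" *)
val twoBytes   = hexBytes (UTF8.encode 0wxE9)     (* U+00E9  : "C3 A9" *)
val threeBytes = hexBytes (UTF8.encode 0wx20AC)   (* U+20AC  : "E2 82 AC" *)
val fourBytes  = hexBytes (UTF8.encode 0wx1F600)  (* U+1F600 : "F0 9F 98 80" *)
```

The lead byte of each result carries the length marker (`0`, `110`, `1110`, or `11110`), and every following byte starts with `10`.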
There are three additional well-formedness restrictions on UTF-8 encodings that were introduced in the Unicode 3.1 and 3.2 standards.
- Characters cannot be larger than `0x10FFFF` (the maximum code point).
- Characters must be in the shortest encoding for the code point (e.g., using two bytes to encode an ASCII character is invalid).
- Surrogate pairs should be encoded as a single four-byte character instead of as two three-byte sequences.
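As a rough illustration of these restrictions, the sketch below classifies byte strings by attempting to count their characters with `size`, and probes `encode` with a value above the maximum code point. The `status` datatype and the `check` function are hypothetical names introduced for this example, and the outcomes noted in the comments are what the descriptions in this section lead one to expect rather than verified output.

```sml
(* Assumption: the UTF8 structure is in scope, e.g., after loading the
 * SML/NJ utility library with CM.make "$/smlnj-lib.cm". *)

(* status and check are hypothetical names for this illustration. *)
datatype status = WELL_FORMED of int | INCOMPLETE | INVALID

(* Count the characters in s, mapping the two UTF8 exceptions to values. *)
fun check (s : string) : status =
      WELL_FORMED (UTF8.size s)
        handle UTF8.Incomplete => INCOMPLETE
             | UTF8.Invalid => INVALID

(* Complete three-byte encoding of U+20AC; expected: WELL_FORMED 1 *)
val ok = check "\226\130\172"

(* Two-byte encoding of U+0000, which is not the shortest form;
 * expected: INVALID *)
val overlong = check "\192\128"

(* Three-byte sequence missing its final byte; expected: INCOMPLETE *)
val truncated = check "\226\130"

(* A value above maxCodePoint; encode is documented to raise Invalid. *)
val tooLarge = (UTF8.encode 0wx110000; false) handle UTF8.Invalid => true
```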
Synopsis
signature UTF8
structure UTF8 :> UTF8
Interface
type wchar = word
val maxCodePoint : wchar
val replacementCodePoint : wchar
exception Incomplete
exception Invalid
datatype decode_strategy = STRICT | REPLACE
val getWC : decode_strategy -> (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
val getWCStrict : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
val encode : wchar -> string
val isAscii : wchar -> bool
val toAscii : wchar -> char
val fromAscii : char -> wchar
val toString : wchar -> string
val size : string -> int
val size' : substring -> int
val explode : string -> wchar list
val implode : wchar list -> string
val map : (wchar -> wchar) -> string -> string
val app : (wchar -> unit) -> string -> unit
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
val all : (wchar -> bool) -> string -> bool
val exists : (wchar -> bool) -> string -> bool
Description
`type wchar = word`
The type of a Unicode code point. Note that we use the `word` type for this because SML/NJ does not currently have a wide-character type. If such a type is introduced, then this type definition will likely change.
`val maxCodePoint : wchar`
The maximum code point in the Unicode character set (`0wx10FFFF`).
`val replacementCodePoint : wchar`
The replacement code point (`0wx00FFFD`) for invalid or incomplete sequences when using the `REPLACE` decode strategy.
`exception Incomplete`
This exception is raised when certain operations are applied to incomplete strings (i.e., strings that end in the middle of a multi-byte UTF-8 character encoding).
`exception Invalid`
This exception is raised when invalid UTF-8 encodings, such as non-shortest-length encodings, are encountered.
`datatype decode_strategy = …`
This datatype is used to specify the strategy for handling incomplete or incorrect UTF-8 sequences. The constructors are interpreted as follows:
- `STRICT`: Raise the `Incomplete` exception when the end of input is encountered in the middle of an otherwise acceptable multi-byte encoding and raise the `Invalid` exception when an invalid byte is encountered. `STRICT` is the default decoding option.
- `REPLACE`: Replace invalid input with the `replacementCodePoint` using the W3C standard "substitution of maximal subparts" algorithm.
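The difference between the two strategies can be sketched by decoding the same input twice. This is a minimal example, assuming `Substring.getc` from the Basis Library as the underlying character reader; `decode` is a hypothetical helper, and the results noted in the comments follow from the constructor descriptions above rather than from a verified run.

```sml
(* Assumption: the UTF8 structure is in scope, e.g., after loading the
 * SML/NJ utility library with CM.make "$/smlnj-lib.cm". *)

(* decode is a hypothetical helper: it reads every wide character from s
 * using the given strategy, with Substring.getc as the character reader. *)
fun decode (strategy : UTF8.decode_strategy) (s : string) : UTF8.wchar list = let
      val getWC = UTF8.getWC strategy Substring.getc
      fun loop (ss, acc) = (case getWC ss
             of SOME(wc, rest) => loop (rest, wc :: acc)
              | NONE => List.rev acc
            (* end case *))
      in
        loop (Substring.full s, [])
      end

(* The letter a, a stray continuation byte (0x80), then the letter b. *)
val input = "a\128b"

(* Expected under REPLACE: [0wx61, 0wxFFFD, 0wx62] *)
val replaced = decode UTF8.REPLACE input

(* Expected under STRICT: the Invalid exception, captured here as NONE *)
val strict = SOME (decode UTF8.STRICT input) handle UTF8.Invalid => NONE
```

`REPLACE` suits input that should be displayed even when damaged, while `STRICT` suits input that must be rejected when malformed.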
`val getWC : decode_strategy -> (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader`
`getWC strategy getc` returns a wide-character reader for the character reader `getc`. The resulting reader handles incomplete and invalid encodings according to the `strategy` argument.
`val getWCStrict : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader`
`getWCStrict getc` returns a wide-character reader for the character reader `getc`. The resulting reader raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. This function is equivalent to the expression `getWC STRICT`.
`val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader`
This function is equivalent to `getWCStrict`. It is included for backwards compatibility.
`val encode : wchar -> string`
`encode wc` returns the UTF-8 encoding of the wide character `wc`. This expression raises the `Invalid` exception if `wc` is greater than the maximum Unicode code point.
`val isAscii : wchar -> bool`
`isAscii wc` returns `true` if, and only if, `wc` is an ASCII character.
`val toAscii : wchar -> char (* truncates to 7-bits *)`
`toAscii wc` converts `wc` to an 8-bit character by truncating `wc` to its low seven bits.
`val fromAscii : char -> wchar (* truncates to 7-bits *)`
`fromAscii c` converts the 8-bit character `c` to a wide character in the ASCII range (the high bit of `c` is ignored).
`val toString : wchar -> string`
`toString wc` returns a printable string representation of a wide character as a Unicode escape sequence.
`val size : string -> int`
`size s` returns the number of UTF-8 encoded Unicode characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding.
`val size' : substring -> int`
`size' ss` returns the number of UTF-8 encoded Unicode characters in the substring `ss`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding.
`val explode : string -> wchar list`
`explode s` returns the list of UTF-8 encoded Unicode characters that comprise the string `s`.
`val implode : wchar list -> string`
`implode wcs` returns the UTF-8 encoded string that represents the list `wcs` of Unicode code points. This expression raises the `Invalid` exception if it encounters an invalid encoding.
`val map : (wchar -> wchar) -> string -> string`
`map f s` maps the function `f` over the UTF-8 encoded characters in the string `s` to produce a new UTF-8 string. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `implode (List.map f (explode s))`.
`val app : (wchar -> unit) -> string -> unit`
`app f s` applies the function `f` to the UTF-8 encoded characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.app f (explode s)`.
`val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a`
`fold f init s` folds the function `f` from left to right over the UTF-8 encoded characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.foldl f init (explode s)`.
`val all : (wchar -> bool) -> string -> bool`
`all pred s` returns `true` if, and only if, the function `pred` returns `true` for all of the UTF-8 encoded characters in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `false`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.all pred (explode s)` when `s` only contains complete characters.
`val exists : (wchar -> bool) -> string -> bool`
`exists pred s` returns `true` if, and only if, the function `pred` returns `true` for at least one UTF-8 encoded character in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `true`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.exists pred (explode s)` when `s` only contains complete characters.
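
The string-level operations can be combined as in the following sketch, which uses only functions from the interface above. The sample string and the binding names are chosen for illustration, and the comments state what the descriptions imply for this well-formed input.

```sml
(* Assumption: the UTF8 structure is in scope, e.g., after loading the
 * SML/NJ utility library with CM.make "$/smlnj-lib.cm". *)

(* The word "naive" with U+00EF in place of the i; U+00EF takes two bytes. *)
val s = "na\195\175ve"

val nBytes = String.size s   (* 6 bytes *)
val nChars = UTF8.size s     (* 5 code points *)

(* explode and implode convert between a string and its code points. *)
val codePoints = UTF8.explode s
val roundTrip = (UTF8.implode codePoints = s)

(* Replace every non-ASCII code point with a question mark: "na?ve" *)
val asciiOnly =
      UTF8.map (fn wc => if UTF8.isAscii wc then wc else UTF8.fromAscii #"?") s

(* Count the non-ASCII code points with a left-to-right fold: 1 *)
val nonAscii =
      UTF8.fold (fn (wc, n) => if UTF8.isAscii wc then n else n + 1) 0 s

val allAscii = UTF8.all UTF8.isAscii s                   (* false *)
val hasNonAscii = UTF8.exists (not o UTF8.isAscii) s     (* true *)
```

Note that `String.size` counts bytes while `UTF8.size` counts code points, so the two differ whenever the string contains a multi-byte character.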