This section describes mosmllex, a lexer generator closely based on camllex from the Caml Light implementation by Xavier Leroy. This documentation is likewise based on that of camllex.
Given a set of regular expressions with attached semantic actions, mosmllex produces a lexical analyser in the style of lex. If file lexer.lex contains a specification of a lexical analyser, then executing
    mosmllex lexer.lex
produces a file lexer.sml containing Moscow ML code for the lexical analyser. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the library
unit Lexing. The functions
createLexerString and
createLexer from unit Lexing create lexer buffers that read
from a character string, or any reading function, respectively.
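
The two ways of creating a lexer buffer can be sketched as follows. This is a minimal illustration, not part of the original text: stringReader is a hypothetical helper, and the contract assumed for the reading function (it is called as read buff n, stores at most n characters in buff starting at index 0, and returns the number stored, with 0 meaning end of input) should be checked against the Lexing signature.

```sml
(* In the interactive system, first: load "Lexing"; *)

(* Lexer buffer reading directly from a character string. *)
val buf1 = Lexing.createLexerString "let x = 42"

(* Hypothetical helper: a reading function over a string.
   Assumed contract: store at most n characters in buff from index 0
   and return how many were stored; 0 signals end of input. *)
fun stringReader (s : string) : CharArray.array -> int -> int =
    let val pos = ref 0
        fun read buff n =
            let val k = Int.min (n, size s - !pos)
                fun copy i =
                    if i < k
                    then (CharArray.update (buff, i, String.sub (s, !pos + i));
                          copy (i + 1))
                    else ()
            in copy 0; pos := !pos + k; k end
    in read end

(* Lexer buffer fed by an arbitrary reading function. *)
val buf2 = Lexing.createLexer (stringReader "let x = 42")
```

Either buffer can then be passed to a generated lexing function, for instance one named after an entry point Token in the lexer definition.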
When used in conjunction with a parser generated by mosmlyac (see Section 18), the semantic actions compute a value belonging to the datatype token defined by the generated parsing unit.
Example uses of mosmllex can be found in directories calc and lexyacc under mosml/examples.
A lexer definition must have a rule to recognize the special symbol
eof, meaning end-of-file. In general, a lexer must be able to
handle all characters that can appear in the input. This is usually
achieved by putting the wildcard case _ at the very end of the
lexer definition. If the lexer is to be used with e.g. MS Windows,
MS DOS or MacOS files, remember to provide a rule for the
carriage-return symbol \r. Most often \r will be
treated the same as \n, e.g. as whitespace.
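
A minimal sketch of a lexer definition that follows this advice. The token type, the constructor names WORD and EOF, and the exception IllegalChar are hypothetical and only serve the illustration; with mosmlyac the token type would come from the generated parsing unit.

```sml
{
  (* Hypothetical token type and error exception, for illustration only. *)
  datatype token = WORD of string | EOF
  exception IllegalChar of char
}

rule Token = parse
    [` ` `\t` `\n` `\r`]  { Token lexbuf }   (* \r is skipped like other whitespace *)
  | [`a`-`z`]+            { WORD (Lexing.getLexeme lexbuf) }
  | eof                   { EOF }            (* end-of-file is handled explicitly *)
  | _                     { raise IllegalChar (Lexing.getLexemeChar lexbuf 0) }
                                             (* wildcard comes last and catches the rest *)
;
```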
Do not use string constants to define many keywords; this may produce large lexer programs. It is better to let the lexer scan keywords the same way as identifiers and then use an auxiliary function to distinguish between them. For an example, see the keyword function in mosml/examples/lexyacc/Lexer.lex.
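
The idea can be sketched as below. This is not the keyword function from the examples directory, only a small stand-in; the token type and its constructors are hypothetical.

```sml
(* Hypothetical token type; with mosmlyac it would come from the parsing unit. *)
datatype token = IF | THEN | ELSE | NAME of string

(* Classify a scanned word: reserved words map to their own tokens,
   everything else is treated as an identifier. *)
fun keyword (s : string) : token =
    case s of
        "if"   => IF
      | "then" => THEN
      | "else" => ELSE
      | _      => NAME s
```

The lexer then needs only one rule for words, whose action applies keyword to the matched string, instead of one string constant per keyword.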
The format of a lexer definition is as follows:
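
The overall layout can be sketched as below. The skeleton is reconstructed from the components described in the rest of this section and from the camllex-style syntax, so the exact keywords (let, rule, parse, and, the closing semicolon) should be read as assumptions; header, abbrev, regexp, entrypoint and action are placeholders.

```text
{ header }
let abbrev = regexp
...
let abbrev = regexp
rule entrypoint =
  parse regexp { action }
      | ...
      | regexp { action }
and entrypoint =
  parse ...
and ...
;
```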

Comments are delimited by (* and *), as in SML. An abbreviation (abbrev) for a regular expression may refer only to abbreviations that strictly precede it in the list of abbreviations; in particular, abbreviations cannot be recursive.
The header section is arbitrary Moscow ML text enclosed in curly
braces { and }. It can be omitted. If it is present,
the enclosed text is copied as is at the beginning of the output file
lexer.sml. Typically, the header section contains the
open directives required by the actions, and possibly some
auxiliary functions used in the actions.
The names of the entry points must be valid ML identifiers.
The regular expressions regexp are in the style of lex, but with a more ML-like syntax.
The operators * and + have highest precedence, followed by ?, then concatenation, then | (alternative).
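
The effect of these precedences can be illustrated with a few abbreviations. This is only a sketch: the names digit, letter, number and atom are hypothetical, the let form for abbreviations is assumed, and a complete lexer definition would also need at least one rule.

```sml
let digit  = [`0`-`9`]
let letter = [`a`-`z` `A`-`Z`]

(* + binds tighter than ?, and ? binds tighter than concatenation, so this
   reads as (digit+) ((`.` (digit+))?).  The parentheses around the fraction
   are needed: without them, ? would apply to digit+ only. *)
let number = digit+ (`.` digit+)?

(* Concatenation binds tighter than |, so this reads as
   (letter ((letter | digit)*)) | number. *)
let atom = letter (letter | digit)* | number
```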
An action is an arbitrary Moscow ML expression. An action is evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses of lexbuf in conjunction with the operations on lexer buffers (provided by the Lexing library unit) are listed below.
getLexeme lexbuf
    Return the matched string.
getLexemeChar lexbuf n
    Return the nth character in the matched string. The first character has number 0.
getLexemeStart lexbuf
    Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.
getLexemeEnd lexbuf
    Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.
entrypoint lexbuf
    Recursively call the lexer on the given entry point, where entrypoint is the name of another entry point in the same lexer definition. Useful for lexing nested comments, for example.
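
A sketch of how these operations might be combined to skip nested comments. The token type, the counter depth and the exception LexicalError are hypothetical names introduced for the illustration.

```sml
{
  open Lexing;

  (* Hypothetical declarations, for illustration only. *)
  datatype token = NAME of string | EOF
  exception LexicalError of string * int * int   (* message, start, end *)
  val depth = ref 0                              (* current comment nesting *)
}

rule Token = parse
    [` ` `\t` `\n` `\r`]  { Token lexbuf }
  | "(*"                  { (depth := 1; Comment lexbuf; Token lexbuf) }
  | [`a`-`z`]+            { NAME (getLexeme lexbuf) }
  | eof                   { EOF }
  | _                     { raise LexicalError ("illegal character",
                                                getLexemeStart lexbuf,
                                                getLexemeEnd lexbuf) }

and Comment = parse
    "(*"  { (depth := !depth + 1; Comment lexbuf) }
  | "*)"  { (depth := !depth - 1;
             if !depth > 0 then Comment lexbuf else ()) }
  | eof   { raise LexicalError ("unterminated comment",
                                getLexemeStart lexbuf,
                                getLexemeEnd lexbuf) }
  | _     { Comment lexbuf }
;
```

Token hands control to Comment when it sees an opening delimiter; Comment calls itself until the nesting depth returns to zero, and Token then resumes scanning.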
A character constant in the lexer definition is delimited by `
(backquote) characters. The two backquotes enclose either a space or
a printable character c, different from ` and \,
or an escape sequence:

A string constant is a (possibly empty) sequence of characters delimited by " (double quote) characters.

A string character strchar is a space, or a
printable character c (except " and \), or an
escape sequence:
