A lexer generator

 

This section describes mosmllex, a lexer generator closely based on camllex from the Caml Light implementation by Xavier Leroy. This documentation is likewise based on that of camllex.

Overview

Given a set of regular expressions with attached semantic actions, mosmllex produces a lexical analyser in the style of lex. If file lexer.lex contains a specification of a lexical analyser, then executing


    mosmllex lexer.lex

produces a file lexer.sml containing Moscow ML code for the lexical analyser. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.

Lexer buffers are an abstract data type implemented in the library unit Lexing. The functions createLexerString and createLexer from unit Lexing create lexer buffers that read from a character string or from any reading function, respectively.
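
As a rough illustration of how a generated lexing function and a lexer buffer fit together, the following sketch collects all tokens found in a string. The helper name tokens and its parameters lexfn (a generated lexing function) and isEof (a predicate recognising the end-of-file token) are assumptions made for this example; only Lexing.createLexerString comes from the library.

```sml
(* tokens : (Lexing.lexbuf -> 'a) -> ('a -> bool) -> string -> 'a list
   Hypothetical helper: call the lexing function lexfn repeatedly on a
   buffer reading from the string s until isEof accepts a token. *)
fun tokens lexfn isEof s =
    let val lexbuf = Lexing.createLexerString s
        fun loop acc =
            let val t = lexfn lexbuf
            in if isEof t then rev acc else loop (t :: acc) end
    in loop [] end;
```

With a hypothetical generated unit Lexer whose entry point is called Token, the call tokens Lexer.Token isEof "some input" would return the tokens preceding end-of-file, in input order.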

When used in conjunction with a parser generated by mosmlyac (see Section 18), the semantic actions compute a value belonging to the datatype token defined by the generated parsing unit.

Example uses of mosmllex can be found in directories calc and lexyacc under mosml/examples.

Hints on using mosmllex

A lexer definition must have a rule to recognize the special symbol eof, meaning end-of-file. In general, a lexer must be able to handle all characters that can appear in the input. This is usually achieved by putting the wildcard case _ at the very end of the lexer definition. If the lexer is to be used with e.g. MS Windows, MS DOS or MacOS files, remember to provide a rule for the carriage-return symbol \r. Most often \r will be treated the same as \n, e.g. as whitespace.
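
A minimal sketch of a lexer definition that follows these hints might look as follows. The token type and the entry point name Token are hypothetical and chosen only for this example.

```sml
{
 (* Hypothetical token type, for illustration only *)
 datatype token = OTHER of char | EOF
}

rule Token = parse
    [` ` `\t` `\n` `\r`]   { Token lexbuf }
  | eof                    { EOF }
  | _                      { OTHER (Lexing.getLexemeChar lexbuf 0) }
;
```

The first case skips whitespace, treating carriage-return like newline; the eof case handles end-of-file; and the final wildcard case ensures that every remaining character is handled.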

Do not use string constants to define many keywords; this may produce large lexer programs. It is better to let the lexer scan keywords the same way as identifiers and then use an auxiliary function to distinguish between them. For an example, see the keyword function in mosml/examples/lexyacc/Lexer.lex.
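
The idea can be sketched as below. The token constructors are hypothetical; the function would be placed in the header section and applied to the matched string by the single rule that scans identifiers, with an action such as keyword (Lexing.getLexeme lexbuf).

```sml
(* Hypothetical token type, for illustration only *)
datatype token = IF | THEN | ELSE | NAME of string

(* Classify a scanned identifier: a keyword token if s is a reserved
   word, an ordinary name otherwise. *)
fun keyword s =
    case s of
        "if"   => IF
      | "then" => THEN
      | "else" => ELSE
      | _      => NAME s;
```

This way the generated automaton needs only the states for one identifier pattern, however many keywords there are.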

Syntax of lexer definitions

The format of a lexer definition is as follows:


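    { header }
    let abbrev = regexp
    ...
    let abbrev = regexp
    rule entrypoint =
      parse regexp { action }
          | ...
          | regexp { action }
    and entrypoint =
      parse ...
    and ...
    ;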

Comments are delimited by (* and *), as in SML. An abbreviation (abbrev) for a regular expression may refer only to abbreviations that strictly precede it in the list of abbreviations; in particular, abbreviations cannot be recursive.
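
The following sketch shows abbreviations in use. Note that ident refers only to the abbreviations letter and digit that precede it. All names (digit, letter, ident, Token, the token type and the exception) are hypothetical and chosen for illustration.

```sml
{
 open Lexing;

 (* Hypothetical token type and error exception, for illustration only *)
 datatype token = INT of string | NAME of string | EOF
 exception LexicalError of string * int
}

let digit  = [`0`-`9`]
let letter = [`a`-`z` `A`-`Z`]
let ident  = letter (letter | digit)*

rule Token = parse
    [` ` `\t` `\n` `\r`]+  { Token lexbuf }
  | digit+                 { INT (getLexeme lexbuf) }
  | ident                  { NAME (getLexeme lexbuf) }
  | eof                    { EOF }
  | _                      { raise LexicalError ("Illegal symbol",
                                                 getLexemeStart lexbuf) }
;
```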

Header

The header section is arbitrary Moscow ML text enclosed in curly braces { and }. It can be omitted. If it is present, the enclosed text is copied as is at the beginning of the output file lexer.sml. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.

Entry points

The names of the entry points must be valid ML identifiers.

Regular expressions

The regular expressions regexp are in the style of lex, but with a more ML-like syntax.

`char`

A character constant, with a syntax similar to that of Moscow ML character constants; see Section 17.3.5. Match the denoted character.

_

Match any character.

eof

Match the end of the lexer input.

"string"

A string constant, with a syntax similar to that of Moscow ML string constants; see Section 17.3.6. Match the denoted string.

[ character-set ]

Match any single character belonging to the given character set. Valid character sets are: single character constants `c`; ranges of characters `c1` - `c2` (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.

[ ^ character-set ]

Match any single character not belonging to the given character set.

regexp *

Match the concatenation of zero or more strings that match regexp. (Repetition).

regexp +

Match the concatenation of one or more strings that match regexp. (Positive repetition).

regexp ?

Match either the empty string, or a string matching regexp. (Option).

regexp1 | regexp2

Match any string that matches either regexp1 or regexp2. (Alternative).

regexp1 regexp2

Match the concatenation of two strings, the first matching regexp1, the second matching regexp2. (Concatenation).

abbrev

Match the same strings as the regexp in the most recent let-binding of abbrev.

( regexp )

Match the same strings as regexp.

The operators * and + have highest precedence, followed by ?, then concatenation, then | (alternative).
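
To make the precedence rules concrete, here is a small sketch of a lexer definition using a few hypothetical abbreviations; the comment above each binding describes how the expression groups. The entry point Token merely returns the matched string.

```sml
(* Groups as a followed by zero or more b: matches a, ab, abb, ... *)
let abStar  = `a` `b`*

(* Groups as a, or b followed by c: matches a and bc, but not ac *)
let aOrBc   = `a` | `b` `c`

(* Groups as a followed by an optional b: matches a and ab *)
let abOpt   = `a` `b`?

(* Parentheses override precedence: matches ac and bc *)
let grouped = (`a` | `b`) `c`

rule Token = parse
    (abStar | aOrBc | abOpt | grouped)  { SOME (Lexing.getLexeme lexbuf) }
  | eof                                 { NONE }
  | _                                   { Token lexbuf }
;
```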

Actions

An action is an arbitrary Moscow ML expression. An action is evaluated in a context where the identifier lexbuf is bound to the current lexer buffer. Some typical uses of lexbuf in conjunction with the operations on lexer buffers (provided by the Lexing library unit) are listed below.

Lexing.getLexeme lexbuf

Return the matched string.

Lexing.getLexemeChar lexbuf n

Return the n'th character in the matched string. The first character has number 0.

Lexing.getLexemeStart lexbuf

Return the absolute position in the input text of the beginning of the matched string. The first character read from the input text has position 0.

Lexing.getLexemeEnd lexbuf

Return the absolute position in the input text of the end of the matched string. The first character read from the input text has position 0.

entrypoint lexbuf

Recursively call the lexer on the given entry point, where entrypoint is the name of an entry point in the same lexer definition (possibly the current one). Useful for lexing nested comments, for example.
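
The following sketch, loosely modelled on the approach in mosml/examples/lexyacc, combines these operations to skip nested comments. The token type, the exception, the counter commentDepth and the entry point names Token and SkipComment are assumptions made for this example.

```sml
{
 open Lexing;

 (* Hypothetical token type and error exception, for illustration only *)
 datatype token = CHAR of char | EOF
 exception LexicalError of string * int * int

 (* Current comment nesting depth *)
 val commentDepth = ref 0;
}

rule Token = parse
    "(*"   { (commentDepth := 1; SkipComment lexbuf; Token lexbuf) }
  | eof    { EOF }
  | _      { CHAR (getLexemeChar lexbuf 0) }

and SkipComment = parse
    "*)"   { (commentDepth := !commentDepth - 1;
              if !commentDepth = 0 then () else SkipComment lexbuf) }
  | "(*"   { (commentDepth := !commentDepth + 1; SkipComment lexbuf) }
  | eof    { raise LexicalError ("Comment not terminated",
                                 getLexemeStart lexbuf,
                                 getLexemeEnd lexbuf) }
  | _      { SkipComment lexbuf }
;
```

Token hands control to SkipComment when a comment opens and resumes normal lexing once the nesting depth returns to zero; an end-of-file inside a comment is reported together with the positions of the current match.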

Character constants

 

A character constant in the lexer definition is delimited by ` (backquote) characters. The two backquotes enclose either a space or a printable character c, different from ` and \, or an escape sequence; the syntax is similar to that of Moscow ML character constants (see Section 17.3.5).

String constants

 

A string constant is a (possibly empty) sequence of characters delimited by " (double quote) characters.


    "strchar ... strchar"

A string character strchar is a space, or a printable character c (except " and \), or an escape sequence; the syntax is similar to that of Moscow ML string constants (see Section 17.3.6).


Moscow ML 1.44