The Camlp4 grammar system is a way to parse streams with more power and
more convenience than simple stream parsers. A grammar deals with
operator precedences, associativity and left factorization, and
its entries are extensible.
The Camlp4 library provides a module ``Grammar'' and a syntax
extension file ``pa_extend.cmo'' to program grammars. All
details on the modules, types and functions are described in chapter 8.
A grammar can be created using the function ``Grammar.gcreate''.
It takes a lexer as parameter. A good candidate is the function
``Plexer.gmake'', but users can also create their own lexer,
provided that it is of type ``Token.glexer''.
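For instance, a minimal sketch (this and the other examples of this chapter are written in the normal OCaml syntax; the name ``gram'' is chosen only for the examples below):

  let gram = Grammar.gcreate (Plexer.gmake ());;
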
A grammar holds entries. An entry can be created using the function
``Grammar.Entry.create''. The first parameter is the grammar,
the second one is a string used to name the entry in error messages
and in entry dumps. Entries are created empty, i.e. they raise
``Stream.Error'' when called. An entry is composed of entry
precedence levels, the first one being the least precedent and the
last one the most.
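A sketch, building on the grammar ``gram'' created above (the entry name ``expr'' is chosen only for illustration; its result type will be fixed by its first extension):

  let expr = Grammar.Entry.create gram "expr";;
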
Lastly, a stream can be parsed by an entry using the function
``Grammar.Entry.parse''. In case of syntax error, the
exception ``Stream.Error'' is raised, encapsulated in
the exception ``Stdpp.Exc_located'', which gives the location
of the error.
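For example, a sketch of a parsing function for the hypothetical entry ``expr'' above, assuming locations are pairs of character positions as described later in this chapter:

  let parse_string s =
    try Grammar.Entry.parse expr (Stream.of_string s) with
      Stdpp.Exc_located ((bp, ep), exc) ->
        (* Report where the error occurred, then re-raise the real exception. *)
        Printf.eprintf "Parse error at characters %d-%d\n" bp ep;
        raise exc
  ;;
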
An entry can be extended using the function ``Grammar.extend''.
But its interface is quite complicated and it must be used with
appropriate type constraints, so the Camlp4 library provides a file,
named ``pa_extend.cmo'', compatible with ``pa_o.cmo'' and
``pa_r.cmo'', which creates a new instruction doing this work.
This instruction is ``EXTEND'', which has the following format:
  EXTEND
    { GLOBAL : global-list ; }
    entry : { position } extension ;
    ...
    entry : { position } extension ;
  END
``EXTEND'', ``GLOBAL'' and ``END'' are keywords.
There are some other keywords in this instruction, all in uppercase.
The entry is the name of an entry variable (a simple identifier,
possibly prefixed by a module name). It is extended according to the
rules given by the extension. The optional position tells how to
insert the extension among the entry precedence levels: by default,
the extension is done at the first precedence level (creating this
level if the entry is empty).
The optional global-list restricts the list of global
entries (separated by spaces) extended in the ``EXTEND''
instruction. The other entries are created locally. By default, all
entries are global and must correspond to entry variables visible at
the point of the ``EXTEND'' instruction.
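For example, in the following sketch (which anticipates the rule syntax described below, and assumes an entry variable ``command'' created beforehand with ``Grammar.Entry.create''), only ``command'' is global, while ``arg'' is created as a local entry of this ``EXTEND'' instruction:

  EXTEND
    GLOBAL: command;
    command: [ [ "print"; i = arg -> i ] ];
    arg: [ [ i = INT -> int_of_string i ] ];
  END;;
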
A position can be:
- FIRST: The extension is inserted at the beginning of the precedence levels.
- LAST: The extension is inserted at the end of the precedence levels.
- BEFORE label: The extension is inserted before the precedence level so labelled.
- AFTER label: The extension is inserted after the precedence level so labelled.
- LEVEL label: The extension is inserted at the precedence level so labelled.
Only the case LEVEL extends already existing levels: the other
cases create new levels.
The syntax of an entry extension is:
  [ { label } { associativity } level-rules
  | ...
  | { label } { associativity } level-rules ]
The optional label identifies the current level for possible
future extensions. The optional associativity can be
LEFTA, RIGHTA or NONA for respectively left,
right and no associativity: the default is left associativity.
The level-rules have the format:
  [ { pattern = } symbol ; ... { pattern = } symbol { -> action }
  | ...
  | { pattern = } symbol ; ... { pattern = } symbol { -> action } ]
The pattern is a simple pattern of the language, limited to
identifiers, the wildcard (character ``underscore'') and tuples of
patterns. If not present, it is equivalent to the wildcard.
The action is an expression of the language in which the patterns are
bound to the results of their corresponding symbols; in addition, the
variable ``loc'' is bound to the source location of the rule.
The action part is optional; by default it is the value ``()''.
A symbol is either:
- A string: meaning a keyword. If this keyword does not exist in
the current grammar, it is added. The syntax of the keyword must
respect the rules of the lexer. Its type is string.
- An entry name (simple identifier or prefixed by a module name):
meaning a call to this entry. Its type is the entry result type. The
entry name can be followed by the keyword LEVEL and a string,
meaning a direct call to that level.
- The keyword SELF: meaning the entry being extended (first level).
- The keyword NEXT: meaning the entry being extended, at the next level.
- A token pattern: token patterns are identifiers starting with
an uppercase letter, optionally followed by a string; they recognize
tokens generated by the lexer. See for example the constructors
generated by the predefined lexer (section ??), lexing the
OCaml normal and revised syntaxes.
- A metasymbol: among OPT (optional symbol) and LIST0 and LIST1
(lists of a symbol, with zero or at least one element, optionally
followed by SEP and a separator symbol).
- A level-rules: meaning an anonymous entry with only one level.
- A symbol between parentheses: meaning the symbol itself, the
parentheses being used only for grouping.
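As an illustration, here is a sketch extending the hypothetical entry ``expr'' of the previous examples with three labelled levels; it assumes the grammar was created with ``Plexer.gmake'', so that the INT token constructor and the quoted keywords are available:

  EXTEND
    expr:
      [ "add" LEFTA
        [ x = SELF; "+"; y = SELF -> x + y
        | x = SELF; "-"; y = SELF -> x - y ]
      | "mult" LEFTA
        [ x = SELF; "*"; y = SELF -> x * y ]
      | "simple"
        [ i = INT -> int_of_string i
        | "("; x = SELF; ")" -> x ] ];
  END;;

With this extension, the function ``parse_string'' sketched earlier, applied to the string "1 + 2 * 3", should return 7. Once the level ``add'' exists, the LEVEL position can be used to add a rule to it:

  EXTEND
    expr: LEVEL "add"
      [ [ x = SELF; "mod"; y = SELF -> x mod y ] ];
  END;;
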
Entries are improved stream parsers. Stream parsing uses recursive
descent (something close to LL(1)). Each precedence level can be seen
as an independent extensible stream parser.
In a precedence level, the rules are factorized, i.e. several rules
can start with the same symbols, and the parser is represented by a
tree of symbols whose leaves are either semantic actions or dead
ends.
When an extension is done, the new rules are inserted in the
precedence level tree. The system does not check whether the entry
precedence level will ``work'', nor whether it would work together
with the other levels and the other entries.
This insertion is generally done at the end of the rules list. The
special cases are symbols of token type, which are always inserted
before the other symbols (non-terminals, lists, options, etc.), and
tokens with parameters, which are inserted before tokens without
parameters (the parameter being the empty string).
There is no factorization of rules between different levels nor, a fortiori,
with other entries.
While parsing an entry, the system tries the first precedence
level. Then, if it fails, it tries the second one, etc. If the last
one fails, the entry fails and the caller of this entry resumes,
either by raising a parse error (the parsing is then stopped), or
by trying other rules.
Entry rules can be deleted using the function
``Grammar.delete_rule''. But, like ``Grammar.extend'', it is not
documented. One must use the instruction ``DELETE_RULE'', which
generates a call to this function. This instruction is a syntax
extension, loaded together with the instruction ``EXTEND'' by the
file ``pa_extend.cmo''.
The syntax of ``DELETE_RULE'' is:
  DELETE_RULE
    entry : symbol ; ... symbol
  END
The rule defined by the list of symbols (without the semantic action),
if found, is deleted.
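For example, the subtraction rule added to the hypothetical entry ``expr'' earlier could be removed with:

  DELETE_RULE expr: SELF; "-"; SELF END;;

The symbols must be written exactly as in the rule to delete.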
The lexers can create tokens of any type you want. In the lexer
interface (see below), you have to provide the way a token pattern
(which is of type (string * string)) is compared with
a token value of your token type.
A simple token type is (string * string), like the token pattern
type. Its advantage is that a default function is provided in the
module Token for the way the token patterns are parsed.
The token patterns are defined in the EXTEND statement, as
terminal symbols in the rules. In a rule, an identifier in uppercase
characters, not among the reserved ones (LIST0, OPT, and so on), is
a token pattern. The identifier is the first string of the token
pattern type. If the symbol is followed by a string, that string is
the second string of the token pattern type; if not, the second
string is empty. Moreover, a string alone is also a token pattern,
with the first string being empty.
Examples: the following symbol rules in an EXTEND statement:
IDENT "bar"
INT "32"
x = IDENT
i = INT
"let"
correspond (resp.) to the following token patterns:
("IDENT", "bar")
("INT", "32")
("IDENT", "")
("INT", "")
("", "let")
A lexer appropriate for the Grammar.gcreate function is a value
of type Token.glexer, which is a record with the following fields:
- tok_func: the main lexing function. It is called once
when calling Grammar.Entry.parse with the input character
stream. The simplest way to create this function is to use either the
function Token.lexer_func_of_parser if you use a stream parser,
or Token.lexer_func_of_ocamllex if you prefer to use an ocamllex
function. Notice that in both cases, your lexer function must return
a pair of a token value (of your token type) and a token location
(a pair of integers); see the sketch after this list.
- tok_using: a function taking a token pattern as
parameter. When the EXTEND statement first scans all the symbols in
all the rules, it calls this function with each token pattern
encountered (a token pattern is of type (string * string)). This
function allows you to check the pattern (that the constructor is
among the ones the lexer generates: raise an exception if not) and to
enter keywords (if you have keywords) in your keyword table (if you
use a keyword table).
- tok_removing: like tok_using, but called when a rule is removed.
- tok_match: tells how a token pattern is parsed. It must
return a function taking a token as parameter and returning the
resulting string, or raising the exception Stream.Failure. Use the
function Token.default_match for the default behavior when your
token type is (string * string). Remark: for efficiency, if you
write your own version, write it as a function taking a token pattern
and returning a function for each case, *not* as a function with two
parameters.
- tok_text: gives the name of a token pattern, to be used in
error messages. You can use the function Token.lexer_text
provided in the module Token.
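Putting the fields together, here is a minimal sketch of a lexer with token type (string * string), built from a toy character-stream function; it assumes the five-field version of Token.glexer described above (adjust if your version of the record defines additional fields):

  (* Toy token reader: returns a token and its location, the location
     being the pair of character positions around the token. *)
  let rec next_token (strm : char Stream.t) =
    match Stream.peek strm with
      Some (' ' | '\t' | '\n') -> Stream.junk strm; next_token strm
    | _ ->
        let bp = Stream.count strm in
        let tok =
          match Stream.peek strm with
            Some ('a'..'z' as c) -> Stream.junk strm; ("LIDENT", String.make 1 c)
          | Some c -> Stream.junk strm; ("", String.make 1 c)
          | None -> ("EOI", "")
        in
        (tok, (bp, Stream.count strm))
  ;;

  (* Keyword table, filled by tok_using when EXTEND statements are executed;
     tok_removing is left as a no-op in this sketch. *)
  let kwt = Hashtbl.create 53;;

  let lexer =
    {Token.tok_func = Token.lexer_func_of_parser next_token;
     Token.tok_using =
       (fun (con, prm) -> if con = "" then Hashtbl.replace kwt prm () else ());
     Token.tok_removing = (fun _ -> ());
     Token.tok_match = Token.default_match;
     Token.tok_text = Token.lexer_text};;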