Chapter 4 Grammars
The Camlp4 grammar system is a way to parse streams with more power and convenience than plain stream parsers. A grammar handles operator precedence, associativity and left factorization, and its entries are extensible.

4.1 Grammars and entries

The Camlp4 library provides a module ``Grammar'' and a syntax extension file ``pa_extend.cmo'' to program grammars. All details on modules, types and functions are described in chapter 8.

A grammar can be created using the function ``Grammar.gcreate''. It takes a lexer as parameter. A good candidate is the function ``Plexer.gmake'', but users can create their own lexer, provided that it has type ``Token.glexer''.

A grammar holds entries. An entry can be created using the function ``Grammar.Entry.create''. The first parameter is the grammar, the second one a string used to name the entry in error messages and in entry dumps. Entries are created empty, i.e. raising ``Stream.Error'' when called. An entry is composed of precedence levels, the first one being the least precedent and the last one the most.

Lastly, a stream can be parsed by an entry using the function ``Grammar.Entry.parse''. In case of syntax error, the exception ``Stream.Error'' is raised, encapsulated in the exception ``Stdpp.Exc_located'', which gives the location of the error.
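
For example, creating a grammar and an entry, then parsing a character stream, looks as follows (a sketch: the names ``g'' and ``expr'' and the input string are invented, and the entry must first be extended as described in section 4.2):
     let g = Grammar.gcreate (Plexer.gmake ());;
     let expr = Grammar.Entry.create g "expr";;
     (* "expr" is empty at this point; once it has been extended
        with EXTEND (section 4.2), it can parse a character stream: *)
     let v = Grammar.Entry.parse expr (Stream.of_string "1 + 2");;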

4.2 Extension

An entry can be extended using the function ``Grammar.extend''. But its interface is quite complicated and it must be used with appropriate type constraints; the Camlp4 library therefore provides a file named ``pa_extend.cmo'', compatible with ``pa_o.cmo'' and ``pa_r.cmo'', which creates a new instruction doing this work.

This instruction is ``EXTEND'', which has the following format:
EXTEND
{ GLOBAL : global-list ; }
entry : { position } extension ;
...
entry : { position } extension ;
END


EXTEND, GLOBAL and END are keywords. There are some other keywords in this instruction, all in uppercase.

The entry is the name of an entry variable (a simple identifier, possibly prefixed by a module name). It is extended according to the rules given by extension. The optional position tells how to insert the extension among the entry precedence levels: by default, the extension is done at the first precedence level (creating this level if the entry is empty).

The optional global-list restricts the list of the global entries (separated by spaces) extended in the ``EXTEND'' instruction; the other entries are created locally. By default, all entries are global and must correspond to entry variables visible at the point of the ``EXTEND'' instruction.
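
For example (a sketch using the rule syntax described below; the entries are invented), the following extends the global entry ``expr'' while the helper entry ``factor'' stays local to the instruction:
     EXTEND
       GLOBAL: expr;
       expr: [ [ f = factor -> f ] ];
       factor: [ [ i = INT -> int_of_string i ] ];
     END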

Syntax of a position

A position can be:
FIRST
meaning that the extension is inserted as a new first precedence level.
LAST
meaning that the extension is inserted as a new last precedence level.
BEFORE followed by a label (a string)
meaning that a new level is created before the level identified by this label.
AFTER followed by a label
meaning that a new level is created after the level identified by this label.
LEVEL followed by a label
meaning that the extension is done inside the level identified by this label.
Only the case LEVEL extends already existing levels: the other cases create new levels.

Syntax of an extension

The syntax of an entry extension is:
[ { label } { associativity } level-rules  
| ...  
| { label } { associativity } level-rules ]

The optional label identifies the current level for possible future extensions. The optional associativity can be LEFTA, RIGHTA or NONA, for respectively left, right and no associativity; the default is left associativity.

Syntax of a level-rules

The level-rules have the format:
[ { pattern = } symbol ; ... { pattern = } symbol { -> action }  
| ...  
| { pattern = } symbol ; ... { pattern = } symbol { -> action } ]

The pattern is a simple pattern of the language, limited to identifiers, the wildcard (the ``underscore'' character) and tuples of patterns. If not present, it is equivalent to the wildcard.

The action is an expression of the language, in which the patterns are bound to the results of their corresponding symbols; in addition, the variable ``loc'' is bound to the source location of the rule. The action part is optional; by default it is the value ``()''.
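
For example (a sketch; the entry and the rule are invented), a rule binding two symbols and using ``loc'' in its action could be:
     EXTEND
       expr:
         [ [ i = INT; "+"; j = INT ->
               (* "loc" holds the source location of the rule *)
               (loc, int_of_string i + int_of_string j) ] ];
     END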

A symbol is either:
A string
meaning: a keyword. If this keyword does not exist in the current grammar, it is added. The syntax of the keyword must respect the rules of the lexer. Its type is string.
An entry name
(a simple identifier, possibly prefixed by a module name) meaning a call to this entry. Its type is the entry result type. The entry name can be followed by the keyword LEVEL and a string, meaning a direct call to that level.
The keyword SELF
meaning the entry being extended (first level).
The keyword NEXT
meaning the entry being extended at the next level.
A token pattern
Token patterns are identifiers starting with an uppercase letter, optionally followed by a string; they recognize tokens generated by the lexer. See, for example, the constructors generated by the predefined lexer (section ??), lexing the OCaml normal and revised syntaxes.
A metasymbol
among:
LIST0 followed by a symbol
meaning a list, possibly empty, of this symbol. If further followed by the keyword SEP and another symbol, the list is parsed with this second symbol as separator. Its type is the list of the symbol type.
LIST1 followed by a symbol
same as LIST0, but the list holds at least one element.
OPT followed by a symbol
meaning this symbol or nothing. Its type is the option of the symbol type.
A level-rules
meaning an anonymous entry with only one level.
A symbol between parentheses
meaning the same as the symbol alone.
A complete sketch combining several of these symbols is given below.
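
The following EXTEND sketch (invented for illustration; the entries ``expr'' and ``expr_list'' are assumptions) combines labelled levels, associativities, SELF, keywords, token patterns and metasymbols:
     EXTEND
       expr:
         [ "add" LEFTA
           [ x = SELF; "+"; y = SELF -> x + y
           | x = SELF; "-"; y = SELF -> x - y ]
         | "mult" LEFTA
           [ x = SELF; "*"; y = SELF -> x * y ]
         | "simple"
           [ i = INT -> int_of_string i
           | "("; e = SELF; ")" -> e ] ];
       expr_list:
         [ [ l = LIST0 expr SEP "," -> l ] ];
     END
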
4.3 Parsed language

Entries are improved stream parsers. Streams use recursive descent parsing (something close to LL(1)). Each precedence level can be seen as an independent extensible stream parser.

In a precedence level, the rules are factorized, i.e. several rules can start with the same symbols, and the parser is represented by a tree of symbols whose leaves are either semantic actions or dead ends.

When an extension is done, the new rules are inserted in the precedence level tree. The system does not check whether the entry precedence level will ``work'', nor whether it would work together with the other levels and the other entries.

This insertion is generally done at the end of the rules list. The special cases are for symbols of token type, which are always inserted before the other symbols (non-terminals, lists, options, etc.), and for tokens with parameters, which are inserted before tokens without parameters (whose parameter is the empty string).

There is no factorization of rules between different levels nor, a fortiori, with other entries.

While parsing an entry, the system tries the first precedence level. Then, if it fails, it tries the second one, etc. If the last one fails, the entry fails and the caller of this entry resumes, either by raising a parse error (the parsing is then stopped), or by trying other rules.

4.4 Deletion

Entry rules can be deleted using the function ``Grammar.delete_rule''. But, like ``Grammar.extend'', it is not documented: one must use the instruction ``DELETE_RULE'', generating a call to this function. This instruction is a syntax extension, loaded together with the instruction ``EXTEND'' by the file ``pa_extend.cmo''.

The syntax of ``DELETE_RULE'' is:
DELETE_RULE
entry : symbol ; ... symbol
END

The rule defined by the list of symbols (semantic actions omitted), if found, is deleted.
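
For example (assuming the addition rule from the sketch of section 4.2 was previously defined in the entry ``expr''), the following deletes it:
     DELETE_RULE
       expr: SELF; "+"; SELF
     END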

4.5 Writing a lexer

Lexers can create tokens of any type you want. In the lexer interface (see below), you have to provide the way a token pattern (which has type (string * string)) is compared with a token value of your token type.

A simple token type is (string * string), like the token pattern type. Its advantage is that a default function is provided in the module ``Token'' for the way token patterns are matched against tokens.

The token patterns are defined in the EXTEND statement, as terminal symbols in the rules. In a rule, an identifier in uppercase letters not among the reserved ones (LIST0, OPT, and so on) is a token pattern. The identifier is the first string of the token pattern type. If the symbol is followed by a string, that string is the second string of the token pattern type; if not, the second string is empty. Moreover, a string alone is also a token pattern, with the first string being empty.

Examples: the following symbol rules in an EXTEND statement:
     IDENT "bar"
     INT "32"
     x = IDENT
     i = INT
     "let"
correspond (resp.) to the following token patterns:
     ("IDENT", "bar")
     ("INT", "32")
     ("IDENT", "")
     ("INT", "")
     ("", "let")
A lexer appropriate for the ``Grammar.gcreate'' function is a value of type ``Token.glexer'', a record with the following fields:
