The Camlp4 grammar system is a way to parse streams with more power and
more convenience than simple stream parsers. A grammar deals with
operator precedences, associativity and left factorization, and
its entries are extensible.
The Camlp4 library provides a module ``Grammar'' and a syntax
extension file ``pa_extend.cmo'' to program grammars. All
details on the modules, types and functions are described in chapter 8.
A grammar can be created using the function ``Grammar.gcreate''.
It takes a lexer as parameter. A good candidate is the function
``Plexer.gmake'', but users can also create their own lexer,
provided that it is of type ``Token.glexer''.
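For instance, a minimal sketch (this and the other examples of this chapter are written in the normal OCaml syntax; the name ``gram'' is chosen only for the examples below):

  let gram = Grammar.gcreate (Plexer.gmake ());;
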
A grammar holds entries. An entry can be created using the function
``Grammar.Entry.create''. The first parameter is the grammar,
the second one is a string used to name the entry in error messages
and in entry dumps. Entries are created empty, i.e. they raise
``Stream.Error'' when called. An entry is composed of entry
precedence levels, the first one being the least precedent and the
last one the most.
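A sketch, building on the grammar ``gram'' created above (the entry name ``expr'' is chosen only for illustration; its result type will be fixed by its first extension):

  let expr = Grammar.Entry.create gram "expr";;
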
Lastly, a stream can be parsed by an entry using the function
``Grammar.Entry.parse''. In case of syntax error, the
exception ``Stream.Error'' is raised, encapsulated in
the exception ``Stdpp.Exc_located'', which gives the location
of the error.
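For example, a sketch of a parsing function for the hypothetical entry ``expr'' above, assuming locations are pairs of character positions as described later in this chapter:

  let parse_string s =
    try Grammar.Entry.parse expr (Stream.of_string s) with
      Stdpp.Exc_located ((bp, ep), exc) ->
        (* Report where the error occurred, then re-raise the real exception. *)
        Printf.eprintf "Parse error at characters %d-%d\n" bp ep;
        raise exc
  ;;
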
An entry can be extended using the function ``Grammar.extend''.
But its interface is quite complicated and it must be used with
appropriate type constraints, so the Camlp4 library provides a file,
named ``pa_extend.cmo'', compatible with ``pa_o.cmo'' and
``pa_r.cmo'', which creates a new instruction doing this work.
This instruction is ``EXTEND'', which has the following format:
  EXTEND
    { GLOBAL : global-list ; }
    entry : { position } extension ;
    ...
    entry : { position } extension ;
  END
``EXTEND'', ``GLOBAL'' and ``END'' are keywords.
There are some other keywords in this instruction, all in uppercase.
The entry is the name of an entry variable (a simple identifier,
possibly prefixed by a module name). It is extended according to the
rules given by the extension. The optional position tells how to
insert the extension among the entry precedence levels: by default,
the extension is done at the first precedence level (creating this
level if the entry is empty).
The optional global-list restricts the list of global
entries (separated by spaces) extended in the ``EXTEND''
instruction. The other entries are created locally. By default, all
entries are global and must correspond to entry variables visible at
the point of the ``EXTEND'' instruction.
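For example, in the following sketch (which anticipates the rule syntax described below, and assumes an entry variable ``command'' created beforehand with ``Grammar.Entry.create''), only ``command'' is global, while ``arg'' is created as a local entry of this ``EXTEND'' instruction:

  EXTEND
    GLOBAL: command;
    command: [ [ "print"; i = arg -> i ] ];
    arg: [ [ i = INT -> int_of_string i ] ];
  END;;
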
A position can be:
- FIRST: The extension is inserted at the beginning of the precedence levels.
- LAST: The extension is inserted at the end of the precedence levels.
- BEFORE label: The extension is inserted before the precedence level so labelled.
- AFTER label: The extension is inserted after the precedence level so labelled.
- LEVEL label: The extension is inserted at the precedence level so labelled.
Only the case LEVEL extends already existing levels: the other
cases create new levels.
The syntax of an entry extension is:
  [ { label } { associativity } level-rules
  | ...
  | { label } { associativity } level-rules ]
The optional label identifies the current level for possible
future extensions. The optional associativity can be
LEFTA, RIGHTA or NONA for respectively left,
right and no associativity: the default is left associativity.
The level-rules have the format:
  [ { pattern = } symbol ; ... { pattern = } symbol { -> action }
  | ...
  | { pattern = } symbol ; ... { pattern = } symbol { -> action } ]
The pattern is a simple pattern of the language, limited to
identifiers, the wildcard (character ``underscore'') and tuples of
patterns. If not present, it is equivalent to the wildcard.
The action is an expression of the language in which the patterns are
bound to the results of their corresponding symbols; in addition, the
variable ``loc'' is bound to the source location of the rule.
The action part is optional; by default it is the value ``()''.
A symbol is either:
- A string: meaning a keyword. If this keyword does not exist in
the current grammar, it is added. The syntax of the keyword must
respect the rules of the lexer. Its type is string.
- An entry name (simple identifier or prefixed by a module name):
meaning a call to this entry. Its type is the entry result type. The
entry name can be followed by the keyword LEVEL and a string,
meaning a direct call to that level.
- The keyword SELF: meaning the entry being extended (first level).
- The keyword NEXT: meaning the entry being extended, at the next level.
- A token pattern: token patterns are identifiers starting with
an uppercase letter, optionally followed by a string; they recognize
tokens generated by the lexer. See for example the constructors
generated by the predefined lexer (section ??), lexing the
OCaml normal and revised syntaxes.
- A metasymbol: among OPT (optional symbol) and LIST0 and LIST1
(lists of a symbol, with zero or at least one element, optionally
followed by SEP and a separator symbol).
- A level-rules: meaning an anonymous entry with only one level.
- A symbol between parentheses: meaning the symbol itself, the
parentheses being used only for grouping.
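As an illustration, here is a sketch extending the hypothetical entry ``expr'' of the previous examples with three labelled levels; it assumes the grammar was created with ``Plexer.gmake'', so that the INT token constructor and the quoted keywords are available:

  EXTEND
    expr:
      [ "add" LEFTA
        [ x = SELF; "+"; y = SELF -> x + y
        | x = SELF; "-"; y = SELF -> x - y ]
      | "mult" LEFTA
        [ x = SELF; "*"; y = SELF -> x * y ]
      | "simple"
        [ i = INT -> int_of_string i
        | "("; x = SELF; ")" -> x ] ];
  END;;

With this extension, the function ``parse_string'' sketched earlier, applied to the string "1 + 2 * 3", should return 7. Once the level ``add'' exists, the LEVEL position can be used to add a rule to it:

  EXTEND
    expr: LEVEL "add"
      [ [ x = SELF; "mod"; y = SELF -> x mod y ] ];
  END;;
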
Entries are improved stream parsers. Stream parsing uses recursive
descent (something close to LL(1)). Each precedence level can be seen
as an independent extensible stream parser.
In a precedence level, the rules are factorized, i.e. several rules
can start with the same symbols, and the parser is represented by a
tree of symbols whose leaves are either semantic actions or dead
ends.
When an extension is done, the new rules are inserted in the
precedence level tree. The system does not check whether the entry
precedence level will ``work'', nor whether it would work together
with the other levels and the other entries.
This insertion is generally done at the end of the rules list. The
special cases are symbols of token type, which are always inserted
before the other symbols (non-terminals, lists, options, etc.), and
tokens with parameters, which are inserted before tokens without
parameters (the parameter being the empty string).
There is no factorization of rules between different levels nor, a fortiori,
with other entries.
While parsing an entry, the system tries the first precedence
level. Then, if it fails, it tries the second one, etc. If the last
one fails, the entry fails and the caller of this entry resumes,
either by raising a parse error (the parsing is then stopped), or
by trying other rules.
Entry rules can be deleted using the function
``Grammar.delete_rule''. But, like ``Grammar.extend'', it is not
documented. One must use the instruction ``DELETE_RULE'', which
generates a call to this function. This instruction is a syntax
extension, loaded together with the instruction ``EXTEND'' by the
file ``pa_extend.cmo''.
The syntax of ``DELETE_RULE'' is:
  DELETE_RULE
    entry : symbol ; ... symbol
  END
The rule defined by the list of symbols (without the semantic action),
if found, is deleted.
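For example, the subtraction rule added to the hypothetical entry ``expr'' earlier could be removed with:

  DELETE_RULE expr: SELF; "-"; SELF END;;

The symbols must be written exactly as in the rule to delete.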
The lexers can create tokens of any type you want. In the lexer
interface (see below), you have to provide the way a token pattern
(which is of type (string * string)) is compared with
a token value of your token type.
A simple token type is (string * string), like the token pattern
type. Its advantage is that a default function is provided in the
module Token for the way the token patterns are parsed.
The token patterns are defined in the EXTEND statement, as
terminal symbols in the rules. In a rule, an identifier in uppercase
characters, not among the reserved ones (LIST0, OPT, and so on), is
a token pattern. The identifier is the first string of the token
pattern type. If the symbol is followed by a string, that string is
the second string of the token pattern type; if not, the second
string is empty. Moreover, a string alone is also a token pattern,
with the first string being empty.
Examples: the following symbol rules in an EXTEND statement:
IDENT "bar"
INT "32"
x = IDENT
i = INT
"let"
correspond (resp.) to the following token patterns:
("IDENT", "bar")
("INT", "32")
("IDENT", "")
("INT", "")
("", "let")
A lexer appropriate for the Grammar.gcreate function is a value
of type Token.glexer, which is a record with the following fields:
- tok_func: the main lexing function. It is called once
when calling Grammar.Entry.parse with the input character
stream. The simplest way to create this function is to use either the
function Token.lexer_func_of_parser if you use a stream parser,
or Token.lexer_func_of_ocamllex if you prefer to use an ocamllex
function. Notice that in both cases, your lexer function must return
a pair of a token value (of your token type) and a token location
(a pair of integers); see the sketch after this list.
- tok_using: a function taking a token pattern as
parameter. When the EXTEND statement first scans all the symbols in
all the rules, it calls this function with each token pattern
encountered (a token pattern is of type (string * string)). This
function allows you to check the pattern (that the constructor is
among the ones the lexer generates: raise an exception if not) and to
enter keywords (if you have keywords) in your keyword table (if you
use a keyword table).
- tok_removing: like tok_using, but called when a rule is removed.
- tok_match: tells how a token pattern is parsed. It must
return a function taking a token as parameter and returning the
resulting string, or raising the exception Stream.Failure. Use the
function Token.default_match for the default behavior when your
token type is (string * string). Remark: for efficiency, if you
write your own version, write it as a function taking a token pattern
and returning a function for each case, *not* as a function with two
parameters.
- tok_text: gives the name of a token pattern, to be used in
error messages. You can use the function Token.lexer_text
provided in the module Token.
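Putting the fields together, here is a minimal sketch of a lexer with token type (string * string), built from a toy character-stream function; it assumes the five-field version of Token.glexer described above (adjust if your version of the record defines additional fields):

  (* Toy token reader: returns a token and its location, the location
     being the pair of character positions around the token. *)
  let rec next_token (strm : char Stream.t) =
    match Stream.peek strm with
      Some (' ' | '\t' | '\n') -> Stream.junk strm; next_token strm
    | _ ->
        let bp = Stream.count strm in
        let tok =
          match Stream.peek strm with
            Some ('a'..'z' as c) -> Stream.junk strm; ("LIDENT", String.make 1 c)
          | Some c -> Stream.junk strm; ("", String.make 1 c)
          | None -> ("EOI", "")
        in
        (tok, (bp, Stream.count strm))
  ;;

  (* Keyword table, filled by tok_using when EXTEND statements are executed;
     tok_removing is left as a no-op in this sketch. *)
  let kwt = Hashtbl.create 53;;

  let lexer =
    {Token.tok_func = Token.lexer_func_of_parser next_token;
     Token.tok_using =
       (fun (con, prm) -> if con = "" then Hashtbl.replace kwt prm () else ());
     Token.tok_removing = (fun _ -> ());
     Token.tok_match = Token.default_match;
     Token.tok_text = Token.lexer_text};;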