Extended Backus Naur Form - EBNF
There is a need to formulate programming languages very
formally,
in order toensure that they can be recognised by computer programs. To
help with this a standard form of notation is used. The original format
was adapted from the area of linguistics during the formulation of the
programming language Algol 60 in 1958. To honour the leading figure in
this effort, Peter Naur, the revised notation was renamed from Backus
Normal
form to Backus Naur Form. It is widely known as simply BNF.
What is Backus Naur Form?
How do we write our definitions of C and
C++ elements
Most programming languages are today defined
as having grammars written in some extension of BNF. The precise
layout and use of meta-symbols varies. Here we consider an Extended
BNF (EBNF) typical of the variants used to describe the C programming
language
family.
A grammar written in this way defines rules of syntax.
These rules specify what sequences of symbols are allowed to occur in a
legal program and in what ways these symbols may be combined. They do not
specify how such sequences are to be interpreted. That requires
semantic
rules, which attach meaning to the syntactic structures of the grammar
All versions of BNF and EBNF have basically the same elements.
-
A set of rules is specified. These are known as production rules.
They define the patterns or sequences of symbols allowed in the
language.
-
Each production rule defines the pattern that represents a named structured
part of the language, such as an expression or a statement. The name of
such a part is called a non-terminal symbol in the language.
-
The basic building blocks of the language are symbols which stand for
themselves.
These can be individual characters or combinations such as keywords and
arithmetic operators. These basic elements of the language are called
terminal
symbols.
-
Each rule contains the name of the non-terminal being defined, followed
by some convention that says the rest of the rule now starts. This is then
followed by the sequence or alternative sequences allowed for that symbol.
A defining sequence can contain any terminal symbols allowed for that
language.
It can also contan non-terminal symbols defined in other rules. If a
non-terminal
is used without being defined, the grammar for the language being defined
is incomplete.
-
The definition of a rule can also contain the symbol being defined by that
rules as part of its definition. This is called recursive
definition.
Any symbol defined in this way must have an alternative definition which
does not contain itself, otherwise the grammar is again incomplete.
In the rules we write to define C and C++ in
these notes, we will write as follows:
-
Terminal symbols
-
will be written in bold, black type;
will be written in italic type;
on colour browsers they will appear in
red;
- The symbol separating
-
the non-terminal being defined and its definitions will be a space or,
more usually, the start of a new line followed by increased indentation;
thus
if-statement
if ( expression ) statement
defines a new symbol, if-statement to
be represented in a program
as the keyword if, followed by the terminal symbol (,
followed
by a sequence which matches the definition of the non-terminal
expression,
followed by the terminal ), followed by a sequence matching the
definition of a statement.
-
More than one pattern representing a particular symbol,
-
could be written in separate rules, each having the same defined symbol.
It is simpler, however, to write them as a series of definitions on
separate
lines, all indented by the same amount more than the symbol they define;
this allows definitions which are too long to fit on one line to continue
on the next line, with extra indentation to show that they are not
alternatives.
Thus:
if-statement
if ( expression ) statement
if ( expression ) statement
else statement
gives two rules for the appearance of an if-statement. The second
of these is split over two lines, for readability.
-
Optional symbols
-
which can be included or omitted in a sequence, are indicated by
a question mark, written in non-bold, roman type and following the optional
symbol. It is necessary to make it clear when a symbol is not intended
to form part of the sequence being described, but is merely part of a
notation
(meta-language) used to help in defining what can appear. On
browsers
with colout, this sort of meta symbol will additionally appear in
green. Where a series of symbols must either all be present or all left
out, they will be enclosed in parentheses, with the question mark following
the closing parenthesis. Like the question mark these parentheses are
meta-symbols
and so will appear in the same format. Thus:
if-statement
if ( expression ) statement ( else statement )?
defines exactly the same possibilities as the previous rule for an
if-statement.
- Repetition of a symbol
-
within a sequence is defined by writing the symbol followed by an asterisk,
to mean that a sub-sequence matching that symbol should appear zero or
more times. It is sometimes useful to use a plus sign instead, to mean
"at least once and then as many times as are wanted". Both of these are
meta-symbols and so are written in the format used previously for the
option
symbol. Again, a longer sequence may be indicated as being repeated by
enclosing it in meta-language parentheses. Thus:
compound-statement
{
( declaration )*
( statement )*
}
gives the ANSI C (but not C++) syntax for a compound statement or block.
Back to the C notes index
Back to the C++ notes index