Recursive Descent Parsing
This is just a brief introduction
to recursive descent parsing. A good compiler textbook
provides substantially more details. The problem we need to solve is as
following: Given a grammar in BNF, write a parser for it. One of the simplest
but at the same time very robust approach would proceed as follows:
This is the most "programmer-friendly" method of syntax analysis and
it you design a language it make sense to put some preliminary
transformation that allow to use this parsing strategy. It extremely
well integrates with lexical analyzer both in "one a time form" or a
separate pass" form.
A lot can be done on lexical analysis level. Pascal was the
first language deliberately limited to the grammar that can be parsed
using recursive decent. While in this are it went overboard it was a
pretty successful language so there is no real ground to suspect that
limiting yourself to recursive decent parser you limit the
attractiveness of your language.
We know that all strings in the language must be derived from the start
nonterminal S. Hence if we can write a procedure S()
which matches all strings derived from S, then we are done. But
how do we do this?
Let's generalize the above question somewhat. Instead of merely posing
the problem of writing a procedure which matches strings derived from the
specific start nonterminal S, let's look at the more general
problem of writing a procedure which matches strings derived from any nonterminal
N; i.e., for each nonterminal N, we would like to
construct a parsing procedure N() which matches all
strings derived from N.
If a nonterminal N has only a single grammar rule, then its
parsing procedure is easy to write. To match the rule we know that
we need to match each grammar symbol on its right-hand side (RHS). There
are two possibilities:
- If the RHS symbol is a terminal symbol, check that the current lookahead
matches that terminal symbol. If it does, then advance the input, setting
the current lookahead to the next input symbol. If it does not, signal
an error.
- For each occurrence of a nonterminal symbol on the RHS, call the
corresponding parsing procedure. If the parsing procedure
works as posited above, then it will match a prefix of the input with
a rule for the nonterminal.
If a nonterminal N has multiple grammar rules, then the parsing
procedure will need to decide which rule to use. It can do so by looking
at the current lookahead token to see which of the candidate rules can start
with the lookahead. If only a single rule can start with the lookahead,
then that rule is chosen. If it is always possible to predict a unique rule
for any nonterminal by looking ahead by at most one token, then the grammar
is said to be LL(1).
Hence a parsing function N() for nonterminal
N will simply contain logic to pick a rule based on the current lookahead,
followed by code to match each individual rule. Its structure will look
something like the following:
N() {
if (lookahead can start first rule for N) {
match rule 1 for N
}
else if (lookahead can start second rule for N) {
match rule 2 for N
}
... ...
else if (lookahead can start n'th rule for N) {
match rule n for N
else {
error();
}
}
Unfortunately, there are some problems with this simple scheme. A grammar
rule is said to be directly left recursive if the first
symbol on the RHS is identical to the LHS nonterminal. Hence, after a directly
left recursive rule has been selected, the first action of the corresponding
parsing procedure, will be to call itself
immediately, without consuming any of the input. It should
be clear that such a recursive call will never terminate. Hence a
recursive descent parser cannot be written for a grammar which
contains such directly (or indirectly) left recursive rules; in fact,
the grammar cannot be LL(1) in the presence of such rules.
Fortunately, it is possible to transform
the grammar to remove left recursive rules. Consider the
left recursive nonterminal A defined by the following
rules:
A: A alpha
A: beta
where alpha is nonempty and alpha and beta
stand for any sequence of terminal and nonterminal grammar symbols. Looking
at the above rules, it is clear that any string derived from A
must ultimately start with beta. After the beta, the
rest of the string must either be empty or consist of a sequence
of one or more alpha's. Using a new nonterminal A'
to represent the rest of the string we can represent the transformed
grammar as follows:
A: beta A'
A':/* empty */
A': alpha A'
The above rules are not left recursive.
An Example
Consider the following simple grammar for assignment statements with
expressions involving addition or subtraction:
1) stmt: ID ':=' expr
2) expr: expr '+' NUM
3) expr: expr '-' NUM
4) expr: NUM
This grammar allows statements like a:= 1 + 2 - 3. Since the
rules for expr are left recursive, we need to transform
the grammar to the following:
1) stmt: ID ':=' expr
2) expr: NUM exprRest
3) exprRest: '+' NUM exprRest
4) exprRest: '-' NUM exprRest
5) exprRest: /* empty */
Assuming that the current lookahead is in a global variable tok,
we can write the parsing procedures in pseudo-C as follows:
stmt() {
match(ID); match(':='); expr();
}
expr() {
match(NUM); exprRest();
}
exprRest() {
if (tok == '+') { /* rule 3 */
match('+'); match(NUM); exprRest();
}
else if (tok == '-') { /* rule 4 */
match('-'); match(NUM); exprRest();
} else {
/* rule 5 */
}
}
Note that exprRest() chooses rule 5 if the lookahead cannot
match any of the other rules for exprRest. It would have also
been correct to choose rule 5 only if the current lookahead was a token
which can follow exprRest, but the method used above
is simpler with the sole disadvantage that errors are detected later.
The match() procedure merely needs to check if the current
lookahead in tok matches its argument and advance the input:
match(Token t) {
if (tok == t) {
tok= nextToken();
} else {
error():
}
}
where we assume that nextToken() is the interface to the scanner
and returns the next token from the input stream.
All that remains is to make sure we startup and terminate the parser
correctly:
parse() {
tok= nextToken();
stmt();
match('<EOF>');
}
The first line merely primes the global variable used for the lookahead
token. The second line calls the parsing procedure for stmt.
Finally, assuming that '<EOF>' is the token used to represent
the end of the input stream, the last line ensures that no garbage follows
a legal stmt.
Note: If you use Perl for compiler writing there is
Parse::RecDescent: A versatile recursive descent
Perl module.
I've recently being trying to teach myself
how parsers (for languages/context-free
grammars) work, and most of it seems to
be making sense, except for one thing. I'm
focusing my attention in particular on
LL(k) grammars, for which the two
main algorithms seem to be the
LL parser (using stack/parse
table) and the
Recursive Descent parser
(simply using recursion). As far as I
can see, the recursive descent algorithm
works on all LL(k) grammars and possibly
more, whereas an LL parser works on all
LL(k) grammars. A recursive descent parser
is clearly much simpler than an LL parser
to implement, however (just as an LL one
is simply than an LR one).
So my question is, what are the advantages/problems
one might encounter when using either of
the algorithms? Why might one ever pick
LL over recursive descent, given that it
works on the same set of grammars and is
trickier to implement?
Hopefully this question makes some amount
of sense. Sorry if it doesn't - I blame
my the fact that this entire subject is
almost entirely new to me.
|
|
|
asked
Jun 25 '09 at 15:32
91% accept rate
|
|
| |
|
|
|
LL is usually a more efficient parsing technique
than recursive-descent. In fact, a naive
recursive-descent parser will actually be
O(k^n) (where n is the
input size) in the worst case. Some techniques
such as memoization (which yields a
Packrat parser) can improve this as
well as extend the class of grammars accepted
by the parser, but there is always a space
tradeoff. LL parsers are (to my knowledge)
always linear time. On the flip side,
you are correct in your intuition that recursive-descent
parsers can handle a greater class of grammars
than LL. Recursive-descent can handle any
grammar which is LL(*) (that is, unlimited
lookahead) as well as a small set of ambiguous
grammars. This is because recursive-descent
is actually a directly-encoded implementation
of PEGs, or
Parser Expression Grammar(s). Specifically,
the disjunctive operator (a | b)
is not commutative, meaning that a
| b does not equal b | a.
A recursive-descent parser will try each
alternative in order. So if a
matches the input, it will succede even
if b would have matched
the input. This allows classic "longest
match" ambiguities like the dangling
else problem to be handled
simply by ordering disjunctions correctly.
With all of that said, it is possible
to implement an LL(k) parser using recursive-descent
so that it runs in linear time. This is
done by essentially inlining the predict
sets so that each parse routine determines
the appropriate production for a given input
in constant time. Unfortunately, such a
technique eliminates an entire class of
grammars from being handled. Once we get
into predictive parsing, problems like dangling
else are no longer solvable
with such ease.
As for why LL would be chosen over recursive-descent,
it's mainly a question of efficiency and
maintainability. Recursive-descent parsers
are markedly easier to implement, but they're
usually harder to maintain since the grammar
they represent does not exist in any declarative
form. Most non-trivial parser use-cases
employ a parser generator such as ANTLR
or Bison. With such tools, it really doesn't
matter if the algorithm is directly-encoded
recursive-descent or table-driven LL(k).
As a matter of interest, it is also worth
looking into
recursive-ascent, which is a parsing
algorithm directly encoded after the fashion
of recursive-descent, but capable of handling
any LALR grammar. I would also dig into
parser combinators, which are a functional
way of composing recursive-descent parsers
together.
|
|
answered
Jun 25 '09 at 15:45
|
|
| |
|
Creating a Recursive-Descent Parser
A Grammar, G, is a structure <N,T,P,S> where N is a set
of non-terminals, T is a set of terminals, P is a set of productions,
and S is a special non-terminal called the start symbol of the grammar.
Grammars are used to formally specify the syntax of a language. For
example, consider the language of calculator expressions where we can
add, subtract, multiply, and divide. We can also store a value in a
single memory location and recall the value in the memory location.
For example, (5S+4)* R EOF is a sentence in this grammar. When
evaluated, this expression would yield the value 45. A grammar that
specifies the syntax of this language is:
G=<N,T,P,Prog> where N={Prog,Expr,Term,Storable,Factor},
T={number,+,-,*,/,S,R,(,),EOF} the start symbol is Prog,
and the set of productions is given by these rules
Prog -> Expr EOF
Expr -> Expr + Term | Expr - Term | Term
Term -> Term * Storable | Term / Storable
| Storable
Storable -> Factor S | Factor
Factor -> number | R | ( Expr )
Using this grammar we can check to see that the example expression
given above is valid by constructing a derivation of the sentence. A
derivation starts with the start symbol and replace one non-terminal
at each step to generate the sentence. For instance, a derivation that
generates the example above is:
Prog => Expr EOF =>Term EOF =>Term * Storable
EOF => Storable * Storable EOF =>Factor * Storable EOF => ( Expr ) *
Storable =>EOF ( Expr + Term ) * Storable EOF => ( Term + Term ) * Storable
EOF => ( Storable + Term ) * Storable EOF => ( Factor S + Term ) * Storable
EOF => ( 5 S + Term ) * Storable EOF => ( 5 S + Storable ) * Storable
EOF => ( 5 S + Factor ) * Storable EOF => ( 5 S + Factor ) * Storable
EOF => ( 5 S + 4 ) * Storable EOF => ( 5 S + 4 ) * Factor EOF => ( 5
S + 4 ) * R EOF
This derivation proves that (5S+4)*R EOF is a syntactically
valid expression in the language of the grammar. The grammar presented
above is what is called an LALR(1) grammar. This kind of grammar can
be used by a program called a parser generator that given the grammar,
will automatically generate a program called a parser. A parser is a
program that given a sentence in its language, will construct a derivation
of that sentence to check it for syntactic correctness. The derivation
can also be expressed in tree form, called a parse tree. There may be
many different derivations for a sentence in a language, but only one
parse tree (if the grammar is unambiguous). Parse trees are outside
the scope of this document. Take CS67 if you want to learn about parse
trees.
Although parsers can be generated by parser generators, it is still
sometimes convenient to write a parser by hand. However, LALR(1) grammars
are not easy to use to manually construct parsers. Instead, we want
an LL(1) grammar if we are going to manually construct a parser. An
LL(1) grammar can be used to construct a top-down or recursive
descent parser where an LALR(1) grammar is typically used to
construct a bottom-up parser. A top-down parser constructs (or at least
traverses) the parse tree starting at the root of the tree and proceeding
downward. A bottom-up parser constructs or traverses the parse tree
in a bottom-up fashion. Again, for more details, take CS67.
In a recursive descent parser, each non-terminal in
the grammar becomes a function in a program. The right hand side of
the productions become the bodies of the functions. An LALR(1) grammar
is not appropriate for constructing a recursive descent
parser. To create a recursive-descent parser (the topic
of this page) we must convert the LALR(1) grammar above to an LL(1)
grammar. Typically, there are two steps involved.
- Eliminate left recursion
- Perform left factorization where appropriate
Eliminate Left Recursion
Eliminating left recursion means eliminating rules
like Expr ::= Expr + Term. Rules like this are left
recursive because the Expr function would first call the
Expr function in a recursive descent parser. Without
a base case first, we are stuck in infinite recursion (a bad thing).
To eliminate left recursion we look to see what Expr can
be rewritten as. In this case, Expr can be only be replaced by
a Term so we replace Expr with Term in the productions.
The usual way to eliminate left recursion
is to introduce a new non-terminal to handle all but the first part
of the production. So we get
Expr -> Term RestExpr
RestExpr -> + Term RestExpr
| - Term RestExpr | <null>
We must also eliminate left recursion in the Term Term * Factor
| Term / Factor productions in the same way. We end up with an
LL(1) grammar that looks like this:
Prog -> Expr EOF
Expr -> Term RestExpr
RestExpr -> + Term RestExpr
| - Term RestExpr | <null>
Term -> Storable RestTerm
RestTerm -> * Storable RestTerm
| / Storable RestTerm | <null>
Storable -> Factor S | Factor
Factor -> number | R | ( Expr
)
Perform Left Factorization
Left factorization isn't needed on this grammar so this step is skipped.
Left factorization is needed when the first part of two or more productions
is the same and the rest of the similar productions are different. Left
factorization is important in languages like Prolog because without
it the parser is inefficient. However, it isn't needed and does not
improve efficiency when writing a parser in an imperative language like
Java, for instance.
Translating the LL(1) Grammar to Java
Typically, a parser returns an abstract syntax tree (or expression tree)
representing the expression or sentence being parsed. An abstract syntax
is defined by a grammar that is likely ambiguous. The ambiguity is not
a problem for abstract syntax since this grammar will not be used for
parsing.
Once you have an LL(1) grammar you use it to build a parser as follows.
The following construction causes the parser to return an abstract syntax
tree or expression tree for the sentence being parsed.
- Construct a function for each non-terminal. Each of these functions
should return a node in the abstract syntax tree.
- Depending on your grammar, some non-terminal functions may require
an input parameter of an abstract syntax tree (ast) to be able to
complete a partial ast that is recognized by the non-terminal function.
- Each non-terminal function should call a function to get the
next token as needed. If the next token is not needed, the non-terminal
function should call another function to put back the token. Your
parser (if it is based on an LL(1) grammar, should never have to
get or put back more than one token at a time.
- The body of each non-terminal function should be be a series
of if statements that choose which production right-handside to
expand depending on the value of the next token.
The construction above is very simple, but can be confusing without
an example. Consider the LL(1) grammar given above.
Assume that you have a binary tree class in Java called BTNode that
can hold an Object and that operators in the expression are Characters
and numbers in the expression are Doubles. Also, assume there are two
functions that get the next Token in the input (getToken) and push the
last token back on the input (pushBackToken). Then, we can write the
first part of the parser as follows:
private BTNode Prog() {
BTNode tmp = Expr();
if (!getNextToken().toString().equals("$"))
throw new IllegalStateException("Error in expression");
return tmp;
}
private BTNode Expr() {
return RestExpr(Term());
}
private BTNode RestExpr(BTNode E1) {
Object on = getNextToken();
if (on.toString().equals("+") || on.toString().equals("-"))
return RestExpr(new BTNode(on,E1,Term()));
pushBackToken();
return E1;
}
The code given here throws an exception if an error is discovered during
parsing. You should of course take appropriate action during
error conditions. The BTNode constructor for the tree above takes a
value for the new node and references to the left and right subtrees.
Calling the function Prog() returns a pointer to an abstract syntax
tree representing the expression.
In this post, I present the start of a recursive descent
parser that will parse the simple grammar that I presented previously
in this series. I’ll point out some key features of the code so
that it is easy to see how the code works.
This post is one in a series on using LINQ to write a recursive-descent
parser for SpreadsheetML formulas.
-
Writing a Recursive Descent Parser using C# and LINQ
-
Recursive Descent Parser using LINQ: The Augmented
Backus-Naur Form Grammar
-
Recursive Descent Parser: A Simple Grammar
-
Creating a Collection from Singletons and Collections using LINQ
-
Building a Simple Recursive Descent Parser
-
Building a Simple Recursive Descent Parser (continued)
-
Building a Simple Recursive Descent Parser (Completed
Simple Parser)
The Symbol Class
We need to write a class for each symbol (production) in the grammar.
Most of these classes have a method named Produce that given a collection
of Symbol objects, produces an instance of that symbol. Each Produce
method in the various classes is free to return null if the collection
that is passed to the method can’t produce that symbol.
We need to define an abstract base class for each of these classes.
Most instances of classes that derive from the abstract base class will
contain a list of constituent symbols, so the abstract base class includes
a public property, ConstituentSymbols, of type List<Symbol>.
The ToString method for each of these classes returns the string
representation of the symbol. If a symbol is comprised of a list
of constituent symbols (i.e. is not a terminal), then the ToString method
returns the concatenated strings of the constituent symbols.
There is a constructor for Symbol that takes a params array of Object.
Each element in the params array can either be an instance of Symbol,
or an object that implements IEnumerable<Symbol>. Those are the
only valid types in the params array. The constructor throws an
internal ParserException if anything other than one of those types is
passed as an argument. I explained this idiom in the previous
post in this series,
Creating a Collection from Singletons and Collections using LINQ.
This method also uses the StringConcatenate extension method, which
I discussed in
this topic in the LINQ to XML documentation.
public abstract class Symbol
{
public List<Symbol>
ConstituentSymbols { get; set; }
public override string
ToString() {
string s = ConstituentSymbols.Select(ct
=> ct.ToString()).StringConcatenate();
return s;
}
public Symbol(params
Object[] symbols)
{
List<Symbol>
ls = new List<Symbol>();
foreach (var
item in symbols)
{
if (item is Symbol)
ls.Add((Symbol)item);
else if (item is IEnumerable<Symbol>)
foreach (var item2 in (IEnumerable<Symbol>)item)
ls.Add(item2);
else
// If this error is thrown, the parser is coded incorrectly.
throw new ParserException("Internal
error");
}
ConstituentSymbols = ls;
}
public Symbol() { }
}
The OpenParenthesis Class
There are two varieties of symbols – terminal, and non-terminal.
I discussed both types of symbols in the post on the
Augmented Backus-Naur Form Grammar. A subclass of the Symbol
class for terminal symbols is super simple. It contains an override
of the ToString method, which returns the string of the terminal.
Due to the semantics of C#, it also needs to include a constructor with
no parameters:
public class OpenParenthesis
: Symbol
{
public override string
ToString() { return "("; }
public OpenParenthesis() { }
}
All of the terminals except for DecimalDigit look like the OpenParenthesis
class. DecimalDigit is similar, except that its constructor takes
a character, and the ToString method returns that character. To
make the code as succinct as possible, digits are stored as strings
instead of characters.
public class DecimalDigit
: Symbol
{
private string CharacterValue;
public override string
ToString() { return CharacterValue; }
public DecimalDigit(char c)
{ CharacterValue = c.ToString(); }
}
The Formula Class
A derived class for non-terminal symbols is also pretty simple.
There is one public method, Produce, which takes a collection of Symbol
objects, and then produces an instance of the class if it can; otherwise
it returns null.
Recursive descent parsers are ‘top-down’ parsers, so
it makes sense to define the Formula class first. If you examine
the
simple grammar that I defined, you can see that a formula is comprised
of an expression, so the Formula.Produce method simply passes on the
collection of Symbol objects to the Expression.Produce method.
Again, due to the semantics of C#, it is necessary to define the constructor
that takes a params array of Object, but it doesn’t need to do anything
other than call the constructor in the base class.
class Formula : Symbol
{
public static Formula
Produce(IEnumerable<Symbol> symbols)
{
// formula = expression
Expression e =
Expression.Produce(symbols);
return e ==
null ? null : new Formula(e);
}
public Formula(params
Object[] symbols) : base(symbols) { }
}
The Expression Class
The Expression class is more interesting. The grammar production
for Expression is as follows:
expression = *whitespace nospace-expression *whitespace
This means that an expression consists of zero or more WhiteSpace
symbols, followed by one and only one NospaceExpression, followed by
zero or more WhiteSpace symbols. Using LINQ, it is super easy
to count the WhiteSpace symbols at the beginning and end of the collection
of symbols. The Expression.Produce method can then pass the symbols
in the middle (between the white space) to NospaceExpression.Produce.
If that method returns a NospaceExpression object, then the method can
instantiate an Expression object and return it. The Expression.Process
method makes use of the
SkipLast extension method. The Expression class looks like
this:
public class Expression
: Symbol
{
public static Expression
Produce(IEnumerable<Symbol> symbols)
{
// expression = *whitespace
nospace-expression *whitespace
int whiteSpaceBefore
= symbols.TakeWhile(s => s is WhiteSpace).Count();
int whiteSpaceAfter
= symbols.Reverse().TakeWhile(s => s is WhiteSpace).Count();
IEnumerable<Symbol>
noSpaceSymbolList = symbols
.Skip(whiteSpaceBefore)
.SkipLast(whiteSpaceAfter)
.ToList();
NospaceExpression
n = NospaceExpression.Produce(noSpaceSymbolList);
if (n != null)
return new Expression(
Enumerable.Repeat(new WhiteSpace(),
whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(),
whiteSpaceAfter));
return null;
}
public Expression (params Object[] symbols) : base(symbols) { }
}
I follow the same pattern for every class – the grammar is included
in a comment, followed by a bit of LINQ code to produce an instance
of the class, returning null if necessary.
Let’s look at the code to instantiate the Expression object:
return new Expression(
Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));
The arguments in yellow are collections of Symbol. The parameter
n, which is of type NospaceExpression, is a singleton, so we’re taking
advantage of the ability to
pass either singletons or collections to the Expression constructor.
The NospaceExpression Class
To keep this first example as simple as possible, we’ll implement
a dummy NospaceExpression class. We’ll pretend that any collection
of symbols is a valid NospaceExpression symbol. In the next post,
I’ll examine how we need to code the NospaceExpression class, as well
as classes for some of the other symbols.
public class NospaceExpression
: Symbol
{
public static NospaceExpression
Produce(IEnumerable<Symbol> symbols)
{
return new NospaceExpression(symbols);
}
public NospaceExpression(params Object[] symbols) : base(symbols) { }
}
Projecting a Collection of Symbols from a String
One aspect of the approach that I took is that I first projected
a collection of terminal Symbol objects from the collection of characters
that make up the string being parsed. The string class implements
IEnumerable<char> so we can do a simple select, as follows:
IEnumerable<Symbol> symbols = s.Select(c
=>
{
switch (c)
{
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
return new DecimalDigit(c);
case ' ':
return new WhiteSpace();
case '+':
return new Plus();
case '-':
return new Minus();
case '*':
return new Asterisk();
case '/':
return new ForwardSlash();
case '^':
return new Caret();
case '.':
return new FullStop();
case '(':
return new OpenParenthesis();
case ')':
return new CloseParenthesis();
default:
return (Symbol)null;
}
});
If we parse the formula " (1+3) ", and dump out the terminal
symbols, we see this:
Terminal Symbols
================
WhiteSpace > <
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <
By first transforming the string into a collection of terminals,
it allows us to write more expressive code. It is easier to read
this code:
If (t is WhiteSpace)
It is a bit harder to read this:
If (t is Character && t.ToString() == " " )
The DumpSymbolRecursive Method
As you can see, Symbol objects form a hierarchical tree – each symbol
(except for terminals) has constituent symbols. This allows us
to write a recursive method to dump symbols to a StringBuilder
object.
public static void DumpSymbolRecursive(StringBuilder
sb, Symbol symbol, int depth)
{
sb.Append(string.Format("{0}{1} >{2}<",
"".PadRight(depth
* 2),
symbol.GetType().Name.ToString(),
symbol.ToString())).Append(Environment.NewLine);
if (symbol.ConstituentSymbols != null)
foreach (var
childSymbol in symbol.ConstituentSymbols)
DumpSymbolRecursive(sb,
childSymbol, depth + 1);
}
If we parse the formula “ (1+3) ” and dump it, we see:
Formula > (1+3) <
Expression > (1+3) <
WhiteSpace > <
NospaceExpression >(1+3)<
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <
We can see the instances of the Formula, Expression, and NospaceExpression
classes, as well as the various terminals that make up the expression.
Complete Listing
Here is the complete listing of the first version of the SimpleFormulaParser
example. You can simply cut and paste this code into Visual Studio,
and run it. It doesn’t have any dependencies. It transforms
a formula to a collection of terminals, prints the terminals, and then
uses the Formula, Expression, and NospaceExpression classes to produce
a parse tree (which is incomplete because NospaceExpression is a dummy
implementation).
In the next post in this series, I’ll continue to develop this simple
parser.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace SimpleParser
{
public abstract class Symbol
{
public List<Symbol>
ConstituentSymbols { get; set; }
public override string ToString() {
string s = ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
return s;
}
public Symbol(params Object[] symbols)
{
List<Symbol> ls = new List<Symbol>();
foreach (var item in symbols)
{
if (item is Symbol)
ls.Add((Symbol)item);
else if (item is IEnumerable<Symbol>)
foreach (var item2 in (IEnumerable<Symbol>)item)
ls.Add(item2);
else
// If this error is thrown, the parser is coded incorrectly.
throw new ParserException("Internal
error");
}
ConstituentSymbols
= ls;
}
public Symbol()
{ }
}
public class Formula
: Symbol
{
public static Formula Produce(IEnumerable<Symbol>
symbols)
{
// formula = expression
Expression e = Expression.Produce(symbols);
return e == null ? null : new Formula(e);
}
public Formula(params Object[] symbols) : base(symbols) { }
}
public class Expression
: Symbol
{
public static Expression Produce(IEnumerable<Symbol>
symbols)
{
// expression = *whitespace nospace-expression *whitespace
int whiteSpaceBefore = symbols.TakeWhile(s => s is WhiteSpace).Count();
int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s
is WhiteSpace).Count();
IEnumerable<Symbol> noSpaceSymbolList = symbols
.Skip(whiteSpaceBefore)
.SkipLast(whiteSpaceAfter)
.ToList();
NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
if (n != null)
return new Expression(
Enumerable.Repeat(new WhiteSpace(),
whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(),
whiteSpaceAfter));
return null;
}
public Expression
(params Object[] symbols) : base(symbols)
{ }
}
public class NospaceExpression
: Symbol
{
public static NospaceExpression Produce(IEnumerable<Symbol>
symbols)
{
return new NospaceExpression(symbols);
}
public NospaceExpression(params Object[] symbols) : base(symbols) { }
}
public class DecimalDigit
: Symbol
{
private string
CharacterValue;
public override string ToString() { return CharacterValue;
}
public DecimalDigit(char
c) { CharacterValue = c.ToString(); }
}
public class WhiteSpace
: Symbol
{
public override string ToString() { return " ";
}
public WhiteSpace()
{ }
}
public class Plus
: Symbol
{
public override string ToString() { return "+";
}
public Plus() {
}
}
public class Minus
: Symbol
{
public override string ToString() { return "-";
}
public Minus() {
}
}
public class Asterisk
: Symbol
{
public override string ToString() { return "*";
}
public Asterisk()
{ }
}
public class ForwardSlash
: Symbol
{
public override string ToString() { return "/";
}
public ForwardSlash()
{ }
}
public class Caret
: Symbol
{
public override string ToString() { return "^";
}
public Caret() {
}
}
public class FullStop
: Symbol
{
public override string ToString() { return ".";
}
public FullStop()
{ }
}
public class OpenParenthesis
: Symbol
{
public override string ToString() { return "(";
}
public OpenParenthesis()
{ }
}
public class CloseParenthesis
: Symbol
{
public override string ToString() { return ")";
}
public CloseParenthesis()
{ }
}
public class SimpleFormulaParser
{
public static Symbol ParseFormula(string s)
{
IEnumerable<Symbol> symbols = s.Select(c =>
{
switch (c)
{
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
return new DecimalDigit(c);
case ' ':
return new WhiteSpace();
case '+':
return new Plus();
case '-':
return new Minus();
case '*':
return new Asterisk();
case '/':
return new ForwardSlash();
case '^':
return new Caret();
case '.':
return new FullStop();
case '(':
return new OpenParenthesis();
case ')':
return new CloseParenthesis();
default:
return (Symbol)null;
}
});
#if true
if (symbols.Any())
{
Console.WriteLine("Terminal Symbols");
Console.WriteLine("================");
foreach (var terminal in symbols)
Console.WriteLine("{0} >{1}<", terminal.GetType().Name.ToString(),
terminal.ToString());
Console.WriteLine();
}
#endif
Formula formula = Formula.Produce(symbols);
if (formula == null)
throw new ParserException("Invalid
formula");
return formula;
}
public static void DumpSymbolRecursive(StringBuilder sb,
Symbol symbol, int depth)
{
sb.Append(string.Format("{0}{1}
>{2}<",
"".PadRight(depth * 2),
symbol.GetType().Name.ToString(),
symbol.ToString())).Append(Environment.NewLine);
if (symbol.ConstituentSymbols != null)
foreach (var childSymbol in symbol.ConstituentSymbols)
DumpSymbolRecursive(sb, childSymbol, depth + 1);
}
}
class Program
{
static void
Main(string[] args)
{
StringBuilder sb = new StringBuilder();
Symbol f = SimpleFormulaParser.ParseFormula("
(1+3) ");
SimpleFormulaParser.DumpSymbolRecursive(sb, f, 0);
Console.WriteLine(sb.ToString());
}
}
public class ParserException
: Exception
{
public ParserException(string
message) : base(message) { }
}
public static class MyExtensions
{
public static IEnumerable<T> SkipLast<T>(this IEnumerable<T>
source,
int count)
{
Queue<T> saveList = new Queue<T>();
int saved = 0;
foreach (T item in source)
{
if (saved < count)
{
saveList.Enqueue(item);
++saved;
continue;
}
saveList.Enqueue(item);
yield return saveList.Dequeue();
}
yield break;
}
public static string StringConcatenate(this IEnumerable<string>
source)
{
StringBuilder sb = new StringBuilder();
foreach (string s in source)
sb.Append(s);
return sb.ToString();
}
public static string StringConcatenate<T>(
this IEnumerable<T> source,
Func<T, string> projectionFunc)
{
return source.Aggregate(new StringBuilder(),
(s, i) => s.Append(projectionFunc(i)),
s => s.ToString());
}
}
}
To learn how recursive descent parsers work, it is helpful to implement
a very simple grammar, so for pedagogical purposes, I’ve defined a grammar
for simple arithmetic expressions. The parser will construct a syntax
tree from expressions that we can then examine as necessary. Just for
fun, after implementing the parser, we will write a small method that
will evaluate the formulas.
This post is one in a series on using LINQ to write a recursive-descent
parser for SpreadsheetML formulas.
-
Writing a Recursive Descent Parser using C# and LINQ
-
Recursive Descent Parser using LINQ: The Augmented Backus-Naur Form
Grammar
-
Recursive Descent Parser: A Simple Grammar
-
Creating a Collection from Singletons and Collections using LINQ
-
Building a Simple Recursive Descent Parser
-
Building a Simple Recursive Descent Parser (continued)
-
Building a Simple Recursive Descent Parser (Completed Simple Parser)
In these expressions, operands are floating point numbers, but for
simplicity, I’ve eliminated the ability to have exponents. Floating
point numbers are the only variety of operand; to demonstrate how to
write the parser, it’s not necessary to include the idea of variables
or other types of operands. When writing the parser for Excel formulas,
we’ll need to deal with both exponents and other types of operands,
as well as many more varieties of issues.
Wikipedia
Simple Top-Down
Parsing in Python
YARD (Yet Another Recursive Descent
Parser) by Christopher Diggins
The Spirit recursive
descent parser compiler
lecture06
Recursive Descent Parsing
Recursive descent parsing
CS 3721 -- Recursive Descent Parsing
Javanotes 5.1.2, Section 9.5 -- A Simple Recursive Descent Parser
3.2.1 Recursive Descent Parsing
Building a Simple Recursive Descent Parser - Eric White's Blog ...
Most popular humor pages:
Manifest of the Softpanorama IT Slacker Society
: Ten Commandments of the IT Slackers Society
: Computer Humor Collection : BSD Logo
Story : The Cuckoo's Egg : C++ Humor
:
ARE YOU A BBS ADDICT? : Object oriented programmers of all nations :
C Humor :
Financial Humor :
Financial Humor Bulletin, 2008 :
Financial Humor Bulletin, 2010 :
Richard Stallman Related Humor : Admin Humor
: Perl-related Humor :
Linus Torvalds Related humor :
PseudoScience Related Humor : Networking Humor :
Shell Humor:
Financial Humor Bulletin, 2011 :
Financial Humor Bulletin, 2012 :
Financial Humor Bulletin, 2013 :
Java Humor : Software Engineering Humor :
Sun Solaris Related Humor : The Most Comprehensive Collection of
Editor-related Humor : Microsoft plans to buy Catholic Church :
Education Humor : IBM Humor :
Assembler-related Humor : VIM Humor
Computer Viruses Humor : Bright tomorrow is
rescheduled to a day after tomorrow : Classic Computer Humor :
Best Russian Programmer Humor :
Russian Musical Humor :
The Perl Purity Test :
Politically Incorrect Humor : GPL-related
Humor : OFM Humor : IDS Humor :
Real Programmers Humor : Scripting
Humor : Web Humor : Programming Language
Humor :
Goldman Sachs related humor :
Greenspan
humor :
The Last
but not Least
Copyright © 1996-2013 by Dr. Nikolai Bezroukov. www.softpanorama.org
was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document
is an industrial compilation designed and created exclusively for educational use and is distributed under the
Softpanorama Content License. Site uses AdSense so you need to be aware of Google privacy policy.
Original materials copyright belong to respective owners. Quotes are made for educational purposes
only in compliance with the fair use doctrine. This is a Spartan WHYFF (We Help You For Free) site written
by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain
some broken links as it develops like a living tree...
Disclaimer:
The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor
do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the
author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.
Last modified:
January 15, 2013