
Recursive Descent Parsing


This is just a brief introduction to recursive descent parsing; a good compiler textbook provides substantially more detail. The problem we need to solve is as follows: given a grammar in BNF, write a parser for it. One of the simplest, and at the same time most robust, approaches is recursive descent, which proceeds as described below.

This is the most "programmer-friendly" method of syntax analysis, and if you design a language it makes sense to apply some preliminary transformations to the grammar so that this parsing strategy can be used. It integrates extremely well with the lexical analyzer, whether the scanner is called one token at a time or runs as a separate pass.

A lot can be done at the lexical analysis level. Pascal was the first language deliberately limited to a grammar that can be parsed using recursive descent. While in this area it arguably went overboard, it was a pretty successful language, so there is no real ground to suspect that limiting yourself to a recursive descent parser limits the attractiveness of your language.

We know that all strings in the language must be derived from the start nonterminal S. Hence if we can write a procedure S() which matches all strings derived from S, then we are done. But how do we do this?

Let's generalize the above question somewhat. Instead of merely posing the problem of writing a procedure which matches strings derived from the specific start nonterminal S, let's look at the more general problem of writing a procedure which matches strings derived from any nonterminal N; i.e., for each nonterminal N, we would like to construct a parsing procedure N() which matches all strings derived from N.

If a nonterminal N has only a single grammar rule, then its parsing procedure is easy to write. To match the rule we know that we need to match each grammar symbol on its right-hand side (RHS). There are two possibilities:

  1. If the RHS symbol is a terminal symbol, check that the current lookahead matches that terminal symbol. If it does, then advance the input, setting the current lookahead to the next input symbol. If it does not, signal an error.
  2. For each occurrence of a nonterminal symbol on the RHS, call the corresponding parsing procedure. If the parsing procedure works as posited above, then it will match a prefix of the input with a rule for the nonterminal.
If a nonterminal N has multiple grammar rules, then the parsing procedure will need to decide which rule to use. It can do so by looking at the current lookahead token to see which of the candidate rules can start with the lookahead. If only a single rule can start with the lookahead, then that rule is chosen. If it is always possible to predict a unique rule for any nonterminal by looking ahead by at most one token, then the grammar is said to be LL(1).

Hence a parsing function N() for nonterminal N will simply contain logic to pick a rule based on the current lookahead, followed by code to match each individual rule. Its structure will look something like the following:

	N() {
	  if (lookahead can start first rule for N) {
	    match rule 1 for N
	  }
	  else if (lookahead can start second rule for N) {
	    match rule 2 for N
	  }
	  ...
	  else if (lookahead can start n'th rule for N) {
	    match rule n for N
	  }
	  else {
	    error();
	  }
	}

Unfortunately, there are some problems with this simple scheme. A grammar rule is said to be directly left recursive if the first symbol on the RHS is identical to the LHS nonterminal. Hence, after a directly left recursive rule has been selected, the first action of the corresponding parsing procedure will be to call itself immediately, without consuming any of the input. It should be clear that such a recursive call will never terminate. Hence a recursive descent parser cannot be written for a grammar which contains such directly (or indirectly) left recursive rules; in fact, the grammar cannot be LL(1) in the presence of such rules.

Fortunately, it is possible to transform the grammar to remove left recursive rules. Consider the left recursive nonterminal A defined by the following rules:

	A: A alpha
	A: beta
where alpha is nonempty and alpha and beta stand for any sequence of terminal and nonterminal grammar symbols. Looking at the above rules, it is clear that any string derived from A must ultimately start with beta. After the beta, the rest of the string must either be empty or consist of a sequence of one or more alpha's. Using a new nonterminal A' to represent the rest of the string we can represent the transformed grammar as follows:
	A: beta A'
	A': /* empty */
	A': alpha A'
The above rules are not left recursive.

An Example

Consider the following simple grammar for assignment statements with expressions involving addition or subtraction:

1)	stmt:	ID ':=' expr
2)	expr:	expr '+' NUM
3)	expr:	expr '-' NUM
4)	expr:	NUM
This grammar allows statements like a:= 1 + 2 - 3. Since the rules for expr are left recursive, we need to transform the grammar to the following:
1)	stmt:		ID ':=' expr
2)	expr:		NUM exprRest
3)	exprRest:	'+' NUM exprRest
4)	exprRest:	'-' NUM exprRest
5)	exprRest:	/* empty */
Assuming that the current lookahead is in a global variable tok, we can write the parsing procedures in pseudo-C as follows:
stmt() {
  match(ID); match(':='); expr();
}

expr() {
  match(NUM); exprRest();
}

exprRest() {
  if (tok == '+') { /* rule 3 */
    match('+'); match(NUM); exprRest();
  }
  else if (tok == '-') { /* rule 4 */
    match('-'); match(NUM); exprRest();
  } else { 
     /* rule 5 */
  }
}
Note that exprRest() chooses rule 5 if the lookahead cannot match any of the other rules for exprRest. It would have also been correct to choose rule 5 only if the current lookahead was a token which can follow exprRest, but the method used above is simpler with the sole disadvantage that errors are detected later.

The match() procedure merely needs to check if the current lookahead in tok matches its argument and advance the input:

match(Token t) {
  if (tok == t) {
    tok= nextToken();
  } else {
    error();
  }
}
where we assume that nextToken() is the interface to the scanner and returns the next token from the input stream.

All that remains is to make sure we start up and terminate the parser correctly:

parse() {
  tok= nextToken();
  stmt();
  match('<EOF>');
}
The first line merely primes the global variable used for the lookahead token. The second line calls the parsing procedure for stmt. Finally, assuming that '<EOF>' is the token used to represent the end of the input stream, the last line ensures that no garbage follows a legal stmt.

Note: If you use Perl for compiler writing, there is Parse::RecDescent, a versatile recursive descent Perl module.


parsing - Difference between an LL and Recursive Descent parser - Stack Overflow

I've recently been trying to teach myself how parsers (for languages/context-free grammars) work, and most of it seems to be making sense, except for one thing. I'm focusing my attention in particular on LL(k) grammars, for which the two main algorithms seem to be the LL parser (using stack/parse table) and the Recursive Descent parser (simply using recursion).

As far as I can see, the recursive descent algorithm works on all LL(k) grammars and possibly more, whereas an LL parser works on all LL(k) grammars. A recursive descent parser is clearly much simpler than an LL parser to implement, however (just as an LL one is simpler than an LR one).

So my question is, what are the advantages/problems one might encounter when using either of the algorithms? Why might one ever pick LL over recursive descent, given that it works on the same set of grammars and is trickier to implement?

Hopefully this question makes some amount of sense. Sorry if it doesn't - I blame the fact that this entire subject is almost entirely new to me.

asked Jun 25 '09 at 15:32 by Noldorin
1 Answer

LL is usually a more efficient parsing technique than recursive-descent. In fact, a naive recursive-descent parser will actually be O(k^n) (where n is the input size) in the worst case. Some techniques such as memoization (which yields a Packrat parser) can improve this as well as extend the class of grammars accepted by the parser, but there is always a space tradeoff. LL parsers are (to my knowledge) always linear time.

On the flip side, you are correct in your intuition that recursive-descent parsers can handle a greater class of grammars than LL. Recursive-descent can handle any grammar which is LL(*) (that is, unlimited lookahead) as well as a small set of ambiguous grammars. This is because recursive-descent is actually a directly-encoded implementation of PEGs, or Parsing Expression Grammars. Specifically, the disjunctive operator (a | b) is not commutative, meaning that a | b does not equal b | a. A recursive-descent parser will try each alternative in order. So if a matches the input, it will succeed even if b would have matched the input. This allows classic "longest match" ambiguities like the dangling else problem to be handled simply by ordering disjunctions correctly.

With all of that said, it is possible to implement an LL(k) parser using recursive-descent so that it runs in linear time. This is done by essentially inlining the predict sets so that each parse routine determines the appropriate production for a given input in constant time. Unfortunately, such a technique eliminates an entire class of grammars from being handled. Once we get into predictive parsing, problems like dangling else are no longer solvable with such ease.

As for why LL would be chosen over recursive-descent, it's mainly a question of efficiency and maintainability. Recursive-descent parsers are markedly easier to implement, but they're usually harder to maintain since the grammar they represent does not exist in any declarative form. Most non-trivial parser use-cases employ a parser generator such as ANTLR or Bison. With such tools, it really doesn't matter if the algorithm is directly-encoded recursive-descent or table-driven LL(k).

As a matter of interest, it is also worth looking into recursive-ascent, which is a parsing algorithm directly encoded after the fashion of recursive-descent, but capable of handling any LALR grammar. I would also dig into parser combinators, which are a functional way of composing recursive-descent parsers together.

answered Jun 25 '09 at 15:45 by Daniel Spiewak

Very much the response I was hoping for! :) Thanks for all the info there (including last bit, of which I wasn't even aware). It will probably take a bit more reading before I understand all the concepts you've presented in this answer, but you've certainly answered my question and pointed me in the right direction for further study. The main thing I'm fuzzy about now is how PEGs relate to recursive descent parsers, and also how exactly a parser combinator combines various parsers. If you could clarify either/both of these, I would be very grateful. – Noldorin Jun 25 '09 at 16:04
Also, just to confirm; is "inlining the predict sets" effectively predictive parsing? In this case, what precisely is the "entire class of grammars"? – Noldorin Jun 26 '09 at 0:20
A PEG is the formal-ish description of a recursive-descent parser. Since recursive-descent isn't actually LL parsing, a different model was needed. You can kind of think of them like LALR and Bison, one being the formalism, the other being the implementation. – Daniel Spiewak Jun 26 '09 at 17:44
Predict sets: it really depends on the strategy used. If you exclusively rely on those predict sets and perform no backtracking, then the class of grammars is precisely LL(k), where k is the amount of lookahead used to compute the predict sets. However, you can get a lot of the benefits of predictive parsing by inlining predict sets in R-D without completely eliminating backtracking. This allows parsers which accept all grammars normally handled by R-D, but with a faster average case. Unfortunately, the worst-case for such parsers is still exponential. – Daniel Spiewak Jun 26 '09 at 17:46
Most recursive-descent parsers (even hand-written ones) will use inlined predict sets to limit the alternates, restricting backtracking without constraining flexibility. The end result is a parser which is nearly linear-time for everything but the most pathological grammars, and which still accepts the entire class of PEGs. – Daniel Spiewak

Creating a Recursive-Descent Parser

A Grammar, G, is a structure <N,T,P,S> where N is a set of non-terminals, T is a set of terminals, P is a set of productions, and S is a special non-terminal called the start symbol of the grammar. Grammars are used to formally specify the syntax of a language. For example, consider the language of calculator expressions where we can add, subtract, multiply, and divide. We can also store a value in a single memory location and recall the value in the memory location. For example, (5S+4)* R EOF is a sentence in this grammar. When evaluated, this expression would yield the value 45. A grammar that specifies the syntax of this language is:

G=<N,T,P,Prog> where N={Prog,Expr,Term,Storable,Factor}, T={number,+,-,*,/,S,R,(,),EOF} the start symbol is Prog, and the set of productions is given by these rules

Prog -> Expr EOF

Expr -> Expr + Term | Expr - Term | Term

Term -> Term * Storable | Term / Storable | Storable

Storable -> Factor S | Factor

Factor -> number | R | ( Expr )

Using this grammar we can check to see that the example expression given above is valid by constructing a derivation of the sentence. A derivation starts with the start symbol and replaces one non-terminal at each step to generate the sentence. For instance, a derivation that generates the example above is:

Prog => Expr EOF => Term EOF => Term * Storable EOF => Storable * Storable EOF => Factor * Storable EOF => ( Expr ) * Storable EOF => ( Expr + Term ) * Storable EOF => ( Term + Term ) * Storable EOF => ( Storable + Term ) * Storable EOF => ( Factor S + Term ) * Storable EOF => ( 5 S + Term ) * Storable EOF => ( 5 S + Storable ) * Storable EOF => ( 5 S + Factor ) * Storable EOF => ( 5 S + 4 ) * Storable EOF => ( 5 S + 4 ) * Factor EOF => ( 5 S + 4 ) * R EOF

This derivation proves that (5S+4)*R EOF is a syntactically valid expression in the language of the grammar. The grammar presented above is what is called an LALR(1) grammar. This kind of grammar can be used by a program called a parser generator that, given the grammar, will automatically generate a program called a parser. A parser is a program that, given a sentence in its language, will construct a derivation of that sentence to check it for syntactic correctness. The derivation can also be expressed in tree form, called a parse tree. There may be many different derivations for a sentence in a language, but only one parse tree (if the grammar is unambiguous). Parse trees are outside the scope of this document. Take CS67 if you want to learn about parse trees.

Although parsers can be generated by parser generators, it is still sometimes convenient to write a parser by hand. However, LALR(1) grammars are not easy to use to manually construct parsers. Instead, we want an LL(1) grammar if we are going to manually construct a parser. An LL(1) grammar can be used to construct a top-down or recursive descent parser where an LALR(1) grammar is typically used to construct a bottom-up parser. A top-down parser constructs (or at least traverses) the parse tree starting at the root of the tree and proceeding downward. A bottom-up parser constructs or traverses the parse tree in a bottom-up fashion. Again, for more details, take CS67.

In a recursive descent parser, each non-terminal in the grammar becomes a function in a program. The right hand side of the productions become the bodies of the functions. An LALR(1) grammar is not appropriate for constructing a recursive descent parser. To create a recursive-descent parser (the topic of this page) we must convert the LALR(1) grammar above to an LL(1) grammar. Typically, there are two steps involved.

Eliminate Left Recursion

Eliminating left recursion means eliminating rules like Expr -> Expr + Term. Rules like this are left recursive because the Expr function would first call the Expr function in a recursive descent parser. Without a base case first, we are stuck in infinite recursion (a bad thing). To eliminate left recursion we look to see what Expr can be rewritten as. In this case, Expr can only be replaced by a Term, so we replace Expr with Term in the productions. The usual way to eliminate left recursion is to introduce a new non-terminal to handle all but the first part of the production. So we get

Expr -> Term RestExpr

RestExpr -> + Term RestExpr | - Term RestExpr | <null>

We must also eliminate left recursion in the Term -> Term * Storable | Term / Storable productions in the same way. We end up with an LL(1) grammar that looks like this:

Prog -> Expr EOF

Expr -> Term RestExpr

RestExpr -> + Term RestExpr | - Term RestExpr | <null>

Term -> Storable RestTerm

RestTerm -> * Storable RestTerm | / Storable RestTerm | <null>

Storable -> Factor S | Factor

Factor -> number | R | ( Expr )

Perform Left Factorization

Left factorization isn't needed on this grammar so this step is skipped. Left factorization is needed when the first part of two or more productions is the same and the rest of the similar productions are different. Left factorization is important in languages like Prolog because without it the parser is inefficient. However, it isn't needed and does not improve efficiency when writing a parser in an imperative language like Java, for instance.
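For reference, here is a generic (hypothetical) instance of left factorization, in the same grammar notation used above. When two productions share a common prefix, as in

A -> a b | a c

the prefix is factored out by introducing a new non-terminal:

A -> a A'

A' -> b | c

After seeing a, the parser commits to A and uses the next token to decide between b and c, so one token of lookahead again suffices.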

Translating the LL(1) Grammar to Java

Typically, a parser returns an abstract syntax tree (or expression tree) representing the expression or sentence being parsed. An abstract syntax is defined by a grammar that is likely ambiguous. The ambiguity is not a problem for abstract syntax since this grammar will not be used for parsing.

Once you have an LL(1) grammar you use it to build a parser as follows. The following construction causes the parser to return an abstract syntax tree or expression tree for the sentence being parsed.

  1. Construct a function for each non-terminal. Each of these functions should return a node in the abstract syntax tree.
  2. Depending on your grammar, some non-terminal functions may require an input parameter of an abstract syntax tree (ast) to be able to complete a partial ast that is recognized by the non-terminal function.
  3. Each non-terminal function should call a function to get the next token as needed. If the next token is not needed, the non-terminal function should call another function to put back the token. Your parser (if it is based on an LL(1) grammar) should never have to get or put back more than one token at a time.
  4. The body of each non-terminal function should be a series of if statements that choose which production right-hand side to expand depending on the value of the next token.
The construction above is very simple, but can be confusing without an example. Consider the LL(1) grammar given above.

Assume that you have a binary tree class in Java called BTNode that can hold an Object and that operators in the expression are Characters and numbers in the expression are Doubles. Also, assume there are two functions that get the next token in the input (getNextToken) and push the last token back on the input (pushBackToken). Then, we can write the first part of the parser as follows:

     private BTNode Prog() {
       BTNode tmp = Expr();
    
       if (!getNextToken().toString().equals("$")) 
          throw new IllegalStateException("Error in expression");
    
       return tmp;
     }
    
     private BTNode Expr() {        
       return RestExpr(Term());
     }
        
     private BTNode RestExpr(BTNode E1) {
       Object on = getNextToken();
    
       if (on.toString().equals("+") || on.toString().equals("-"))
         return RestExpr(new BTNode(on,E1,Term()));
            
       pushBackToken();
            
       return E1;
     }
The code given here throws an exception if an error is discovered during parsing. You should of course take appropriate action during error conditions. The BTNode constructor for the tree above takes a value for the new node and references to the left and right subtrees. Calling the function Prog() returns a pointer to an abstract syntax tree representing the expression.

Building a Simple Recursive Descent Parser - Eric White's Blog ...

In this post, I present the start of a recursive descent parser that will parse the simple grammar that I presented previously in this series. I'll point out some key features of the code so that it is easy to see how the code works.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.

  1. Writing a Recursive Descent Parser using C# and LINQ
  2. Recursive Descent Parser using LINQ: The Augmented Backus-Naur Form Grammar
  3. Recursive Descent Parser: A Simple Grammar
  4. Creating a Collection from Singletons and Collections using LINQ
  5. Building a Simple Recursive Descent Parser
  6. Building a Simple Recursive Descent Parser (continued)
  7. Building a Simple Recursive Descent Parser (Completed Simple Parser)

The Symbol Class

We need to write a class for each symbol (production) in the grammar. Most of these classes have a method named Produce that given a collection of Symbol objects, produces an instance of that symbol. Each Produce method in the various classes is free to return null if the collection that is passed to the method can't produce that symbol.

We need to define an abstract base class from which each of these classes derives. Most instances of classes that derive from the abstract base class will contain a list of constituent symbols, so the abstract base class includes a public property, ConstituentSymbols, of type List<Symbol>.

The ToString method for each of these classes returns the string representation of the symbol. If a symbol is comprised of a list of constituent symbols (i.e. is not a terminal), then the ToString method returns the concatenated strings of the constituent symbols.

There is a constructor for Symbol that takes a params array of Object. Each element in the params array can either be an instance of Symbol, or an object that implements IEnumerable<Symbol>. Those are the only valid types in the params array. The constructor throws an internal ParserException if anything other than one of those types is passed as an argument. I explained this idiom in the previous post in this series, Creating a Collection from Singletons and Collections using LINQ. This method also uses the StringConcatenate extension method, which I discussed in this topic in the LINQ to XML documentation.

public abstract class Symbol
{
public List<Symbol> ConstituentSymbols { get; set; }
public override string ToString() {
string s = ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
return s;
}
public Symbol(params Object[] symbols)
{
List<Symbol> ls = new List<Symbol>();
foreach (var item in symbols)
{
if (item is Symbol)
ls.Add((Symbol)item);
else if (item is IEnumerable<Symbol>)
foreach (var item2 in (IEnumerable<Symbol>)item)
ls.Add(item2);
else
// If this error is thrown, the parser is coded incorrectly.
throw new ParserException("Internal error");
}
ConstituentSymbols = ls;
}
public Symbol() { }
}

The OpenParenthesis Class

There are two varieties of symbols – terminal, and non-terminal. I discussed both types of symbols in the post on the Augmented Backus-Naur Form Grammar. A subclass of the Symbol class for terminal symbols is super simple. It contains an override of the ToString method, which returns the string of the terminal. Due to the semantics of C#, it also needs to include a constructor with no parameters:

public class OpenParenthesis : Symbol
{
public override string ToString() { return "("; }
public OpenParenthesis() { }
}

All of the terminals except for DecimalDigit look like the OpenParenthesis class. DecimalDigit is similar, except that its constructor takes a character, and the ToString method returns that character. To make the code as succinct as possible, digits are stored as strings instead of characters.

public class DecimalDigit : Symbol
{
private string CharacterValue;
public override string ToString() { return CharacterValue; }
public DecimalDigit(char c) { CharacterValue = c.ToString(); }
}

The Formula Class

A derived class for non-terminal symbols is also pretty simple. There is one public method, Produce, which takes a collection of Symbol objects, and then produces an instance of the class if it can; otherwise it returns null.

Recursive descent parsers are 'top-down' parsers, so it makes sense to define the Formula class first. If you examine the simple grammar that I defined, you can see that a formula is comprised of an expression, so the Formula.Produce method simply passes on the collection of Symbol objects to the Expression.Produce method. Again, due to the semantics of C#, it is necessary to define the constructor that takes a params array of Object, but it doesn't need to do anything other than call the constructor in the base class.

class Formula : Symbol
{
public static Formula Produce(IEnumerable<Symbol> symbols)
{
// formula = expression

Expression e = Expression.Produce(symbols);
return e == null ? null : new Formula(e);
}
public Formula(params Object[] symbols) : base(symbols) { }
}

The Expression Class

The Expression class is more interesting. The grammar production for Expression is as follows:

expression = *whitespace nospace-expression *whitespace

This means that an expression consists of zero or more WhiteSpace symbols, followed by one and only one NospaceExpression, followed by zero or more WhiteSpace symbols. Using LINQ, it is super easy to count the WhiteSpace symbols at the beginning and end of the collection of symbols. The Expression.Produce method can then pass the symbols in the middle (between the white space) to NospaceExpression.Produce. If that method returns a NospaceExpression object, then the method can instantiate an Expression object and return it. The Expression.Produce method makes use of the SkipLast extension method. The Expression class looks like this:

public class Expression : Symbol
{
public static Expression Produce(IEnumerable<Symbol> symbols)
{
// expression = *whitespace nospace-expression *whitespace

int whiteSpaceBefore = symbols.TakeWhile(s => s is WhiteSpace).Count();
int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s is WhiteSpace).Count();
IEnumerable<Symbol> noSpaceSymbolList = symbols
.Skip(whiteSpaceBefore)
.SkipLast(whiteSpaceAfter)
.ToList();
NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
if (n != null)
return new Expression(
Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));
return null;
}

public Expression (params Object[] symbols) : base(symbols) { }
}

I follow the same pattern for every class – the grammar is included in a comment, followed by a bit of LINQ code to produce an instance of the class, returning null if necessary.

Let's look at the code to instantiate the Expression object:

return new Expression(
Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
n,
Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));

The first and third arguments (the Enumerable.Repeat calls) are collections of Symbol. The parameter n, which is of type NospaceExpression, is a singleton, so we're taking advantage of the ability to pass either singletons or collections to the Expression constructor.

The NospaceExpression Class

To keep this first example as simple as possible, we'll implement a dummy NospaceExpression class. We'll pretend that any collection of symbols is a valid NospaceExpression symbol. In the next post, I'll examine how we need to code the NospaceExpression class, as well as classes for some of the other symbols.

public class NospaceExpression : Symbol
{
public static NospaceExpression Produce(IEnumerable<Symbol> symbols)
{
return new NospaceExpression(symbols);
}

public NospaceExpression(params Object[] symbols) : base(symbols) { }
}

Projecting a Collection of Symbols from a String

One aspect of the approach that I took is that I first projected a collection of terminal Symbol objects from the collection of characters that make up the string being parsed. The string class implements IEnumerable<char> so we can do a simple select, as follows:

IEnumerable<Symbol> symbols = s.Select(c =>
{
switch (c)
{
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
return new DecimalDigit(c);
case ' ':
return new WhiteSpace();
case '+':
return new Plus();
case '-':
return new Minus();
case '*':
return new Asterisk();
case '/':
return new ForwardSlash();
case '^':
return new Caret();
case '.':
return new FullStop();
case '(':
return new OpenParenthesis();
case ')':
return new CloseParenthesis();
default:
return (Symbol)null;
}
});

If we parse the formula " (1+3) ", and dump out the terminal symbols, we see this:

Terminal Symbols
================
WhiteSpace > <
OpenParenthesis >(<
DecimalDigit >1<
Plus >+<
DecimalDigit >3<
CloseParenthesis >)<
WhiteSpace > <
WhiteSpace > <

By first transforming the string into a collection of terminals, it allows us to write more expressive code. It is easier to read this code:

If (t is WhiteSpace)

It is a bit harder to read this:

If (t is Character && t.ToString() == " " )

The DumpSymbolRecursive Method

As you can see, Symbol objects form a hierarchical tree – each symbol (except for terminals) has constituent symbols. This allows us to write a recursive method to dump symbols to a StringBuilder object.

public static void DumpSymbolRecursive(StringBuilder sb, Symbol symbol, int depth)
{
sb.Append(string.Format("{0}{1} >{2}<",
"".PadRight(depth * 2),
symbol.GetType().Name.ToString(),
symbol.ToString())).Append(Environment.NewLine);
if (symbol.ConstituentSymbols != null)
foreach (var childSymbol in symbol.ConstituentSymbols)
DumpSymbolRecursive(sb, childSymbol, depth + 1);
}

If we parse the formula " (1+3) " and dump it, we see:

Formula > (1+3)  <
  Expression > (1+3)  <
    WhiteSpace > <
    NospaceExpression >(1+3)<
      OpenParenthesis >(<
      DecimalDigit >1<
      Plus >+<
      DecimalDigit >3<
      CloseParenthesis >)<
    WhiteSpace > <
    WhiteSpace > <

We can see the instances of the Formula, Expression, and NospaceExpression classes, as well as the various terminals that make up the expression.

Complete Listing

Here is the complete listing of the first version of the SimpleFormulaParser example. You can simply cut and paste this code into Visual Studio, and run it. It doesn't have any dependencies. It transforms a formula to a collection of terminals, prints the terminals, and then uses the Formula, Expression, and NospaceExpression classes to produce a parse tree (which is incomplete because NospaceExpression is a dummy implementation).

In the next post in this series, I'll continue to develop this simple parser.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SimpleParser
{
    public abstract class Symbol
    {
        public List<Symbol> ConstituentSymbols { get; set; }

        public override string ToString()
        {
            return ConstituentSymbols.Select(ct => ct.ToString()).StringConcatenate();
        }

        public Symbol(params Object[] symbols)
        {
            List<Symbol> ls = new List<Symbol>();
            foreach (var item in symbols)
            {
                if (item is Symbol)
                    ls.Add((Symbol)item);
                else if (item is IEnumerable<Symbol>)
                    foreach (var item2 in (IEnumerable<Symbol>)item)
                        ls.Add(item2);
                else
                    // If this is thrown, the parser itself is coded incorrectly.
                    throw new ParserException("Internal error");
            }
            ConstituentSymbols = ls;
        }

        public Symbol() { }
    }

    public class Formula : Symbol
    {
        public static Formula Produce(IEnumerable<Symbol> symbols)
        {
            // formula = expression
            Expression e = Expression.Produce(symbols);
            return e == null ? null : new Formula(e);
        }

        public Formula(params Object[] symbols) : base(symbols) { }
    }

    public class Expression : Symbol
    {
        public static Expression Produce(IEnumerable<Symbol> symbols)
        {
            // expression = *whitespace nospace-expression *whitespace
            int whiteSpaceBefore = symbols.TakeWhile(s => s is WhiteSpace).Count();
            int whiteSpaceAfter = symbols.Reverse().TakeWhile(s => s is WhiteSpace).Count();
            IEnumerable<Symbol> noSpaceSymbolList = symbols
                .Skip(whiteSpaceBefore)
                .SkipLast(whiteSpaceAfter)
                .ToList();
            NospaceExpression n = NospaceExpression.Produce(noSpaceSymbolList);
            if (n != null)
                return new Expression(
                    Enumerable.Repeat(new WhiteSpace(), whiteSpaceBefore),
                    n,
                    Enumerable.Repeat(new WhiteSpace(), whiteSpaceAfter));
            return null;
        }

        public Expression(params Object[] symbols) : base(symbols) { }
    }

    public class NospaceExpression : Symbol
    {
        public static NospaceExpression Produce(IEnumerable<Symbol> symbols)
        {
            // Dummy implementation for now; later posts flesh this out.
            return new NospaceExpression(symbols);
        }

        public NospaceExpression(params Object[] symbols) : base(symbols) { }
    }

    public class DecimalDigit : Symbol
    {
        private string CharacterValue;
        public override string ToString() { return CharacterValue; }
        public DecimalDigit(char c) { CharacterValue = c.ToString(); }
    }

    public class WhiteSpace : Symbol
    {
        public override string ToString() { return " "; }
        public WhiteSpace() { }
    }

    public class Plus : Symbol
    {
        public override string ToString() { return "+"; }
        public Plus() { }
    }

    public class Minus : Symbol
    {
        public override string ToString() { return "-"; }
        public Minus() { }
    }

    public class Asterisk : Symbol
    {
        public override string ToString() { return "*"; }
        public Asterisk() { }
    }

    public class ForwardSlash : Symbol
    {
        public override string ToString() { return "/"; }
        public ForwardSlash() { }
    }

    public class Caret : Symbol
    {
        public override string ToString() { return "^"; }
        public Caret() { }
    }

    public class FullStop : Symbol
    {
        public override string ToString() { return "."; }
        public FullStop() { }
    }

    public class OpenParenthesis : Symbol
    {
        public override string ToString() { return "("; }
        public OpenParenthesis() { }
    }

    public class CloseParenthesis : Symbol
    {
        public override string ToString() { return ")"; }
        public CloseParenthesis() { }
    }

    public class SimpleFormulaParser
    {
        public static Symbol ParseFormula(string s)
        {
            // Materialize the terminals once with ToList; otherwise deferred
            // execution would re-run this lexing Select on every enumeration.
            IEnumerable<Symbol> symbols = s.Select(c =>
            {
                switch (c)
                {
                    case '0':
                    case '1':
                    case '2':
                    case '3':
                    case '4':
                    case '5':
                    case '6':
                    case '7':
                    case '8':
                    case '9':
                        return new DecimalDigit(c);
                    case ' ':
                        return new WhiteSpace();
                    case '+':
                        return new Plus();
                    case '-':
                        return new Minus();
                    case '*':
                        return new Asterisk();
                    case '/':
                        return new ForwardSlash();
                    case '^':
                        return new Caret();
                    case '.':
                        return new FullStop();
                    case '(':
                        return new OpenParenthesis();
                    case ')':
                        return new CloseParenthesis();
                    default:
                        // The (Symbol) cast gives the lambda its common return
                        // type; unrecognized characters yield null terminals.
                        return (Symbol)null;
                }
            }).ToList();
#if true // Toggle to suppress the terminal dump.
            if (symbols.Any())
            {
                Console.WriteLine("Terminal Symbols");
                Console.WriteLine("================");
                foreach (var terminal in symbols)
                    Console.WriteLine("{0} >{1}<", terminal.GetType().Name,
                        terminal.ToString());
                Console.WriteLine();
            }
#endif
            Formula formula = Formula.Produce(symbols);
            if (formula == null)
                throw new ParserException("Invalid formula");
            return formula;
        }

        public static void DumpSymbolRecursive(StringBuilder sb, Symbol symbol, int depth)
        {
            sb.Append(string.Format("{0}{1} >{2}<",
                "".PadRight(depth * 2),
                symbol.GetType().Name,
                symbol.ToString())).Append(Environment.NewLine);
            if (symbol.ConstituentSymbols != null)
                foreach (var childSymbol in symbol.ConstituentSymbols)
                    DumpSymbolRecursive(sb, childSymbol, depth + 1);
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder sb = new StringBuilder();
            Symbol f = SimpleFormulaParser.ParseFormula(" (1+3) ");
            SimpleFormulaParser.DumpSymbolRecursive(sb, f, 0);
            Console.WriteLine(sb.ToString());
        }
    }

    public class ParserException : Exception
    {
        public ParserException(string message) : base(message) { }
    }

    public static class MyExtensions
    {
        // Like Enumerable.Skip, but skips from the end, using a queue as
        // a sliding buffer so the source is enumerated only once.
        public static IEnumerable<T> SkipLast<T>(this IEnumerable<T> source,
            int count)
        {
            Queue<T> saveList = new Queue<T>();
            int saved = 0;
            foreach (T item in source)
            {
                if (saved < count)
                {
                    saveList.Enqueue(item);
                    ++saved;
                    continue;
                }
                saveList.Enqueue(item);
                yield return saveList.Dequeue();
            }
        }

        public static string StringConcatenate(this IEnumerable<string> source)
        {
            StringBuilder sb = new StringBuilder();
            foreach (string s in source)
                sb.Append(s);
            return sb.ToString();
        }

        public static string StringConcatenate<T>(
            this IEnumerable<T> source,
            Func<T, string> projectionFunc)
        {
            return source.Aggregate(new StringBuilder(),
                (s, i) => s.Append(projectionFunc(i)),
                s => s.ToString());
        }
    }
}

Recursive Descent Parser: A Simple Grammar (Eric White's Blog, MSDN Blogs)

To learn how recursive descent parsers work, it is helpful to implement a very simple grammar, so for pedagogical purposes I've defined a grammar for simple arithmetic expressions. The parser will construct a syntax tree from expressions, which we can then examine as necessary. Just for fun, after implementing the parser, we will write a small method that evaluates the formulas.

This post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.

  1. Writing a Recursive Descent Parser using C# and LINQ
  2. Recursive Descent Parser using LINQ: The Augmented Backus-Naur Form Grammar
  3. Recursive Descent Parser: A Simple Grammar
  4. Creating a Collection from Singletons and Collections using LINQ
  5. Building a Simple Recursive Descent Parser
  6. Building a Simple Recursive Descent Parser (continued)
  7. Building a Simple Recursive Descent Parser (Completed Simple Parser)

In these expressions, operands are floating-point numbers, though for simplicity I've eliminated exponents. Floating-point numbers are the only kind of operand; to demonstrate how to write the parser, we don't need variables or other operand types. When writing the parser for Excel formulas, we'll need to deal with exponents and other operand types, as well as many other issues.
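To preview the promised evaluation step, here is a hedged sketch of mine (not Eric White's code): a tiny recursive-descent evaluator with one method per nonterminal, working directly on the string for the reduced grammar `expr = term *( "+" term )`, `term = digit / "(" expr ")"`. The later posts do this over the parse tree instead.

```csharp
using System;

class Evaluator
{
    readonly string s;
    int pos;
    Evaluator(string input) { s = input; }

    static int Evaluate(string input) => new Evaluator(input).Expr();

    // expr = term *( "+" term )
    int Expr()
    {
        int value = Term();
        while (pos < s.Length && s[pos] == '+')
        {
            pos++;                 // consume '+'
            value += Term();
        }
        return value;
    }

    // term = digit / "(" expr ")"
    int Term()
    {
        if (s[pos] == '(')
        {
            pos++;                 // consume '('
            int value = Expr();
            pos++;                 // consume ')'
            return value;
        }
        return s[pos++] - '0';     // single decimal digit
    }

    static void Main()
    {
        Console.WriteLine(Evaluate("(1+3)+2"));  // 6
    }
}
```

Note the shape: each nonterminal becomes a method that consumes exactly the input it derives, which is the pattern the full parser in this series follows.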

Recommended Links

Wikipedia

Simple Top-Down Parsing in Python

YARD (Yet Another Recursive Descent Parser) by Christopher Diggins

The Spirit recursive descent parser compiler

lecture06

Recursive Descent Parsing

Recursive descent parsing

CS 3721 -- Recursive Descent Parsing

Javanotes 5.1.2, Section 9.5 -- A Simple Recursive Descent Parser

3.2.1 Recursive Descent Parsing

Building a Simple Recursive Descent Parser - Eric White's Blog ...





Copyright © 1996-2021 by Softpanorama Society. www.softpanorama.org was initially created as a service to the (now defunct) UN Sustainable Development Networking Programme (SDNP) without any remuneration. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.


Last modified: March 12, 2019