Try my new book!

Algorithms, circle mazes, hex grids, masking, weaving, braiding, 3D and 4D grids, spheres, and more!

The Buckblog

assorted ramblings by Jamis Buck

Jul 2015

Writing a Simple Recursive Descent Parser

30 July 2015 — A simple implementation of a field-based query string, with binary operations, using a recursive descent parser — 6-minute read

Someone asked a question recently on the local ruby list. They were looking for an implementation of a parser that would handle keywords and field specifications, like this:

country:(ru OR cn OR "South Korea")
state:(texas OR ok)
company:"ACME Products"
last:smith AND first:john

Now, you have to understand that the compiler class I took in college (almost 20 years ago?!) was one of my favorites. It completely blew my mind. I still have (and love) the Dragon Book by Aho, et. al., and every once in a while–for old time’s sake–I take it off the shelf, thumb through it, and wax nostalgic. Ten-plus years ago I even invented and implemented some simple programming languages…but my career, as a whole, has been remarkably void of opportunities to implement parsers.

So, when this query came along, I perked up. A parser? Hmmm!

There are lots of different ways to implement these, but I decided to go with a recursive descent parser. These have always been my favorite, and–frankly–I couldn’t remember off the top of my head how to do any of the others.

So, first, I took the informal specification that was given by the OP, and converted it into Backus-Naur Form (BNF). (Technically, I guess I used an EBNF–Extended Backus-Naur Form–but this was just for my own use anyway.)

expr := term
      | term AND expr
      | term OR expr
term := value
      | atom ':' value
atom := word+
      | quoted_string
value := atom
       | '(' expr ')'

Caveat: I intentionally avoided operator precedence, so in this implementation AND and OR have equivalent precedence. Also, the string parsing is very naive, for simplisity’s (and demonstration’s) sake.

Honestly, converting the description into a BNF is usually the hardest part, but once you’ve got that, the rest of the parser flows very naturally. Each of the left-hand sides of those BNF definitions becomes a method, which recursively calls the appropriate methods corresponding to the items on the right-hand side. (Thus, the “recursive” in “recursive descent”.)

The idea here is that the parse process will return an AST–an abstract syntax tree–which represents the input. To support that tree, I defined two simple structures: one for an expression with an operator, and one for a field specification:

class Expression < Struct.new(:op, :left, :right)
end

class Field < Struct.new(:name, :value)
end

Then, I jumped right in at the top and wrote the #parse method. It accepts a single argument, the input to parse (as a string). I used Ruby’s StringScanner class to do the lexical analysis (scanning), because there’s rarely a reason not to, really. StringScanner is awesome!

def parse(string)
  scanner = StringScanner.new(string)
  parse_expr(scanner)
end

Recall the BNF from before. The expr element is the top-level, so my #parse method calls that. The implementation for #parse_expr is nice and simple:

def parse_expr(scanner)
  left = parse_term(scanner)

  scanner.skip(/\s+/)
  op = scanner.scan(/AND|OR/i)

  if op
    Expression.new(op.downcase.to_sym, left, parse_expr(scanner))
  else
    left
  end
end

Our expr element (in the BNF) may be a term, or a term followed by an operator. Since it always starts with a term, we first call the corresponding #parse_term method. Then, we skip any whitespace, and look for an operator. If we find one, we instantiate a new Expression, give it the operator, the left operand, and parse the right operatand (as an expr–note the recursion!). Otherwise, we simply return the operand we parsed at the start.

Easy!

Next, let’s look at how #parse_term is defined:

def parse_term(scanner)
  scanner.skip(/\s+/)

  value = parse_value(scanner)
  return value if value.is_a?(Expression)

  if scanner.skip(/:/)
    Field.new(value, parse_value(scanner))
  else
    value
  end
end

We start by skipping (or “eating”, as its called) any whitespace. Then we look for a value. Look at the BNF again: see how a value may be either an atom, or a parenthesized expr? Comparing that with the term definition, we can see that a term may start with either an atom, or an expr. This means we can call parse_value, and if the result is an Expression, then we’re done and we just return it. Otherwise, we need to consider the case where we’ve got a field specification.

To do that, we check the next character. If it’s a colon, we instantiate a new Field and return it, parsing a new value in the process. Otherwise, we just return the value we already parsed.

So, what about parse_value, then? Surely it’ll be a beast? I mean, it can’t all be this easy, right?

Ha ha! You’re hilarious. Check this out.

def parse_value(scanner)
  start = scanner.pos
  if scanner.skip(/\(/)
    parse_expr(scanner).tap do
      scanner.scan(/\)/) or
        raise "expression not terminated (start at #{start})"
    end
  else
    parse_atom(scanner)
  end
end

We do have our first instance of error handling, here. We save the current position in the scanner, and then look for an opening parenthesis. If we find one, we parse (and return) an expr, and then eat a closing parenthesis. If no closing parenthesis is found, though, that’s an error! We raise an exception, telling where in the string the expression began.

If, on the other hand, there was no opening parenthesis to begin with, the value must be an atom, and we parse that instead.

Two more methods to go! The atom parser is really straight-forward:

def parse_atom(scanner)
  scanner.scan(/\w+/) ||
    parse_quoted_string(scanner) ||
    raise("expected an atom at #{scanner.pos}")
end

An atom is either a word (/\w+/) or a quoted string. If it is neither of those, we raise an exception and show where the error occurred.

Last method, then: parsing quoted strings.

def parse_quoted_string(scanner)
  start = scanner.pos
  delim = scanner.scan(/['"]/)
  if delim
    scanner.scan(/[^#{delim}]*/).tap do
      scanner.scan(/#{delim}/) or
        raise "quoted string not terminated (start at #{start})"
    end
  end
end

We save the position (for error reporting), and then look to see what kind of quotation marks we’re dealing with. We then scan all characters up to (but not including) the next instance of that quotation mark, and return them, making sure to eat the closing quotation mark in the process. If there was no closing quotation mark, we report that error.

And that’s it! Seriously. We can now parse queries like this:

# simple words
parse "compilers"
#-> "compilers"

# field specifications
parse "subject:compilers"
#-> Field.new("subject", "compilers")

# boolean operations
parse "subject:compilers or author:Aho"
#-> Expression.new(:or,
#      Field.new("subject", "compilers"),
#      Field.new("author", "Aho"))

# boolean operations within field specifications
parse "subject:(compilers OR parsers)"
#-> Field.new("subject",
#      Expression.new(:or, "compilers", "parsers")

Recursive descent parsers are so elegant! There is just something about how naturally they mimic the grammar…and how clearly the recursion describes the relationship between the different elements of the syntax… It’s not ideal for every grammar, but for simple cases like this, I really, really dig it.

If parsers have been intimidating to you in the past, hopefully this has shown you how straightforward they can be. In fact, they can be quite fun!

Here’s a gist of my complete implementation, even with a few specs (intended more as examples than actual tests). Enjoy!

tar.gz in Ruby

23 July 2015 — A method is described for reading and writing tar and gzip files, using only the Ruby standard library — 5-minute read

Mazes for Programmers

8 July 2015 — The announcement of a completed project with expressions of relief and disbelief, and a brief animation to celebrate the occassion. The author looks forward to other projects to come. — 1-minute read
May 2015

Experimenting with L-Systems

7 May 2015 — An overview of L-system fractals, with a simple implementation in Ruby. An argument is given in favor of exploration, experimentation, and play. — 6-minute read
Mar 2015

Playing with Constants, Methods, and Superclasses

24 March 2015 — A few curious Rubyisms of dubious use, which may yet be worth knowing about — 3-minute read

Task Tracking for Neurochemical Brains

17 March 2015 — 4-minute read

Feb 2015

Mazes for Programmers: Beta!

4 February 2015 — 2-minute read

Jan 2015

Lessons from the Kitchen

30 January 2015 — 5-minute read

Hanging Out a Shingle

26 January 2015 — 1-minute read

Getting Back in the Pool

20 January 2015 — 2-minute read

A Better Recursive Division Algorithm

15 January 2015 — 5-minute read

Winding Back Up

13 January 2015 — 3-minute read

Sep 2011

Winding down...

1 September 2011 — 1-minute read

Jun 2011

Sharing the Inheritance Hierarchy

7 June 2011 — 3-minute read

Mar 2011

Maze Generation: More weave mazes

17 March 2011 — 8-minute read

Maze Generation: Weave mazes

4 March 2011 — 11-minute read

Feb 2011

Weave Mazes: Your Take?

28 February 2011 — 1-minute read

Programming Language Survey Results

22 February 2011 — 3-minute read

Kaleidoscope

19 February 2011 — 2-minute read

Mazes in CoffeeScript

9 February 2011 — 2-minute read

Maze Generation: Algorithm Recap

7 February 2011 — 5-minute read

Maze Generation: Sidewinder algorithm

3 February 2011 — 12-minute read

Maze Generation: Binary Tree algorithm

1 February 2011 — 7-minute read

Jan 2011

Maze Generation: Growing Tree algorithm

27 January 2011 — 9-minute read

Maze Generation: Hunt-and-Kill algorithm

24 January 2011 — 15-minute read

Maze Generation: Wilson's algorithm

20 January 2011 — 16-minute read

Maze Generation: Aldous-Broder algorithm

17 January 2011 — 11-minute read

Maze Generation: Recursive Division

12 January 2011 — 11-minute read

Maze Generation: Prim's Algorithm

10 January 2011 — 10-minute read

Maze Generation: Kruskal's Algorithm

3 January 2011 — 7-minute read

Way Back

The Buckblog Archives

Dating to 2004 — Hundreds more articles