# The Buckblog

## Writing a Simple Recursive Descent Parser

30 July 2015 — A simple implementation of a field-based query string, with binary operations, using a recursive descent parser — 5-minute read

Someone asked a question recently on the local Ruby list. They were looking for an implementation of a parser that would handle keywords and field specifications, like this:

Now, you have to understand that the compiler class I took in college (almost 20 years ago?!) was one of my favorites. It completely blew my mind. I still have (and love) the Dragon Book by Aho et al., and every once in a while, for old time’s sake, I take it off the shelf, thumb through it, and wax nostalgic. Ten-plus years ago I even invented and implemented some simple programming languages…but my career, as a whole, has been remarkably devoid of opportunities to implement parsers.

So, when this query came along, I perked up. A parser? Hmmm!

There are lots of different ways to implement these, but I decided to go with a recursive descent parser. These have always been my favorite, and–frankly–I couldn’t remember off the top of my head how to do any of the others.

So, first, I took the informal specification that was given by the OP, and converted it into Backus-Naur Form (BNF). (Technically, I guess I used an EBNF–Extended Backus-Naur Form–but this was just for my own use anyway.)

Caveat: I intentionally avoided operator precedence, so in this implementation `AND` and `OR` have equivalent precedence. Also, the string parsing is very naive, for simplicity’s (and demonstration’s) sake.

Honestly, converting the description into a BNF is usually the hardest part, but once you’ve got that, the rest of the parser flows very naturally. Each of the left-hand sides of those BNF definitions becomes a method, which recursively calls the appropriate methods corresponding to the items on the right-hand side. (Thus, the “recursive” in “recursive descent”.)
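To make that mapping concrete, here is a rough skeleton. The grammar in the comment is my reconstruction from the prose in this post, not the original listing, and the class name `ParserSkeleton` is made up for illustration. The method bodies are placeholders.

```ruby
# Grammar reconstructed from the prose (an assumption, not the original BNF):
#
#   expr  := term | term op expr
#   op    := 'AND' | 'OR'
#   term  := value | atom ':' value
#   value := atom | '(' expr ')'
#   atom  := word | quoted_string
#
# One method per left-hand side. Each body is a placeholder for now.
class ParserSkeleton
  def parse_expr
    raise NotImplementedError, "expr: call parse_term, then look for an operator"
  end

  def parse_term
    raise NotImplementedError, "term: call parse_value, then look for a colon"
  end

  def parse_value
    raise NotImplementedError, "value: look for '(' or fall back to parse_atom"
  end

  def parse_atom
    raise NotImplementedError, "atom: scan a word or a quoted string"
  end
end
```

Each comment inside a method hints at which right-hand-side items that method would go on to call.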

The idea here is that the parse process will return an AST–an abstract syntax tree–which represents the input. To support that tree, I defined two simple structures: one for an expression with an operator, and one for a field specification:
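Something along these lines would do. This is a sketch: the member names (`op`, `left`, `right`, `name`, `value`) and the sample query are my assumptions.

```ruby
# Node types for the syntax tree. Member names are assumptions for illustration.
Expression = Struct.new(:op, :left, :right)  # an operator with two operands
Field      = Struct.new(:name, :value)       # a field specification, name:value

# A tree built by hand for a hypothetical query like `author:jamis AND ruby`.
tree = Expression.new("AND", Field.new("author", "jamis"), "ruby")

puts tree.left.name  # the field name of the left operand
puts tree.right      # a bare keyword stays a plain String
```

Plain strings serve as the leaves here, so only the two composite shapes need a type of their own.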

Then, I jumped right in at the top and wrote the `#parse` method. It accepts a single argument, the input to parse (as a string). I used Ruby’s `StringScanner` class to do the lexical analysis (scanning), because there’s rarely a reason not to, really. `StringScanner` is awesome!
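If you haven't used `StringScanner` before, here is a small illustration of the calls that matter for this kind of work. The helper `rough_tokens` is hypothetical and is not part of the parser (which scans as it goes instead of building a token list first). It is only meant to show `scan`, `skip`, `getch`, and `eos?` in action.

```ruby
require 'strscan'

# Hypothetical helper: split input into words and single punctuation characters.
def rough_tokens(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # skip returns the number of characters consumed, or nil if nothing matched
    next if scanner.skip(/\s+/)
    # scan only matches at the current position; getch takes one character
    tokens << (scanner.scan(/\w+/) || scanner.getch)
  end
  tokens
end

p rough_tokens("title: (ruby OR rails)")
# => ["title", ":", "(", "ruby", "OR", "rails", ")"]
```

The key property is that `scan` and `skip` anchor at the scanner's current position and advance it on success, which is exactly what a hand-written parser needs.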

Recall the BNF from before. The `expr` element is the top-level, so my `#parse` method calls that. The implementation for `#parse_expr` is nice and simple:

Our `expr` element (in the BNF) may be a `term`, or a `term` followed by an operator. Since it always starts with a `term`, we first call the corresponding `#parse_term` method. Then, we skip any whitespace, and look for an operator. If we find one, we instantiate a new `Expression`, give it the operator, the left operand, and parse the right operand (as an `expr`; note the recursion!). Otherwise, we simply return the operand we parsed at the start. Because the right side is parsed as a whole `expr`, a chain of operators groups to the right.
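Here is a minimal sketch of that logic. To keep it runnable on its own, `parse_term` is a stand-in that only reads a bare word, and the class name `ExprSketch` is made up. Treat it as an illustration of the shape, not as the original code.

```ruby
require 'strscan'

Expression = Struct.new(:op, :left, :right)

class ExprSketch
  def parse(input)
    @scanner = StringScanner.new(input)
    parse_expr
  end

  private

  # expr := term | term op expr
  def parse_expr
    left = parse_term
    @scanner.skip(/\s+/)
    # \b keeps a word like "ORange" from being read as the operator OR
    if (op = @scanner.scan(/(?:AND|OR)\b/))
      Expression.new(op, left, parse_expr)  # recurse for the right operand
    else
      left
    end
  end

  # Stand-in for the real term rule: a bare word only.
  def parse_term
    @scanner.skip(/\s+/)
    @scanner.scan(/\w+/) or
      raise ArgumentError, "expected a word at position #{@scanner.pos}"
  end
end

p ExprSketch.new.parse("a AND b OR c")
```

With equal precedence and recursion on the right, `a AND b OR c` comes back as `a AND (b OR c)`.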

Easy!

Next, let’s look at how `#parse_term` is defined:

We start by skipping (or “eating”, as it’s called) any whitespace. Then we look for a value. Look at the BNF again: see how a `value` may be either an `atom`, or a parenthesized `expr`? Comparing that with the `term` definition, we can see that a `term` may start with either an `atom`, or a parenthesized `expr`. This means we can call `parse_value`, and if the result is an `Expression`, then we’re done and we just return it. Otherwise, we need to consider the case where we’ve got a field specification.

To do that, we check the next character. If it’s a colon, we instantiate a new `Field` and return it, parsing a new value in the process. Otherwise, we just return the value we already parsed.
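A sketch of that method might look like the following. Again, `parse_value` is a stand-in that only reads a bare word (so the `Expression` branch can never fire in this cut-down version), and `TermSketch` is a made-up name.

```ruby
require 'strscan'

Expression = Struct.new(:op, :left, :right)
Field      = Struct.new(:name, :value)

class TermSketch
  def parse(input)
    @scanner = StringScanner.new(input)
    parse_term
  end

  private

  # term := value | atom ':' value
  def parse_term
    @scanner.skip(/\s+/)
    value = parse_value
    # A parenthesized expression cannot be a field name, so we are done.
    return value if value.is_a?(Expression)

    if @scanner.scan(/:/)
      Field.new(value, parse_value)  # what we parsed was the field name
    else
      value                          # just a bare keyword
    end
  end

  # Stand-in for the real value rule: a bare word only.
  def parse_value
    @scanner.scan(/\w+/) or
      raise ArgumentError, "expected a word at position #{@scanner.pos}"
  end
end

p TermSketch.new.parse("author:jamis")
p TermSketch.new.parse("ruby")
```

Parsing the value first and deciding afterwards whether it was a field name means the parser never has to look ahead more than one character.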

So, what about `parse_value`, then? Surely it’ll be a beast? I mean, it can’t all be this easy, right?

Ha ha! You’re hilarious. Check this out.

We do have our first instance of error handling, here. We save the current position in the scanner, and then look for an opening parenthesis. If we find one, we parse an `expr`, eat the closing parenthesis, and return the expression. If no closing parenthesis is found, though, that’s an error! We raise an exception, telling where in the string the expression began.

If, on the other hand, there was no opening parenthesis to begin with, the value must be an `atom`, and we parse that instead.
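Sketched out, with stand-ins for the other rules so it runs by itself (`parse_expr` here just delegates back to `parse_value`, `parse_atom` reads a bare word, and the exception class and message are my own choices):

```ruby
require 'strscan'

class ValueSketch
  def parse(input)
    @scanner = StringScanner.new(input)
    parse_value
  end

  private

  # value := atom | '(' expr ')'
  def parse_value
    start = @scanner.pos  # remembered for the error message
    if @scanner.scan(/\(/)
      expr = parse_expr
      @scanner.skip(/\s+/)
      @scanner.scan(/\)/) or
        raise ArgumentError, "unclosed parenthesis opened at position #{start}"
      expr
    else
      parse_atom
    end
  end

  # Stand-in for the real expr rule: skip whitespace, then read one value.
  def parse_expr
    @scanner.skip(/\s+/)
    parse_value
  end

  # Stand-in for the real atom rule: a bare word only.
  def parse_atom
    @scanner.scan(/\w+/) or
      raise ArgumentError, "expected a word at position #{@scanner.pos}"
  end
end

p ValueSketch.new.parse("( (ruby) )")
```

Saving the position before consuming the parenthesis lets the error point at where the group opened, which is usually more helpful than where the parser gave up.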

Two more methods to go! The `atom` parser is really straightforward:

An `atom` is either a word (`/\w+/`) or a quoted string. If it is neither of those, we raise an exception and show where the error occurred.

Last method, then: parsing quoted strings.

We save the position (for error reporting), and then look to see what kind of quotation marks we’re dealing with. We then scan all characters up to (but not including) the next instance of that quotation mark, and return them, making sure to eat the closing quotation mark in the process. If there was no closing quotation mark, we report that error.
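These last two rules fit comfortably in one sketch. As before, the class name, the exception class, and the messages are assumptions of mine, and the string handling is deliberately naive (no escape sequences).

```ruby
require 'strscan'

class AtomSketch
  def parse(input)
    @scanner = StringScanner.new(input)
    parse_atom
  end

  private

  # atom := word | quoted_string
  def parse_atom
    @scanner.scan(/\w+/) ||
      parse_string ||
      raise(ArgumentError, "expected a word or quoted string at position #{@scanner.pos}")
  end

  # Returns the text between the quotation marks, or nil when the next
  # character is not a quotation mark at all.
  def parse_string
    start = @scanner.pos
    quote = @scanner.scan(/["']/) or return nil
    body = @scanner.scan(/[^#{quote}]*/)  # everything up to the matching quote
    @scanner.scan(/#{quote}/) or          # eat the closing quote
      raise ArgumentError, "unterminated string starting at position #{start}"
    body
  end
end

p AtomSketch.new.parse(%q{"hello world"})
p AtomSketch.new.parse("'single quotes'")
```

Remembering which quotation mark opened the string means a `'` inside a double-quoted string is treated as ordinary text.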

And that’s it! Seriously. We can now parse queries like this:

Recursive descent parsers are so elegant! There is just something about how naturally they mimic the grammar…and how clearly the recursion describes the relationship between the different elements of the syntax… They’re not ideal for every grammar, but for simple cases like this, I really, really dig them.

If parsers have been intimidating to you in the past, hopefully this has shown you how straightforward they can be. In fact, they can be quite fun!

Here’s a gist of my complete implementation, even with a few specs (intended more as examples than actual tests). Enjoy!