Writing a Simple Recursive Descent Parser
Someone recently asked a question on the local Ruby list. They were looking for a parser that would handle keywords and field specifications, like this:
Now, you have to understand that the compiler class I took in college (almost 20 years ago?!) was one of my favorites. It completely blew my mind. I still have (and love) the Dragon Book by Aho et al., and every once in a while, for old times’ sake, I take it off the shelf, thumb through it, and wax nostalgic. Ten-plus years ago I even invented and implemented some simple programming languages…but my career, as a whole, has been remarkably devoid of opportunities to implement parsers.
So, when this query came along, I perked up. A parser? Hmmm!
There are lots of different ways to implement these, but I decided to go with a recursive descent parser. These have always been my favorite, and–frankly–I couldn’t remember off the top of my head how to do any of the others.
So, first, I took the informal specification that was given by the OP, and converted it into Backus-Naur Form (BNF). (Technically, I guess I used an EBNF–Extended Backus-Naur Form–but this was just for my own use anyway.)
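As a rough reconstruction from the method descriptions later in this post (an approximation, not necessarily the exact notation used), the grammar looks something like this. The rule names word and string are placeholders for "a run of word characters" and "a quoted string":

```
expr  := term
       | term AND expr
       | term OR expr
term  := value
       | atom ':' value
value := atom
       | '(' expr ')'
atom  := word
       | string
```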
Caveat: I intentionally avoided operator precedence, so in this implementation AND and OR have equivalent precedence. Also, the string parsing is very naive, for simplicity’s (and demonstration’s) sake.
Honestly, converting the description into a BNF is usually the hardest part, but once you’ve got that, the rest of the parser flows very naturally. Each of the left-hand sides of those BNF definitions becomes a method, which recursively calls the appropriate methods corresponding to the items on the right-hand side. (Thus, the “recursive” in “recursive descent”.)
The idea here is that the parse process will return an AST–an abstract syntax tree–which represents the input. To support that tree, I defined two simple structures: one for an expression with an operator, and one for a field specification:
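A minimal sketch of what those two structures could look like, using Ruby’s Struct. The member names (op, left, right, name, value) are assumptions chosen for illustration; only the existence of the two node types comes from the description above.

```ruby
# Assumed member names; only the two node types themselves are given.
Expression = Struct.new(:op, :left, :right)
Field      = Struct.new(:name, :value)

# A hand-built tree for a query shaped like: title:dune AND author:herbert
tree = Expression.new(:and,
                      Field.new("title", "dune"),
                      Field.new("author", "herbert"))

puts tree.op          # the operator joining the two operands
puts tree.left.name   # the field name of the left operand
```

Structs compare by value, which makes trees like this easy to check in specs.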
Then, I jumped right in at the top and wrote the #parse method. It accepts a single argument, the input to parse (as a string). I used Ruby’s StringScanner class to do the lexical analysis (scanning), because there’s rarely a reason not to, really. StringScanner is awesome!
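For anyone who hasn’t used it, here is a small illustration of the StringScanner calls a parser like this leans on. The input string is made up for the example.

```ruby
require "strscan"

scanner = StringScanner.new("title:dune AND x")

word  = scanner.scan(/\w+/)   # returns the matched text and advances
colon = scanner.skip(/:/)     # returns the match length (or nil) and advances
value = scanner.scan(/\w+/)
scanner.skip(/\s+/)           # eat whitespace; nil if there was none
miss  = scanner.scan(/\d+/)   # no match here: nil, and the position is unchanged

p word, colon, value, miss
p scanner.pos, scanner.rest, scanner.eos?
```

The key property is that every pattern is matched at the current position only, so the scanner doubles as a cursor into the input.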
Recall the BNF from before. The expr element is the top level, so my #parse method calls that. The implementation for #parse_expr is nice and simple:
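As a hedged sketch rather than the exact code, the two methods might look like this. The class name ExprSketch, the Struct member names, and the words-only #parse_term stand-in are assumptions made so the snippet runs on its own.

```ruby
require "strscan"

Expression = Struct.new(:op, :left, :right)  # assumed member names

class ExprSketch
  def parse(input)
    @scanner = StringScanner.new(input)
    parse_expr
  end

  private

  # expr := term | term AND expr | term OR expr
  def parse_expr
    left = parse_term
    @scanner.skip(/\s+/)
    op = @scanner.scan(/(AND|OR)\b/)
    return left unless op                                 # a lone term
    Expression.new(op.downcase.to_sym, left, parse_expr)  # right operand recurses
  end

  # Stand-in for the real term parser: accepts bare words only.
  def parse_term
    @scanner.skip(/\s+/)
    @scanner.scan(/\w+/) or raise ArgumentError, "expected a word at #{@scanner.pos}"
  end
end

p ExprSketch.new.parse("a AND b OR c")
```

Because the right operand is parsed as a whole expr, "a AND b OR c" nests to the right as a AND (b OR c), which is what equal precedence looks like in practice.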
Our expr element (in the BNF) may be a term, or a term followed by an operator and another expr. Since it always starts with a term, we first call the corresponding #parse_term method. Then, we skip any whitespace, and look for an operator. If we find one, we instantiate a new Expression, give it the operator and the left operand, and parse the right operand (as an expr; note the recursion!). Otherwise, we simply return the operand we parsed at the start.
Easy!
Next, let’s look at how #parse_term is defined:
We start by skipping (or “eating”, as it’s called) any whitespace. Then we look for a value. Look at the BNF again: see how a value may be either an atom, or a parenthesized expr? Comparing that with the term definition, we can see that a term may start with either an atom, or a parenthesized expr. This means we can call #parse_value, and if the result is an Expression, then we’re done and we just return it. Otherwise, we need to consider the case where we’ve got a field specification.
To do that, we check the next character. If it’s a colon, the value we just parsed is the field name, so we parse a second value and return a new Field built from the two. Otherwise, we just return the value we already parsed.
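The value-then-colon trick is easier to see in code. This is a sketch under a few assumptions: the methods take the scanner as an argument so the snippet stands alone, the Struct member names are invented, and #parse_value is a stand-in that accepts bare words only.

```ruby
require "strscan"

Expression = Struct.new(:op, :left, :right)  # assumed member names
Field      = Struct.new(:name, :value)       # assumed member names

# term := value | atom ':' value
def parse_term(scanner)
  scanner.skip(/\s+/)                        # eat leading whitespace
  value = parse_value(scanner)
  return value if value.is_a?(Expression)    # a parenthesized expr is already complete
  return value unless scanner.skip(/:/)      # no colon: a plain value
  Field.new(value, parse_value(scanner))     # colon: the first value was the field name
end

# Stand-in for the real value parser: accepts bare words only.
def parse_value(scanner)
  scanner.scan(/\w+/) or raise ArgumentError, "expected a word at #{scanner.pos}"
end

p parse_term(StringScanner.new("author:sanderson"))
p parse_term(StringScanner.new("  sanderson"))
```

Parsing the value first and deciding afterwards means no lookahead or backtracking is needed to tell a field from a plain value.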
So, what about #parse_value, then? Surely it’ll be a beast? I mean, it can’t all be this easy, right?
Ha ha! You’re hilarious. Check this out.
We do have our first instance of error handling, here. We save the current position in the scanner, and then look for an opening parenthesis. If we find one, we parse an expr, eat the closing parenthesis, and return the expr. If no closing parenthesis is found, though, that’s an error! We raise an exception, telling where in the string the expression began.
If, on the other hand, there was no opening parenthesis to begin with, the value must be an atom, and we parse that instead.
Two more methods to go! The atom parser is really straightforward:
An atom is either a word (/\w+/) or a quoted string. If it is neither of those, we raise an exception and show where the error occurred.
Last method, then: parsing quoted strings.
We save the position (for error reporting), and then look to see what kind of quotation marks we’re dealing with. We then scan all characters up to (but not including) the next instance of that quotation mark, and return them, making sure to eat the closing quotation mark in the process. If there was no closing quotation mark, we report that error.
And that’s it! Seriously. We can now parse queries like this:
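Pulling the methods described above into one place gives something like the sketch below. It is an approximation assembled from the prose, not the gist itself: the class name QueryParser, the ParseError class, the Struct member names, and the sample queries are all assumptions.

```ruby
require "strscan"

Expression = Struct.new(:op, :left, :right)  # assumed member names
Field      = Struct.new(:name, :value)       # assumed member names

class QueryParser
  ParseError = Class.new(StandardError)      # hypothetical error class

  def parse(input)
    @scanner = StringScanner.new(input)
    parse_expr
  end

  private

  # expr := term | term AND expr | term OR expr
  def parse_expr
    left = parse_term
    @scanner.skip(/\s+/)
    op = @scanner.scan(/(AND|OR)\b/)
    return left unless op
    Expression.new(op.downcase.to_sym, left, parse_expr)
  end

  # term := value | atom ':' value
  def parse_term
    @scanner.skip(/\s+/)
    value = parse_value
    return value if value.is_a?(Expression)
    return value unless @scanner.skip(/:/)
    Field.new(value, parse_value)
  end

  # value := atom | '(' expr ')'
  def parse_value
    pos = @scanner.pos
    return parse_atom unless @scanner.skip(/\(/)
    expr = parse_expr
    @scanner.skip(/\s*\)/) or
      raise ParseError, "unclosed parenthesis opened at position #{pos}"
    expr
  end

  # atom := word | quoted string
  def parse_atom
    atom = @scanner.scan(/\w+/) || parse_string
    raise ParseError, "expected a word or string at position #{@scanner.pos}" unless atom
    atom
  end

  # Naive quoted-string parsing: no escape sequences are handled.
  def parse_string
    pos = @scanner.pos
    quote = @scanner.scan(/["']/) or return nil
    body = @scanner.scan_until(/#{quote}/) or
      raise ParseError, "unterminated string starting at position #{pos}"
    body.chomp(quote)                        # drop the closing quotation mark
  end
end

parser = QueryParser.new
p parser.parse("sanderson")
p parser.parse("author:sanderson")
p parser.parse('title:"the final empire" AND (author:sanderson OR tag:fantasy)')
```

Note that this sketch only recognizes uppercase AND and OR, and it does not complain about leftover input after a complete expression; both are simplifications.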
Recursive descent parsers are so elegant! There is just something about how naturally they mimic the grammar…and how clearly the recursion describes the relationship between the different elements of the syntax… The technique isn’t ideal for every grammar, but for simple cases like this, I really, really dig it.
If parsers have been intimidating to you in the past, hopefully this has shown you how straightforward they can be. In fact, they can be quite fun!
Here’s a gist of my complete implementation, even with a few specs (intended more as examples than actual tests). Enjoy!