Writing a Simple Recursive Descent Parser
Someone asked a question recently on the local ruby list. They were looking for an implementation of a parser that would handle keywords and field specifications, like this:
Now, you have to understand that the compiler class I took in college (almost 20 years ago?!) was one of my favorites. It completely blew my mind. I still have (and love) the Dragon Book by Aho et al., and every once in a while, for old times’ sake, I take it off the shelf, thumb through it, and wax nostalgic. Ten-plus years ago I even invented and implemented some simple programming languages…but my career, as a whole, has been remarkably devoid of opportunities to implement parsers.
So, when this query came along, I perked up. A parser? Hmmm!
There are lots of different ways to implement these, but I decided to go with a recursive descent parser. These have always been my favorite, and–frankly–I couldn’t remember off the top of my head how to do any of the others.
So, first, I took the informal specification that was given by the OP, and converted it into Backus-Naur Form (BNF). (Technically, I guess I used an EBNF–Extended Backus-Naur Form–but this was just for my own use anyway.)
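The grammar itself is not reproduced here, so what follows is only a reconstruction pieced together from the method-by-method walkthrough later in this post. The rule names (expr, term, value, atom) are the ones the walkthrough uses; the op, word, string, and chars rules, and the exact notation, are assumptions.

```
expr   := term | term op expr
op     := "AND" | "OR"
term   := value | atom ":" value
value  := atom | "(" expr ")"
atom   := word | string
word   := /\w+/
string := '"' chars '"' | "'" chars "'"
```

Because expr recurses on its right-hand side and there is only one operator rule, a chain of operators simply nests to the right.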
Caveat: I intentionally avoided operator precedence, so in this implementation AND and OR have equivalent precedence. Also, the string parsing is very naive, for simplicity’s (and demonstration’s) sake.
Honestly, converting the description into a BNF is usually the hardest part, but once you’ve got that, the rest of the parser flows very naturally. Each of the left-hand sides of those BNF definitions becomes a method, which recursively calls the appropriate methods corresponding to the items on the right-hand side. (Thus, the “recursive” in “recursive descent”.)
The idea here is that the parse process will return an AST–an abstract syntax tree–which represents the input. To support that tree, I defined two simple structures: one for an expression with an operator, and one for a field specification:
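A minimal sketch of what those two structures could look like, using Ruby's Struct. The member names (op, left, right, name, value) and the use of symbols for operators are assumptions made for illustration, not necessarily what the original code used.

```ruby
# Hypothetical member names; the original definitions may differ.
Expression = Struct.new(:op, :left, :right)  # an operator with two operands
Field      = Struct.new(:name, :value)       # a "name:value" field specification

# Hand-built AST for a query such as: name:bob AND (age:30 OR cat)
ast = Expression.new(
  :and,
  Field.new("name", "bob"),
  Expression.new(:or, Field.new("age", "30"), "cat")
)
```

Plain strings stand in for bare atoms here, so the tree only needs these two node types. Structs also compare by value, which is handy when writing specs against the parser's output.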
Then, I jumped right in at the top and wrote the #parse method. It accepts a single argument, the input to parse (as a string). I used Ruby’s StringScanner class to do the lexical analysis (scanning), because there’s rarely a reason not to, really. StringScanner is awesome!
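If you haven't used StringScanner before, here is a small standalone illustration of the handful of calls a parser like this leans on. The input string is made up for the example.

```ruby
require "strscan"

scanner = StringScanner.new("  name:bob")

scanner.skip(/\s+/)           # advance past leading whitespace
word  = scanner.scan(/\w+/)   # returns the matched text and advances
colon = scanner.scan(/:/)     # matches only at the current position
miss  = scanner.scan(/\(/)    # no match: returns nil, position unchanged
pos   = scanner.pos           # current offset into the string
rest  = scanner.rest          # everything not yet consumed

scanner.scan(/\w+/)
done = scanner.eos?           # true once the whole input is consumed
```

The key property is that scan only matches at the current position and returns nil otherwise, which makes "try this, else try that" parsing code very direct.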
Recall the BNF from before. The expr element is the top-level, so my #parse method calls that. The implementation for #parse_expr is nice and simple:
An expr element (in the BNF) may be a term, or a term followed by an operator and another expr. Since it always starts with a term, we first call the corresponding #parse_term method. Then, we skip any whitespace, and look for an operator. If we find one, we instantiate a new Expression, give it the operator and the left operand, and parse the right operand (as an expr; note the recursion!). Otherwise, we simply return the operand we parsed at the start.
Next, let’s look at how #parse_term is defined:
We start by skipping (or “eating”, as it’s called) any whitespace. Then we look for a value. Look at the BNF again: see how a value may be either an atom, or a parenthesized expr? Comparing that with the term definition, we can see that a term may start with either an atom, or an expr. This means we can call parse_value, and if the result is an Expression, then we’re done and we just return it. Otherwise, we need to consider the case where we’ve got a field specification.
To do that, we check the next character. If it’s a colon, we instantiate a new Field and return it, parsing a new value in the process. Otherwise, we just return the value we already parsed.
So, what about parse_value, then? Surely it’ll be a beast? I mean, it can’t all be this easy, right?
Ha ha! You’re hilarious. Check this out.
We do have our first instance of error handling, here. We save the current position in the scanner, and then look for an opening parenthesis. If we find one, we parse (and return) an expr, and then eat a closing parenthesis. If no closing parenthesis is found, though, that’s an error! We raise an exception, telling where in the string the expression began.
If, on the other hand, there was no opening parenthesis to begin with, the value must be an atom, and we parse that instead.
Two more methods to go! The atom parser is really straightforward:
An atom is either a word (/\w+/) or a quoted string. If it is neither of those, we raise an exception and show where the error occurred.
Last method, then: parsing quoted strings.
We save the position (for error reporting), and then look to see what kind of quotation marks we’re dealing with. We then scan all characters up to (but not including) the next instance of that quotation mark, and return them, making sure to eat the closing quotation mark in the process. If there was no closing quotation mark, we report that error.
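As a rough standalone sketch of that description (not the original method), here is one way to write it as a free function that takes a StringScanner. The name scan_quoted_string, the choice of ArgumentError, and the lack of any escape-sequence handling are all assumptions for this example.

```ruby
require "strscan"

# Hypothetical helper: returns the text between matching quotes, or nil if
# the scanner is not positioned at a quotation mark. Deliberately naive:
# there is no support for escaped quotes inside the string.
def scan_quoted_string(scanner)
  start = scanner.pos                     # remembered for error reporting
  quote = scanner.scan(/["']/)            # which kind of quote opened the string?
  return nil unless quote

  text = scanner.scan(/[^#{quote}]*/)     # everything up to the next such quote
  unless scanner.scan(/#{quote}/)         # eat the closing quotation mark
    raise ArgumentError, "unterminated string starting at position #{start}"
  end
  text
end

double = scan_quoted_string(StringScanner.new(%q{"hello world" AND x}))
single = scan_quoted_string(StringScanner.new(%q{'it is "quoted"'}))
bare   = scan_quoted_string(StringScanner.new("word"))
```

Interpolating the opening quote into the pattern lets one code path handle both single- and double-quoted strings, and lets the other kind of quote appear inside the string.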
And that’s it! Seriously. We can now parse queries like this:
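To make that concrete, here is a compact sketch that assembles the steps described above into one class. It follows the prose rather than the original source, so treat the details as assumptions: the class name QueryParser, the ParseError exception, the struct member names, case-insensitive AND/OR, and the sample queries are all invented for this example.

```ruby
require "strscan"

Expression = Struct.new(:op, :left, :right)  # hypothetical member names
Field      = Struct.new(:name, :value)

class QueryParser                            # hypothetical class name
  class ParseError < StandardError; end

  def parse(input)
    @scanner = StringScanner.new(input)
    ast = parse_expr
    @scanner.skip(/\s+/)
    raise ParseError, "unexpected input at position #{@scanner.pos}" unless @scanner.eos?
    ast
  end

  private

  # expr := term | term op expr
  def parse_expr
    left = parse_term
    @scanner.skip(/\s+/)
    op = @scanner.scan(/(AND|OR)\b/i)        # case-insensitivity is an assumption
    return left unless op
    Expression.new(op.downcase.to_sym, left, parse_expr)
  end

  # term := value | atom ":" value
  def parse_term
    @scanner.skip(/\s+/)
    value = parse_value
    return value if value.is_a?(Expression)
    return value unless @scanner.scan(/:/)
    Field.new(value, parse_value)
  end

  # value := atom | "(" expr ")"
  def parse_value
    start = @scanner.pos
    return parse_atom unless @scanner.scan(/\(/)
    expr = parse_expr
    @scanner.skip(/\s+/)
    @scanner.scan(/\)/) or
      raise ParseError, "unclosed parenthesis opened at position #{start}"
    expr
  end

  # atom := word | string
  def parse_atom
    @scanner.scan(/\w+/) || parse_string ||
      raise(ParseError, "expected a word or string at position #{@scanner.pos}")
  end

  # Naive quoted-string scanning: no escape sequences.
  def parse_string
    start = @scanner.pos
    quote = @scanner.scan(/["']/)
    return nil unless quote
    text = @scanner.scan(/[^#{quote}]*/)
    @scanner.scan(/#{quote}/) or
      raise ParseError, "unterminated string starting at position #{start}"
    text
  end
end

parser  = QueryParser.new
nested  = parser.parse('name:bob AND (age:30 OR cat)')
chained = parser.parse('a AND b OR c')
quoted  = parser.parse('title:"hello world"')
```

Note how the chained query nests to the right: with no operator precedence, a AND b OR c comes back as an AND whose right operand is the OR expression.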
Recursive descent parsers are so elegant! There is just something about how naturally they mimic the grammar…and how clearly the recursion describes the relationship between the different elements of the syntax… It’s not ideal for every grammar, but for simple cases like this, I really, really dig it.
If parsers have been intimidating to you in the past, hopefully this has shown you how straightforward they can be. In fact, they can be quite fun!
Here’s a gist of my complete implementation, even with a few specs (intended more as examples than actual tests). Enjoy!