A new parser for Ioke

Last week I finally bit the bullet and rewrote the Ioke parser. I’m pretty happy about the end result actually, but it does involve moving away from Antlr’s as a parser generator. In fact, the new parser is handwritten – and as such goes against my general opinion to generate everything possible. I would like to quickly take a look at the reasons for doing this and also what the new parser will give Ioke.

For reference, the way the parser used to work was that the Antlr generated lexer and parser gave the Ioke runtime an Antlr Tree structure. This tree structure was then walked and transformed into chained Message’s, which is the AST that Ioke uses internally. Several other things were also done at this stage, including separating message chains on comma-borders. Most significantly the processing to put together interpolated strings and regular expressions happened at this stage. Sadly, the code to handle all that was complex, ugly, slow and frail. After this stage, operator shuffling happened. That part is still the same.

There were several problems I wanted to solve, but the main one was the ugliness of the algorithm. It wasn’t clear from the parser how an interpolated expression mapped into the AST, and the generated code added several complications that frankly weren’t necessary.

Ioke is a language with an extremely simple base syntax. It is only slightly more complicated than the typical Lisp parser, and there is almost no parser-level productions needed. So the new parser does away with the lexer/parser distinction and does everything in one pass. There is no need for lookahead at the token level, so this turns out to be a clear win. The code is actually much simpler now, and the Message AST is created inline in the new parser. When it comes to interpolation, instead of the semantic predicates and global stacks I had to use in the Antlr parser, I just do the obvious recursive interpolation. The code is simple to understand and quite efficient too.

At the end of the day, I did expect to see some performance improvements too. They turned out to be substantial. Parsing is about 2.5 times faster, and startup speed has improved by about 30%. The distribution size will be substantially smaller since I don’t need to ship the Antlr runtime libraries. And building the project is also much faster.

But the real gain is actually in maintainability of the code. It will be much easier for me to extend the parser now. I can do nice things to make the syntax more open ended and more powerful in ways that would be very inconvenient in Antlr. The error messages are much better since I have control over all the error states. In fact, there are only 13 distinct error messages in the new parser, and they are all very clear on what has gone wrong – I never did the work in the old parser to support that, but I get that almost for free in the new one.

Another thing I’ve been considering is to add reader macros to Ioke – and that would also have been quite painful with the Antlr parser generator. So all in all I’m very happy about the new parser, and I think it will definitely make it easier for the project going forward.

This blog post is in no way saying that Antlr is bad in any way. I like Antlr a lot – it’s a great tool. But it just wasn’t the right tool for Ioke’s syntax.

4 Comments, Comment or Ping

  1. Are there other styles of parsers that might have been better for ioke?

    September 23rd, 2009

  2. I usually prefer hand-written parsers for all but the most convoluted languages. Error messages alone make it worthwhile, but performance is also usually better because you can optimize for specific cases. Of course, the disadvantage is that it becomes very hard to handle more complex syntax. Imagine trying to hand-write a parser for Ruby! Obviously, Ioke doesn’t have this issue.

    There are some very nice parser generators out there, and some nice tools which create parsers without generation (e.g. parser combinators), but they aren’t the be-all/end-all of parsing. I usually try the following alternatives in order, with the last being my preferred approach, but also the one which requires the most effort:

    * Generalized LL (auto-generated in-memory using combinators)
    * Recursive-Ascent (hand-written from states auto-generated by Bison)
    * Recursive-Descent (hand-written)

    The final option provides a nice combination of extensibility, readability and solid error messages, but it requires a lot of care in the grammar. Recursive-Ascent requires less care and still provides very good error messages, but it’s very difficult to extend (you end up re-writing a lot). GLL is the absolute bomb as far as grammar flexibility (anything goes), but errors are often tricky and performance sometimes leaves something to be desired (it’s cubic, not linear).

    All in all, it’s a tradeoff, and I think you’ve gone with the right approach.

    September 23rd, 2009

  3. I really like using Antlr for mocking up the toy languages that I build, but I agree that it generates some really convoluted code (that’s one of the downfalls of being a generic parser generator). Once I get the syntax worked out and the initial language compiling, I will often go back and hand code the lexer and parser.

    It’s interesting that most people that haven’t done language development often focus on the lexer and parser, when that is often the easiest part of development (C++ excepted).

    I’ll be interested to look at your latest work.

    September 23rd, 2009

  4. Mason

    Fantastic. When do you think you’ll break this off into a lettered release? (I ask that in the most non-forward, non-expectant manner – I realize it’s already available @ GitHub.)

    I like the way you’ve been approaching this project. For many, the primary reason to move to a handwritten parser would have been the speed benefits. Instead, you’ve focused on the language – Getting It Right (however arbitrary “right” might be, being a one-man show) – and the speed boost came as an added bonus.

    September 24th, 2009

Reply to “A new parser for Ioke”