Porting Syck – A story of C, parser generators and general ugliness


As mentioned earlier, a few weeks back I decided to port Syck to Java, for the sake of JRuby. I will detail some interesting experiences in this blog post.

First some introductions. Syck is broadly divided into two pieces – the core library and the language adaptations. The language adaptions are the stuff that is specific to each language and provides the language level APIs. I won’t talk that much about this side of things – most of this post is about the core library.

Any YAML processor is divided into a parser and a emitter. The Syck emitter is pretty straightforward, and so is the Yecht port of it. The parser on the other hand need to be able to do several different things. First of all, it need to be able to handle the fact that YAML is context dependent, which means you can’t do it with a typical parser generator. So the way all YAML engines do it is by having a pretty smart scanner/tokenizer, and then a straight forward parser on top of this. The scanner takes care of the context dependent bits (which are mainly regarding indentation) and abstracts away this so the parser can just make sure to put documents together.

Another piece that is generally necessary is something that checks what type a value is of. Since YAML supports several different core types, and a YAML engine should always strive to read things without needing hints, the processor need to know that the YAML scalar “foo” has tag “tag:yaml.org,2002:str” while “42” has tag “tag:yaml.org,2002:int”. Of course there are many more types, including several different versions of integers, floats and timestamps. All of this handling is incidental to the parsing of the document, though. A scalar is a scalar no matter what type it is. So the recognizing of implicits is generally not done in the scanner or the parser, but in a separate scanner.

Syck uses re2c for the token scanner and the implicit scanner, and it uses Bison for the parser. My first version of Yecht was as straight forward as I could make it. I ported the backend of re2c to generate Java, and then I used JACC instead of Bison. So far all good. But when it was done, this initial version was pretty slow. I benchmarked the core library by supplying a nodehandler that didn’t do anything, and the did the same thing for Syck. I also added a small piece that just ran the scanner piece. And what I saw got me a bit depressed. I had gotten Yecht to be totally compatible with Syck, but it was buttslow. For comparison I used a random big YAML document without any real advanced features. Just a combination of mappings, sequences and scalars. For reference, the Syck numbers were these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  13959ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  16101ms

And JvYAMLb (the engine I was replacing in JRuby):

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   5325ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15794ms

And my first version of Yecht:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  93658ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took 117213ms

Ouch. The scanner is 7 times slower, and so is the parsing. And comparing with JvYAMLb, it looks even worse. What’s going on here?

As it happens, I didn’t tell you exactly how I ported the token scanner. The Syck implementation of this used about 10 different gotos to jump between different cases. Now, that’s a pretty fast operation in C. Using Java, I had to do something different. Since I wanted to make the first version of Yecht as close a port as possible, I decided to use a standard transformation to mimic the gotos. In Java you can do this by wrapping the area with a while(true) loop that has a label – and then you can use a variable that contains a number to indicate which goto point to go to next. The actual code lives in a switch statement inside of the while-loop. This code is a bit ugly but if you define some constants you end up with code that looks reasonable much like the C code. Just replace “goto plain3;” with “gotoPoint = plain3; break gotoNext;”.

The first thing I did after finding everything was slow was to try to pinpoint the place where all the performance disappeared. This is actually surprisingly hard. But in this case it turned out to be yylex, which contains the full token scanning logic including my homegrown gotos. Then Charles reminded me about one of those lessons we have learned while developing JRuby – namely that Hotspot really doesn’t like large switch statements. So my first step was to try to untangle the logic of all the gotos.

That was actually not that hard, since the logic bottomed out quite quickly. I managed to replace the switch-based goto-logic with separate methods that called each other instead. That was the only thing I did, and suddenly the scanner wasn’t a problem anymore. These were the numbers after I removed that goto-logic:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12207ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  31280ms

Yes, you’re reading that right. The scanning process became 7 times faster by just removing that switch statement and making it into separate methods. Hotspot really doesn’t like large switch statements, and it does like smaller methods. In this case it also made it easier to find the methods that were hotspots in the scanner too.

After this fix it was obvious that the parsing component was the main problem now. Since the approach of removing large switch statements had worked so well in the scanner, I decided to try the same approach in the parser. The parser generator I used – JACC – happens to be one of those generators that generates only code, no tables. Once upon a time this was probably the right behavior for Java, to get good performance, but it’s definitely not the right choice anymore. So I switched from JACC to Jay, and ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12960ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15411ms

Nice, huh? At this point I could have felt good about myself and stopped. But I decided to see if re2c/re2j was a bottleneck too. The two main methods in the token scanner that is used on basically all calls are “document()” and “plain()”. I rewrite these sections by hand. Re2j generate switch statements that in most cases take more than one jump to get to the right place. By doing some thinking on these algorithms I made the shortest path much shorter in these two methods. After fixing document() I got these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  11226ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  14059ms

And after fixing plain() I got it down to these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   9581ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12838ms

At this point I had a hunch, and decided to check the performance of the implicit-scanner. The first thing I did was write tests to check how fast it was, and compare it with JvYAMLb and Syck. My hunch was right, the re2j-based implicit-scanner was 10 times slower than the equivalent in Syck. The implicits are called during scanning and parsing when a scalar node is done, so I thought it might contribute to the performance. But instead of rewriting it by hand, I decided to move from re2j to Ragel. Since the implicit scanner is pretty limited, this was a move that worked there, but would never have worked for the token scanner. Ragel generates finite state machines in tables, and presto – it turned out to work like a charm. After rewriting the implicit scanner I ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   4804ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took   8424ms

And that’s where I decided to stop. I’m not sure what the moral of this tale is, except to generate as much as possible, be careful with switch-statements on hotspot, and rewrite by hand the pieces that really matter.

Also, use the right tool for the job. JACC was obviously not right for me, but Jay seems to be. Re2j is still right for some parts of the token scanner, while Ragel is right for the implicit-scanner. Of course, using the right tool for the job means knowing the alternatives. I knew about Ragel and I know the implementation of both re2j and Ragel, so I could make educated decisions based on their characteristics.



SuperRedCloth


RedCloth is cool. SuperRedCloth is cooler, faster, nicer, and all in all such a treat. And it works on JRuby now. At the time of writing, I haven’t really wired together a gem of it yet, but I hope to get it done soon. Until then, if you want to try this, you can download the necessary jar-file from http://opensource.ologix.com/superredcloth_scan.jar, install the (very small) SuperRedCloth lib-file into your JRuby, add the superredcloth_scan.jar into your site_ruby for JRuby and everything should be set to go. Be warned, this requires a very recent trunk version of JRuby (something like newer than April 1:st).

SuperRedCloth is another of those nice Ragel based libraries. I’m finding Ragel such a joy to work with; just change the few C-thingies into Java-thingies and everything works exactly the same. Neat.



Ragel performance


I did some performance testing on the old and new Resolver implementation. The testing have some stupid tests that exercise bad parts of both implementations (like longest match, where it can’t be decided what type something is until we have to backtrack about 20 characters). I placed these 24 strings in an array, and pounded on it with an instance of the ResolverImpl that is used in exactly the same way on all scalar values in an YAML document. The objective is to find out if the value is an implicit type or not. So basically, we give it a String, and get back a tag URI. So it’s not like I’m parsing a language or anything. I’m just doing some recognizing here.

The old implementation was based on a Map> where the first letter of the string to resolve was used as an index to find a list of patterns to try sequentially. This worked fine, and made it extensible. But not very fast. This is the baseline. For 24 different strings, iterated 100 000 times for 2 400 000 resolves it takes 7879ms. That’s OK, but not great.

Now, the new Ragel implementation is dead simple. It’s just a translation of the regexps in the aforementioned Pattern’s into a state machine. At EOF out actions (%/ for people in the Ragel knowhow), I execute an action that sets a local variable to a tag, and at the end of the resolve method returns that tag. Dead simple, and not exercising the full strength of Ragel, of course.
So, for the same number of resolves, this ResolverImpl takes 1288ms. That’s 611% improvement in speed. Ain’t it nice to have a friend such as Ragel? And the best part is, for harder tasks, these improvements would be even larger.
Finite State Machines are your friends. All your base are belongs to us.



Hpricot goodness


This is just so cool, I cannot contain it. For those of you who haven’t heard about Hpricot, it is one of why the lucky stiff‘s incredibly cool tools (which he probably will use to take over the world any day now…). It’s HTML parsing goodness, very flexible, with the goal of being able to parse (and fix) everything that Firefox handles.

“So what?” you’re probably asking… Well, Hpricot uses Ragel and some C code to achieve blinding speed. This means JRuby can’t run it. Or I should say couldn’t run it:


orpheus:~/workspace/jruby> jruby bin/gem install hpricot --source http://code.whytheluckystiff.net
Bulk updating Gem source index for: http://code.whytheluckystiff.net
Select which gem to install for your platform (java)
1. hpricot 0.5.110 (jruby)
2. hpricot 0.5.110 (mswin32)
3. hpricot 0.5.110 (ruby)
4. hpricot 0.5 (ruby)
5. hpricot 0.5 (mswin32)
6. hpricot 0.5.0 (ruby)
7. hpricot 0.5.0 (mswin32)
8. hpricot 0.4.99 (ruby)
9. hpricot 0.4.99 (mswin32)
10. hpricot 0.4.92 (ruby)
11. hpricot 0.4.92 (mswin32)
12. Skip this gem
13. Cancel installation
> 1
Successfully installed hpricot-0.5.110-jruby
Installing ri documentation for hpricot-0.5.110-jruby...
Installing RDoc documentation for hpricot-0.5.110-jruby...

That’s right, Hpricot is now more promiscuous than any other gem with native parts.
What can you do with it? Well, I’m just going to point you to _why’s own description of it. All he says at http://code.whytheluckystiff.net/hpricot/ will work fine in JRuby!

How did this come to be? Well, me and _why did some joint hacking, which was helped along by the fact that Adrian Thurston (the genius behind Ragel) recently added Java support to it. So, basically, most of the Ragel definition is exactly the same for both the C and the Java versions. The native code has been factored out, and both versions are buildable with rake from _why’s code repository.

This is important. Don’t think anything else. This strategy will, and can, be used for other gems with native parts. It’s just a question of time.