Porting Syck – A story of C, parser generators and general ugliness

As mentioned earlier, a few weeks back I decided to port Syck to Java, for the sake of JRuby. I will detail some interesting experiences in this blog post.

First some introductions. Syck is broadly divided into two pieces – the core library and the language adaptations. The language adaptions are the stuff that is specific to each language and provides the language level APIs. I won’t talk that much about this side of things – most of this post is about the core library.

Any YAML processor is divided into a parser and a emitter. The Syck emitter is pretty straightforward, and so is the Yecht port of it. The parser on the other hand need to be able to do several different things. First of all, it need to be able to handle the fact that YAML is context dependent, which means you can’t do it with a typical parser generator. So the way all YAML engines do it is by having a pretty smart scanner/tokenizer, and then a straight forward parser on top of this. The scanner takes care of the context dependent bits (which are mainly regarding indentation) and abstracts away this so the parser can just make sure to put documents together.

Another piece that is generally necessary is something that checks what type a value is of. Since YAML supports several different core types, and a YAML engine should always strive to read things without needing hints, the processor need to know that the YAML scalar “foo” has tag “tag:yaml.org,2002:str” while “42” has tag “tag:yaml.org,2002:int”. Of course there are many more types, including several different versions of integers, floats and timestamps. All of this handling is incidental to the parsing of the document, though. A scalar is a scalar no matter what type it is. So the recognizing of implicits is generally not done in the scanner or the parser, but in a separate scanner.

Syck uses re2c for the token scanner and the implicit scanner, and it uses Bison for the parser. My first version of Yecht was as straight forward as I could make it. I ported the backend of re2c to generate Java, and then I used JACC instead of Bison. So far all good. But when it was done, this initial version was pretty slow. I benchmarked the core library by supplying a nodehandler that didn’t do anything, and the did the same thing for Syck. I also added a small piece that just ran the scanner piece. And what I saw got me a bit depressed. I had gotten Yecht to be totally compatible with Syck, but it was buttslow. For comparison I used a random big YAML document without any real advanced features. Just a combination of mappings, sequences and scalars. For reference, the Syck numbers were these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  13959ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  16101ms

And JvYAMLb (the engine I was replacing in JRuby):

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   5325ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15794ms

And my first version of Yecht:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  93658ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took 117213ms

Ouch. The scanner is 7 times slower, and so is the parsing. And comparing with JvYAMLb, it looks even worse. What’s going on here?

As it happens, I didn’t tell you exactly how I ported the token scanner. The Syck implementation of this used about 10 different gotos to jump between different cases. Now, that’s a pretty fast operation in C. Using Java, I had to do something different. Since I wanted to make the first version of Yecht as close a port as possible, I decided to use a standard transformation to mimic the gotos. In Java you can do this by wrapping the area with a while(true) loop that has a label – and then you can use a variable that contains a number to indicate which goto point to go to next. The actual code lives in a switch statement inside of the while-loop. This code is a bit ugly but if you define some constants you end up with code that looks reasonable much like the C code. Just replace “goto plain3;” with “gotoPoint = plain3; break gotoNext;”.

The first thing I did after finding everything was slow was to try to pinpoint the place where all the performance disappeared. This is actually surprisingly hard. But in this case it turned out to be yylex, which contains the full token scanning logic including my homegrown gotos. Then Charles reminded me about one of those lessons we have learned while developing JRuby – namely that Hotspot really doesn’t like large switch statements. So my first step was to try to untangle the logic of all the gotos.

That was actually not that hard, since the logic bottomed out quite quickly. I managed to replace the switch-based goto-logic with separate methods that called each other instead. That was the only thing I did, and suddenly the scanner wasn’t a problem anymore. These were the numbers after I removed that goto-logic:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12207ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  31280ms

Yes, you’re reading that right. The scanning process became 7 times faster by just removing that switch statement and making it into separate methods. Hotspot really doesn’t like large switch statements, and it does like smaller methods. In this case it also made it easier to find the methods that were hotspots in the scanner too.

After this fix it was obvious that the parsing component was the main problem now. Since the approach of removing large switch statements had worked so well in the scanner, I decided to try the same approach in the parser. The parser generator I used – JACC – happens to be one of those generators that generates only code, no tables. Once upon a time this was probably the right behavior for Java, to get good performance, but it’s definitely not the right choice anymore. So I switched from JACC to Jay, and ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12960ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15411ms

Nice, huh? At this point I could have felt good about myself and stopped. But I decided to see if re2c/re2j was a bottleneck too. The two main methods in the token scanner that is used on basically all calls are “document()” and “plain()”. I rewrite these sections by hand. Re2j generate switch statements that in most cases take more than one jump to get to the right place. By doing some thinking on these algorithms I made the shortest path much shorter in these two methods. After fixing document() I got these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  11226ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  14059ms

And after fixing plain() I got it down to these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   9581ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12838ms

At this point I had a hunch, and decided to check the performance of the implicit-scanner. The first thing I did was write tests to check how fast it was, and compare it with JvYAMLb and Syck. My hunch was right, the re2j-based implicit-scanner was 10 times slower than the equivalent in Syck. The implicits are called during scanning and parsing when a scalar node is done, so I thought it might contribute to the performance. But instead of rewriting it by hand, I decided to move from re2j to Ragel. Since the implicit scanner is pretty limited, this was a move that worked there, but would never have worked for the token scanner. Ragel generates finite state machines in tables, and presto – it turned out to work like a charm. After rewriting the implicit scanner I ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   4804ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took   8424ms

And that’s where I decided to stop. I’m not sure what the moral of this tale is, except to generate as much as possible, be careful with switch-statements on hotspot, and rewrite by hand the pieces that really matter.

Also, use the right tool for the job. JACC was obviously not right for me, but Jay seems to be. Re2j is still right for some parts of the token scanner, while Ragel is right for the implicit-scanner. Of course, using the right tool for the job means knowing the alternatives. I knew about Ragel and I know the implementation of both re2j and Ragel, so I could make educated decisions based on their characteristics.

Ioke’s need for speed

I find it interesting that one of the expectations on Ioke is that it will be faster than Io. I need to make this really, obviously clear. Ioke is not even close to the speed of Io. Ruby is a blazing arrow compared to Ioke. I know this, and I’ve made a clear design decision to not take performance into consideration at all right now.

There are several ways to think about this. I’m creating Ioke for several reason. One of them is to see how far you can push expressability in a language. Can you take the blub hierarchy another level? To be able to do this, it’s impossible to be concerned with speed.

I realize that if Ioke someday becomes a successful language, performance will be more of an issue. But there are several implicit assumptions in that statement. The first one is that it will be successful at all. After all, most languages fail. If Ioke fails to become successful enough (for any definition of the term), it will almost certainly not be because of lack of speed. And if lack of speed is the only problem, well, that can be fixed by focusing totally on performance after the fact. But I feel that it is much more important to first get the model correct, and then optimize, rather than the other way around.

Programmers all know the old aphorism about the dangers of premature optimization. But we still do it. All the time. It is really hard for me to try to avoid it in Ioke. But I think I’m succeeding, because Ioke is really dead slow right now.

Ioke does need speed. But it doesn’t speed right now. There are many ifs in the success of a language, but if all of them are fulfilled – well, then maybe in a year or two a focus on speed will come. Ioke is a malleable language, and it will be possible to do all kinds of tricks with the language to make performance improvement happen. But that is a consequence of the runtime model. The fact that the language is really flexible makes it slow now, but will make it easier to improve performance on later.

Expressive power vs performance

We all know that all programming languages in use today are Turing complete. But what this comes down to in reality is Greenspun’s Tenth Rule, which isn’t that useful for common discourse. So I will ignore Turing completeness in this post, and argue as if some languages are absolutely more powerful than others in terms of expressiveness. How to measure that power is a different discussion, and something Paul Graham and many others have written about. So I’ll make it really simple here – I’m going to use my own subjective view of the power of different programming languages, where succinctness (ability to compress without losing readability, a hard thing indeed), high classes of abstractions and support for different paradigms are supported. Not surprisingly, this put languages like Lisp, Smalltalk and Haskell in the absolute top, while languages like Java, C#, Basic and many other end up further down the scale to different degrees. Assembler is somewhere near the absolute bottom.

My contention is that expressive power is the absolutely most important property to focus on right now. We have progressively gotten better and better at this, but it still feels as if many decisions in programming languages are still based on whether they can be implemented efficiently or not. That is definitely important in some circumstances, but maybe not as often as we think. As we know, programmers are prone to premature optimization, and I think this is true for programming language design too.

One of the core guiding principles for Ioke is to forcefully remove all concerns about performance when thinking about new abstraction capabilities. I think there are new abstractions, new ways of programming – or even just incrementally better ways of doing things – that aren’t being explored enough. Take something like Ruby’s variables and instance variables, which spring into use the first time they are assigned to. This is highly useful, and lead to very neat code, but it’s hell to implement efficiently. And the situation can be even worse, like what I do in Ioke, where it’s not possible to statically/lexically see if a variable assignment is the first or not, and where it belongs. This means that there is no way to create efficient closure structures, no way to create efficient storage mechanisms for local scope, and so on. Some guesses can be made, but in the general cases it’s hard to make it perform well.

So why do it? Well, we’ve been pretty good at creating medium expressive languages that perform well, and we’re really good at designing languages that aren’t expressive at all while performing very well. But I’m lacking the other end of the scale. The most expressive languages I mentioned above still make concessions to performance in many places. Just take something like let-bindings in Lisp, which actually turns out to be really easy to compile well. Lexical closures is also not too bad from an efficiency stand point. Actually, lexical closures are way more efficient than the dynamic scope most Lisps used to have once upon a time. Was the decision to move away from dynamic scope driven by performance? No, not entirely. There were other good reasons to abandon it – but note that Common Lisp and Emacs Lisp still contain it.

So. I hope there are new things to be found in terms of expressiveness. I hope I can find some of them with Ioke. One simple thing is the control over scoping that I’ve mentioned before. Another is implementing local variables as cells on objects, just like all other properties in Ioke. A third is allowing both macros and syntax. Another is the ‘it’ variable in if statement. Another is removing any kind of immediate objects (so numbers and symbols are simple objects just like everything else in Ioke. You can override a method on one specific symbol instance of you want, which isn’t really possible in Ruby.) Another is removing all floats from the language. That data type just doesn’t have a good purpose in Ioke.

But this is just beginning. I hope to see new ways of doing stuff later on. But keep in mind, whenever I talk about some feature of Ioke, I will try to disregard the performance cost of it. Really.

Current state of Regular Expressions

As I’ve made clear earlier, the current regular expression situation has once again become impractical. To reiterate the history: We began with regular Java regex support. This started to cave in when we found out that the algorithm used is actually recursive, and fails for some common regexps used inside Rails among others. To fix that, we integrated JRegex instead. That’s the engine 1.0 was released with and is still the engine in use. It works fairly well, and is fast for a Java engine. But not fast enough. In particular, there is no support for searching for exact strings and failing fast, and the engine requires us to transform our byte[]-strings to char[] or String. Not exactly optimal. Another problem is that compatibility with MRI suffers, especially in the multi byte support.

There are two solutions currently on the way. Core developer Marcin are working on a port of the 1.9 regexp engine Oniguruma. This port still has some way to go, and is not integrated with JRuby. The other effort is called REJ, and is a port of the MRI engine I did a few months back. I’ve freshened up the work and integrated it with JRuby in a branch. At the moment this work actually seems to go quite well, but there are some snags.

First of all, let me point out that this approach gives us more or less total multibyte compatibility for 1.8, which is quite nice.

When doing benchmarking, I’m generally using Rails as the bar. I have a series of regular expressions that Petstore uses for each requst, and I’m using these to check performance. As a first datapoint, JRuby+REJ is faster at parsing regexps than JRuby trunk for basically all regexps. This ranges from slightly faster to twice as fast.

Most of the Rails regexen are actually faster in REJ than in JRuby+trunk, but the problem is that some of them are actually quite a bit slower. 4 of the 22 Rails regexps are slower, by between 20 and 250% percent. There are also this one: /.*_f/ =~ “_fxxxxxxxxxxxxxxxxxxxxxxx” which basically runs about 10x slower than JRuby trunk. Not nice at all.

In the end, the problem is backtracking. Since REJ is a straight port of the MRI code, the backtracking is also ported. But it seems that Java is unusually bad at handling that specific algorithm, and it performs quite badly. At the moment I’m continuing to look at it and trying to improve performance in all ways possible, so we’ll see what happens. Charles Nutter have also started to look at it.

But what’s really interesting is that I reran my Petstore benchmarks with the current REJ code. To rehash, my last results with JRuby trunk looked like this:

controller :   1.804000   0.000000   1.804000 (  1.804000)
view : 5.510000 0.000000 5.510000 ( 5.510000)
full action: 13.876000 0.000000 13.876000 ( 13.876000)

But the results from rerunning with REJ was interesting, to say the least. I expected bad results because of the bad backtracking performance, but it seems the other speed improvements weigh up:

controller :   1.782000   0.000000   1.782000 (  1.782000)
view : 4.735000 0.000000 4.735000 ( 4.735000)
full action: 12.727000 0.000000 12.727000 ( 12.727000)

As you can see, the improvement is quite large in the view numbers. It is also almost there compared to MRI which had 4.57. Finally, the full action is better by a full second too. Again, MRI is 9.57s and JRuby 12.72. It’s getting closer. I am quite optimistic right now, provided that we manage to fix the remaining problems with backtracking, our regexp engine might well be a great boon to performance.

Updated JRuby on Rails performance numbers

So, after my last post, several of us have spent time looking at different parts of Rails and JRuby performance. We have managed to improve things quite nicely since my last performance note. Some of the things changed have been JRuby’s each_line implementation, JRuby’s split implementation, a few other improvements, and some small fixes to AR-JDBC. After that, here are some new numbers for Petstore. Remember, the MRI numbers are for MRI 1.8.6 with native C MySQL. The JRuby numbers is with ActiveRecord-JDBC trunk and MySQL. (I’m only showing the best numbers of each). The number in bold is the most important one for comparison.

MRI   controller :   1.000000   0.070000   1.070000 (  1.430260)
JRuby controller : 1.804000 0.000000 1.804000 ( 1.804000)

MRI view : 4.410000 0.150000 4.560000 ( 4.575399)
JRuby view : 5.510000 0.000000 5.510000 ( 5.510000)

MRI full action: 8.260000 0.410000 8.670000 ( 9.574460)
JRuby full action: 13.876000 0.000000 13.876000 ( 13.876000)

As you can see, we are talking about 9.5s MRI to 13.8s for JRuby, which I find is a quite nice achievement if you look at the numbers from Friday. We are inching closer and closer. Both the view and the controller numbers are looking very nice. This is actually indicative of a nice trend – since general JRuby primitive performance is really good, the slowness in our Regular Expression engine is weighed up by much faster execution speed.

Once the port of Oniguruma lands, this story will almost certainly look very different. But even so, this is looking good.

JRuby discovery number one

After my last entry I’ve spent lots of time checking different parts of JRuby, trying to find the one true bottleneck for Rails. Of course, I still haven’t found it (otherwise I would have said YAY in the subject for this blog). But I have found a few things – for example, symbols are slow right now, but Bill’s work will make them better. And it doesn’t affect Rails performance at all.

But the discovery I made was when I looked at the performance of the regular expressions used in Rails. There are exactly 50 of them for each request, so I did a script that checked the performance of each of them against MRI. And I found that there was one in particular that had really interesting performance when comparing MRI to JRuby. In fact, it was between 200 and a 1000 times slower. What’s worse, the performance wasn’t linear.

So which regular expression was the culprit? Well, /.*?\n/m. That doesn’t look to bad. And in fact, this expression displayed not one, but two problems with JRuby. The first one is that any regular expression engine should be able to fail fast on something like this, simply because there is a string that always needs to be part of a string for this expression to match. In MRI, that part of the engine is called bm_search, and is a very fast way to fail. JRuby doesn’t have that. Marcin is working on a port of Oniguruma though, so that will fix that part of the problem.

But wait, if you grep for this regexp in the Rails sources you won’t find it. So where was it actually used? Here is the kicker: it was used in JRuby’s implementation of String#each_line. So, let’s take some time to look at a quick benchmark for each_line:

require 'benchmark'

str = "Content-Type: text/html; charset=utf-8\r\nSet-Cookie: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa "


puts "each_line on small string with several lines"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})

str = "abc" * 15

puts "each_line on short string with no line divisions"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})

str = "abc" * 4000

puts "each_line on large string with no line divisions"
10.times do
puts(Benchmark.measure{TIMES.times { str.each_line{} }})

As you can see, we simple measure the performance of doing a 100 000 each_line calls on three different strings. The first one is a short string with several newlines, the second is a short string with no newlines, and the last is a long string with no newlines. How does MRI run this benchmark?

each_line on small string with several lines
0.160000 0.000000 0.160000 ( 0.157664)
0.150000 0.000000 0.150000 ( 0.160450)
0.160000 0.000000 0.160000 ( 0.171563)
0.150000 0.000000 0.150000 ( 0.157854)
0.150000 0.000000 0.150000 ( 0.154578)
0.150000 0.000000 0.150000 ( 0.154547)
0.160000 0.000000 0.160000 ( 0.158894)
0.150000 0.000000 0.150000 ( 0.158064)
0.150000 0.010000 0.160000 ( 0.156975)
0.160000 0.000000 0.160000 ( 0.156857)
each_line on short string with no line divisions
0.080000 0.000000 0.080000 ( 0.086789)
0.090000 0.000000 0.090000 ( 0.084559)
0.080000 0.000000 0.080000 ( 0.093477)
0.090000 0.000000 0.090000 ( 0.084700)
0.080000 0.000000 0.080000 ( 0.089917)
0.090000 0.000000 0.090000 ( 0.084176)
0.080000 0.000000 0.080000 ( 0.086735)
0.090000 0.000000 0.090000 ( 0.085536)
0.080000 0.000000 0.080000 ( 0.084668)
0.090000 0.000000 0.090000 ( 0.090176)
each_line on large string with no line divisions
3.350000 0.020000 3.370000 ( 3.404514)
3.330000 0.020000 3.350000 ( 3.690576)
3.320000 0.040000 3.360000 ( 3.851804)
3.320000 0.020000 3.340000 ( 3.651748)
3.340000 0.020000 3.360000 ( 3.478186)
3.340000 0.020000 3.360000 ( 3.447704)
3.330000 0.020000 3.350000 ( 3.448651)
3.350000 0.010000 3.360000 ( 3.489842)
3.350000 0.020000 3.370000 ( 3.429135)
3.350000 0.010000 3.360000 ( 3.372925)

OK, this looks reasonable. The large string is obviously taking more time to search, but not incredibly much time. What about trunk JRuby?

each_line on small string with several lines
32.668000 0.000000 32.668000 ( 32.668000)
30.785000 0.000000 30.785000 ( 30.785000)
30.824000 0.000000 30.824000 ( 30.824000)
30.878000 0.000000 30.878000 ( 30.877000)
30.904000 0.000000 30.904000 ( 30.904000)
30.826000 0.000000 30.826000 ( 30.826000)
30.550000 0.000000 30.550000 ( 30.550000)
32.331000 0.000000 32.331000 ( 32.331000)
30.971000 0.000000 30.971000 ( 30.971000)
30.537000 0.000000 30.537000 ( 30.537000)
each_line on short string with no line divisions
7.472000 0.000000 7.472000 ( 7.472000)
7.350000 0.000000 7.350000 ( 7.350000)
7.516000 0.000000 7.516000 ( 7.516000)
7.252000 0.000000 7.252000 ( 7.252000)
7.313000 0.000000 7.313000 ( 7.313000)
7.262000 0.000000 7.262000 ( 7.262000)
7.383000 0.000000 7.383000 ( 7.383000)
7.786000 0.000000 7.786000 ( 7.786000)
7.583000 0.000000 7.583000 ( 7.583000)
7.529000 0.000000 7.529000 ( 7.529000)
each_line on large string with no line divisions

Ooops. That doesn’t look so good… And also, where is the last ten lines? Eh… It’s still running. It’s been running for two hours to produce the first line. That means that it’s taking at least 7200 seconds which is more than 2400 times slower than MRI. But in fact, since the matching of the regular expression above is not linear, but exponential in performance, I don’t expect this to ever finish.

There are a few interesting lessons to take away from this exercise:

  1. There may still be implementation problems like this in many parts of JRuby – performance will improve by quite much every time we find something like this. I haven’t measured Rails performance after this is fixed, and I don’t expect it to actually fix the whole problem, but I think I’ll see better numbers.
  2. Understand regular expressions. Why is /.*?\n/ so incredibly bad for strings over a certain length? In this case it’s the combination of .* and ?. What would be a better implementation in almost all cases? /[^\n]\n/. Notice that there is no backtracking in this implementation, and because of that, this regexp will have performance O(n) while the earlier one was O(n^2). Learn and know these things. They are the difference between usage and expertise.

Mystery: An exposé on JRuby performance

So, after Charlies awesome work yesterday (documented here), I felt it was time I put the story straight on general JRuby performance. It’s something of a mystery. I’ll begin by showing you the results of lots of different benchmarks and comparison to MRI. All these benchmarks have been run on my MacBook Pro, dual 2.33Ghz cores with 2Gb memory. JRuby revision is 4578, Java is Apple 1.6 beta, and Ruby version is 1.8.6 (2007-03-13 patchlevel 0). JRuby was ran with -J-server and -O.

Now, lets begin with the YARV benchmarks. The first values is always MRI, the second i JRuby. I’ve made the benchmark red when MRI is faster, and green when JRuby is faster.

0.480000 0.010000 0.490000 ( 0.511641)
0.479000 0.000000 0.479000 ( 0.478000)

ERROR stack level too deep
2.687000 0.000000 2.687000 ( 2.687000)

5.880000 0.030000 5.910000 ( 6.151324)
2.687000 0.000000 2.687000 ( 2.687000)

1.930000 0.010000 1.940000 ( 1.992984)
2.752000 0.000000 2.752000 ( 2.876000)

84.160000 0.450000 84.610000 ( 88.205519)
77.117000 0.000000 77.117000 ( 77.117000)

2.600000 0.430000 3.030000 ( 3.169156)
2.162000 0.000000 2.162000 ( 2.162000)

1.390000 0.010000 1.400000 ( 1.427766)
1.003000 0.000000 1.003000 ( 1.003000)

7.570000 0.060000 7.630000 ( 7.888381)
2.676000 0.000000 2.676000 ( 2.676000)

6.020000 0.030000 6.050000 ( 6.186971)
2.236000 0.000000 2.236000 ( 2.236000)

4.240000 0.020000 4.260000 ( 4.404826)
3.354000 0.000000 3.354000 ( 3.354000)

9.450000 0.050000 9.500000 ( 9.678552)
5.037000 0.000000 5.037000 ( 5.037000)

1.890000 0.010000 1.900000 ( 1.936502)
1.039000 0.000000 1.039000 ( 1.039000)

ERROR stack level too deep
4.928000 0.000000 4.928000 ( 4.927000)

5.580000 0.020000 5.600000 ( 5.709101)
5.552000 0.000000 5.552000 ( 5.552000)

1.580000 0.040000 1.620000 ( 1.647592)
1.602000 0.000000 1.602000 ( 1.602000)

4.170000 0.390000 4.560000 ( 4.597234)
4.683000 0.000000 4.683000 ( 4.683000)

0.970000 0.030000 1.000000 ( 1.036678)
0.814000 0.000000 0.814000 ( 0.814000)

1.700000 0.010000 1.710000 ( 1.765739)
1.878000 0.000000 1.878000 ( 1.879000)

5.130000 0.020000 5.150000 ( 5.258066)
4.661000 0.000000 4.661000 ( 4.661000)

5.480000 0.030000 5.510000 ( 5.615154)
3.095000 0.000000 3.095000 ( 3.095000)

1.760000 0.010000 1.770000 ( 1.806116)
1.495000 0.000000 1.495000 ( 1.495000)

0.680000 0.010000 0.690000 ( 0.705296)
0.853000 0.000000 0.853000 ( 0.853000)

19.920000 0.110000 20.030000 ( 20.547236)
12.698000 0.000000 12.698000 ( 12.698000)

15.720000 0.080000 15.800000 ( 16.426734)
7.654000 0.000000 7.654000 ( 7.654000)

14.530000 0.070000 14.600000 ( 15.137106)
7.588000 0.000000 7.588000 ( 7.589000)

17.230000 0.090000 17.320000 ( 20.406438)
6.415000 0.000000 6.415000 ( 6.416000)

11.520000 0.040000 11.560000 ( 11.736435)
5.604000 0.000000 5.604000 ( 5.603000)

17.560000 0.080000 17.640000 ( 18.178065)
5.413000 0.000000 5.413000 ( 5.413000)

22.160000 0.110000 22.270000 ( 22.698746)
6.836000 0.000000 6.836000 ( 6.835000)

5.600000 0.020000 5.620000 ( 5.675354)
1.844000 0.000000 1.844000 ( 1.844000)

9.800000 0.030000 9.830000 ( 9.918884)
5.152000 0.000000 5.152000 ( 5.151000)

13.570000 0.050000 13.620000 ( 13.803066)
10.289000 0.000000 10.289000 ( 10.289000)

3.990000 0.010000 4.000000 ( 4.071277)
1.750000 0.000000 1.750000 ( 1.749000)

5.670000 0.020000 5.690000 ( 5.723124)
3.267000 0.000000 3.267000 ( 3.267000)

0.380000 0.000000 0.380000 ( 0.387671)
0.961000 0.000000 0.961000 ( 0.961000)

3.720000 0.010000 3.730000 ( 3.748266)
2.135000 0.000000 2.135000 ( 2.136000)

4.100000 0.010000 4.110000 ( 4.138355)
1.781000 0.000000 1.781000 ( 1.781000)

3.320000 0.010000 3.330000 ( 3.348069)
1.385000 0.000000 1.385000 ( 1.385000)

4.810000 0.020000 4.830000 ( 4.856368)
2.920000 0.000000 2.920000 ( 2.921000)

0.000000 0.000000 0.000000 ( 0.006621)
0.368000 0.000000 0.368000 ( 0.368000)

What is interesting about these numbers is that almost all of them are way faster, and the ones that are slower are so by a quite narrow margin (except for the regexp and thread tests).

So, this is the baseline against MRI. Let’s take a look at a few other benchmarks. These can all be found in JRuby, in the test/bench directory. When you run them, all of them generate 5 or 10 runs of all measures, but I’ve simply taken the best one for each. The repetition is to allow Java to warm up. Here are a few different benchmarks, same convention and running parameters as above:

1.390000 0.000000 1.390000 ( 1.412710)
0.532000 0.000000 0.532000 ( 0.532000)

Control: 1m loops accessing a local variable 100 times
2.830000 0.010000 2.840000 ( 2.864822)
0.105000 0.000000 0.105000 ( 0.105000)

Test STI: 1m loops accessing a fixnum var and calling to_i 100 times
10.000000 0.030000 10.030000 ( 10.100846)
2.111000 0.000000 2.111000 ( 2.111000)

Test ruby method: 1m loops calling self's foo 100 times
16.130000 0.060000 16.190000 ( 16.359876)
7.971000 0.000000 7.971000 ( 7.971000)

Test ruby method: 100k loops calling self's foo 100 times
1.570000 0.000000 1.570000 ( 1.588715)
0.587000 0.000000 0.587000 ( 0.587000)

1m loops yielding a fixnum 10 times to a block that just retrieves dvar
2.800000 0.010000 2.810000 ( 2.822425)
1.194000 0.000000 1.194000 ( 1.194000)

1m loops yielding two fixnums 10 times to block accessing one
6.550000 0.030000 6.580000 ( 6.623452)
2.064000 0.000000 2.064000 ( 2.064000)

1m loops yielding three fixnums 10 times to block accessing one
7.390000 0.020000 7.410000 ( 7.467841)
2.120000 0.000000 2.120000 ( 2.120000)

1m loops yielding three fixnums 10 times to block splatting and accessing them
9.250000 0.040000 9.290000 ( 9.339131)
2.451000 0.000000 2.451000 ( 2.451000)

1m loops yielding a fixnums 10 times to block with just a fixnum (no vars)
1.890000 0.000000 1.890000 ( 1.908501)
1.278000 0.000000 1.278000 ( 1.277000)

1m loops calling a method with a fixnum that just returns it
2.740000 0.010000 2.750000 ( 2.766255)
1.426000 0.000000 1.426000 ( 1.426000)

Measure string array sort time
5.950000 0.060000 6.010000 ( 6.055483)
8.061000 0.000000 8.061000 ( 8.061000)

Measure hash put time
0.390000 0.010000 0.400000 ( 0.398593)
0.208000 0.000000 0.208000 ( 0.209000)

Measure hash get time (note: not same scale as put test)
1.620000 0.000000 1.620000 ( 1.646155)
0.740000 0.000000 0.740000 ( 0.740000)

Measure string == comparison time
2.340000 0.010000 2.350000 ( 2.368579)
0.812000 0.000000 0.812000 ( 0.812000)

Measure string == comparison time, different last pos
2.690000 0.000000 2.690000 ( 2.724772)
0.860000 0.000000 0.860000 ( 0.860000)

Measure string <=> comparison time
2.340000 0.010000 2.350000 ( 2.369915)
0.824000 0.000000 0.824000 ( 0.824000)

Measure 'string'.index(fixnum) time
0.790000 0.010000 0.800000 ( 0.808189)
1.113000 0.000000 1.113000 ( 1.113000)

Measure 'string'.index(str) time
2.860000 0.010000 2.870000 ( 2.892730)
0.956000 0.000000 0.956000 ( 0.956000)

Measure 'string'.rindex(fixnum) time
0.800000 0.000000 0.800000 ( 0.817300)
0.631000 0.000000 0.631000 ( 0.631000)

Measure 'string'.rindex(str) time
12.190000 0.040000 12.230000 ( 12.310492)
1.247000 0.000000 1.247000 ( 1.247000)

100k * 100 ivar gets, 1 ivar
0.500000 0.000000 0.500000 ( 0.582682)
0.340000 0.000000 0.340000 ( 0.340000)

100k * 100 ivar sets, 1 ivar
0.700000 0.010000 0.710000 ( 0.816724)
0.402000 0.000000 0.402000 ( 0.401000)

100k * 100 attr gets, 1 ivar
0.970000 0.000000 0.970000 ( 0.988212)
0.875000 0.000000 0.875000 ( 0.874000)

100k * 100 attr sets, 1 ivar
1.390000 0.010000 1.400000 ( 1.406535)
1.114000 0.000000 1.114000 ( 1.114000)

100k * 100 ivar gets, 2 ivars
0.490000 0.000000 0.490000 ( 0.506206)
0.344000 0.000000 0.344000 ( 0.344000)

100k * 100 ivar sets, 2 ivars
0.680000 0.000000 0.680000 ( 0.693064)
0.388000 0.000000 0.388000 ( 0.388000)

100k * 100 attr gets, 2 ivars
0.970000 0.000000 0.970000 ( 0.989313)
0.878000 0.000000 0.878000 ( 0.878000)

100k * 100 attr sets, 2 ivars
1.400000 0.000000 1.400000 ( 1.434206)
1.129000 0.000000 1.129000 ( 1.128000)

100k * 100 ivar gets, 4 ivars
0.490000 0.000000 0.490000 ( 0.502097)
0.340000 0.000000 0.340000 ( 0.340000)

100k * 100 ivar sets, 4 ivars
0.690000 0.000000 0.690000 ( 0.696852)
0.389000 0.000000 0.389000 ( 0.389000)

100k * 100 attr gets, 4 ivars
0.970000 0.010000 0.980000 ( 0.986163)
0.872000 0.000000 0.872000 ( 0.872000)

100k * 100 attr sets, 4 ivars
1.370000 0.010000 1.380000 ( 1.394921)
1.128000 0.000000 1.128000 ( 1.128000)

100k * 100 ivar gets, 8 ivars
0.500000 0.000000 0.500000 ( 0.519511)
0.344000 0.000000 0.344000 ( 0.344000)

100k * 100 ivar sets, 8 ivars
0.690000 0.000000 0.690000 ( 0.710896)
0.389000 0.000000 0.389000 ( 0.389000)

100k * 100 attr gets, 8 ivars
0.970000 0.000000 0.970000 ( 0.987582)
0.870000 0.000000 0.870000 ( 0.870000)

100k * 100 attr sets, 8 ivars
1.380000 0.000000 1.380000 ( 1.400542)
1.132000 0.000000 1.132000 ( 1.132000)

100k * 100 ivar gets, 16 ivars
0.500000 0.000000 0.500000 ( 0.523690)
0.342000 0.000000 0.342000 ( 0.343000)

100k * 100 ivar sets, 16 ivars
0.680000 0.000000 0.680000 ( 0.707385)
0.391000 0.000000 0.391000 ( 0.391000)

100k * 100 attr gets, 16 ivars
0.970000 0.010000 0.980000 ( 1.017880)
0.879000 0.000000 0.879000 ( 0.879000)

100k * 100 attr sets, 16 ivars
1.370000 0.010000 1.380000 ( 1.387713)
1.128000 0.000000 1.128000 ( 1.128000)

100k calls to a method containing 5x a for loop over a 10-element range
0.890000 0.000000 0.890000 ( 0.917563)
0.654000 0.000000 0.654000 ( 0.654000)

All of this is really good, of course. Lots of green. And the red parts are not that bad either.

But here is the mystery. General Rails performance sucks. This test case can be ran by checking out petstore from tw-commons at RubyForge. There is a file called script/console_bench that will run the benchmarks. The two commands ran was: ruby script/console_bench production, and jruby -J-server -O script/console_bench ar_jdbc. With further ado, here are the numbers:

controller :   1.000000   0.070000   1.070000 (  1.472701)
controller : 2.144000 0.000000 2.144000 ( 2.144000)
view : 4.470000 0.160000 4.630000 ( 4.794336)
view : 6.335000 0.000000 6.335000 ( 6.335000)
full action: 8.300000 0.400000 8.700000 ( 9.597351)
full action: 16.055000 0.000000 16.055000 ( 16.055000)

These numbers plainly stink. And we can’t find the reason for it. As you can see, most benchmarks are much better, and there is nothing we can think of that would add up to a general degradation like this. It can’t be IO since these tests use app.get directly. So what is it? There is very likely to be one or two bottlenecks causing this problem, and when we have found it, Rails will most likely blaze along. But profiling haven’t helped yet. We found a few things, but there is still stuff missing. It’s an interesting problem.

My current thesis is that either symbols or regexps are responsible. I’ll spend the day checking that.

On ActiveRecord-JDBC performance

I have been a bit concerned about the performance of our component that connects ActiveRecord with JDBC. Since ActiveRecord demands that every result of a select should be turned into a big array of hashes of strings to strings, I suspected we would be quite inefficient at this, and I wasn’t sure I could put all my faith in JDBC either.

So, as a good developer, I decided to test this, with a very small microbenchmark, to see how bad the situation actually was.

Since I really wanted to check the raw database and unmarshalling performance, I decided to not use ActiveRecord classes, but do executions directly. The inner part of my benchmark execution looks like this:

conn.create_table :test_perf, :force => true do |t|
t.column :one, :string
t.column :two, :string

100.times do
conn.insert("INSERT INTO test_perf(one, two) VALUES('one','two')")

1000.times do
conn.select_all("SELECT * FROM test_perf")

conn.drop_table :test_perf

It is executed with a recent MySQL Community Edition 5 server, locally, with matching JDBC drivers. The MRI tests is run with 1.8.6, and both use ActiveRecord 1.15.3. ActiveRecord-JDBC is a prerelease of 0.2.4, available from trunk. My machine is an IBM Thinkpad T43p, running Debian. It’s 32bit and Java 6.

The results were highly interesting. First, let’s see the baseline: the Ruby results:

      user     system      total        real
7.730000 0.020000 7.750000 ( 8.531013)

Frankly, I wasn’t that impressed with these numbers. I thought Ruby database performance was better. Oh well. The interesting part is the JRuby AR-JDBC results:

      user     system      total        real
6.948000 0.000000 6.948000 ( 6.948000)

WOW! We’re actually faster in this quite interesting test. Not what I had expected at all, but very welcome news indeed. Note that there is still much block overhead in the interpreter, so the results are a little bit skewed in MRI’s favour by this, too.

Ragel performance

I did some performance testing on the old and new Resolver implementation. The testing have some stupid tests that exercise bad parts of both implementations (like longest match, where it can’t be decided what type something is until we have to backtrack about 20 characters). I placed these 24 strings in an array, and pounded on it with an instance of the ResolverImpl that is used in exactly the same way on all scalar values in an YAML document. The objective is to find out if the value is an implicit type or not. So basically, we give it a String, and get back a tag URI. So it’s not like I’m parsing a language or anything. I’m just doing some recognizing here.

The old implementation was based on a Map> where the first letter of the string to resolve was used as an index to find a list of patterns to try sequentially. This worked fine, and made it extensible. But not very fast. This is the baseline. For 24 different strings, iterated 100 000 times for 2 400 000 resolves it takes 7879ms. That’s OK, but not great.

Now, the new Ragel implementation is dead simple. It’s just a translation of the regexps in the aforementioned Pattern’s into a state machine. At EOF out actions (%/ for people in the Ragel knowhow), I execute an action that sets a local variable to a tag, and at the end of the resolve method returns that tag. Dead simple, and not exercising the full strength of Ragel, of course.
So, for the same number of resolves, this ResolverImpl takes 1288ms. That’s 611% improvement in speed. Ain’t it nice to have a friend such as Ragel? And the best part is, for harder tasks, these improvements would be even larger.
Finite State Machines are your friends. All your base are belongs to us.

Results of jvYAMLb

Well, the YAML-based loading is in JRuby trunk. On the way, some parts of the codebase got seriously simplified. Very nice. The final result, with regard to performance, is about 20-30% on speed. But the important gain is in memory usage. The new implementation takes only about one fourth of the memory the original used. So that’s great.

Regarding the Resolver, as I mentioned in the last post, it required a different approach, since regular JvYAML uses regular expressions to recognize implicit tags. Since that approach isn’t good with byte arrays, I decided to use Ragel to generate a recognizer. That approach was very successful. As soon as I got that working it was the obvious approach. Ragel is good. Ragel is great. Ragel is wonderful. I will use the same approach for regular JvYAML to get away from all those Java regexps.

So, next step will be to do the same conversion of the emitter. Of course, at that point performance isn’t that important. It’s more about memory usage and the need to get away from another external dependency in JRuby.