ThoughtWorks JRuby Geek Night in Pune

I should really have blogged this earlier, but the last few weeks have been hectic. Anyway, better late than never, eh?

Tonight – that is Wednesday, March 24th – ThoughtWorks Pune will host a geek night where I will talk about JRuby. I will talk a bit about what’s coming in the upcoming 1.5 release, and other fun things happening in JRuby land.

If you’d like to come, you can find more information and registration here.


I will speak at ÜberConf in Denver in June. Should be lots of fun! I will talk about JRuby and building languages. I might also possibly cover Ioke – we’ll see what happens.

Conference Hat Trick – QCon, RubyConf, JRubyConf

I’ve just come back from several different conferences. It’s been tiring but also very rewarding. The conferences I attended and presented at were QCon San Francisco, RubyConf and JRubyConf. I thought I’d just mention some of the highlights from these three events.

First QCon – after JAOO, QCon is my favorite conference. They always manage to put together an interesting week with great speakers and lots of things to learn. This year, Martin Fowler and I did a full-day tutorial about domain specific languages.

During the Wednesday I spent most of my time hanging out and chatting with people. I did attend Josh Bloch’s and Bob Lee’s Java Puzzler presentation. This is always an entertaining hour. I also enjoyed Douglas Crockford’s keynote about the history and future of JavaScript. Hearing how this all happened is always enlightening.

On the Thursday I had my track about languages. I think it went very well; my speakers did a great job. Eishay Smith talked about Scala, Stu Halloway about Clojure, Martin Fowler about Ruby, Jonathan Felch about Groovy and Amanda Laucher and Josh Graham about F#. I’m very happy with how it went, actually.

During Friday I mostly sat in on Neal Ford’s DSL track. My colleague Brian Guthrie started out with a strong hour about internal DSLs in various languages. Ioke got a few code examples, which was fun. After that Neal and Nate Schutta talked about MPS. I hadn’t seen this much detail about MPS before, so it was helpful.

After lunch Don Box and Amanda Laucher did a talk about the technology formerly known as Oslo. I didn’t think this tech was anything cool at all until I saw this presentation. In retrospect this was probably my favorite presentation of the conference. What came together was how you can use M as a fully typed language with some interesting characteristics, and also the extremely powerful debug features. It’s nice indeed.

Glenn Vanderburg put forward some arguments against language workbenches. This made for an interesting hour but I’m not entirely sure I buy his arguments. And after that Magnus Christerson from Intentional showcased what they’ve been working on lately. Very impressive stuff as usual.

I only spent one day at RubyConf, but it was still enough to get a feeling for what was going on, spend some time with several people I hadn’t met before and so on. Good times. Charles Nutter did a very good presentation about his Ruby mutants (Duby and Surinx). After that Ryan Davis and Aaron Patterson did a hilarious presentation about weird software.

JRubyConf was a total success. All of the presentations were very interesting, and provided insight into what people liked about JRuby and what they wanted from it. It was fantastic to see so many people come together just for JRuby. It’s great to be part of that. I did a presentation about testing with JRuby, and then I was part of the closing panel. Both went well.

All in all a great week of conferences.

QCon San Francisco, RubyConf and JRubyConf

I’m gearing up for the next conference stretch. This time it’s San Francisco next week, and I really hope to see lots of people at these conferences – they are gearing up to be something special.

First QCon San Francisco. Except for JAOO, QCon is the best general developer conference I’ve ever been to. Go check out the schedule. This year I’m very excited about doing a full day tutorial about domain specific languages together with Martin Fowler.

I’m also in charge of the languages track, where I have five people who will talk about their experiences with different languages. This time there will not be much introduction to the languages, but instead experience reports, objective descriptions of what worked, what didn’t work and how you can improve your chances of success. The languages covered are Scala, Clojure, Ruby, Groovy and F#. Should be great fun.

Hopefully I will have lots of time to see other presentations too. There are many I would love to see. ThoughtWorks also happens to be a sponsor of QCon, so there will be a booth where there’s a good chance you can find me or my colleagues.

I will do one day of RubyConf – the Saturday. Funnily enough I haven’t ever been to RubyConf, so I’m looking forward to this too.

Finally, the first ever JRubyConf will happen next Sunday. The program looks really interesting. I’m going to be talking about testing, and also be part of the ending JRuby Core Team panel.

I’m very excited about these conferences. Hope to see you there!

Tutorial at JAOO about JRuby testing

Just thought I’d mention it here – I’m at JAOO this year and will give a tutorial about testing Java with JRuby. It will be a great tutorial and I hope to see many of you there.

Charles, Tom and Nick to EngineYard – and the future of JRuby

Most people have already heard the news that Charles, Tom and Nick are going to Engine Yard to work on JRuby. I’ve been asked for my opinion by a few people, and I’ve also seen some common reactions that I would like to comment on. Of course I only speak for myself, not for Charles, Tom or Nick, and definitely not for Sun, Oracle or Engine Yard.

Let’s get the congratulations in order first. This is great news for Charles, Tom and Nick, and I definitely wish them well at their new work. I totally understand their move and would have done the same thing if I had been in the same situation.

This is also good news for the JRuby project. The main concern from Charles and company has been to ensure that the JRuby project doesn’t suffer – that has been the overriding concern in this decision. Of course, having Nick be able to work on JRuby proper will also be great. Another full time resource.

Now for some of the comments and worries. Tim Anderson writes about it in his blog. The problem with some of the conclusions in that post, especially that Oracle should have done a better job of reassuring Charles & co about the future of JRuby, is that they go totally against what is even possible for a company in this situation to do. I’ve heard this comment from several different places, so let me make this very plain. It would have been grossly illegal for any representative from Oracle to give ANY indication to Charles, Tom or Nick about what their intentions for JRuby were. It will continue to be this way until the buyout is done. For all we know, Charles, Tom and Nick might have asked any Oracle person they could find what would happen, but they wouldn’t have been able to get an answer they could rely on. That’s how these things work.

Seeing as that uncertainty would be around for quite some time, and since this merger is pretty big, it was reasonable from the JRuby guys’ perspective to assume that Oracle wouldn’t give any indication for quite some time. During that time the JRuby development would be in jeopardy. So they made a decision to ensure the safety of the project. (When I say safety of the project, I of course mean continued full time resources for working on it.) From this perspective they didn’t really have any choice. This is no indication whatsoever of anything else. It is no indication of Oracle’s future Java strategy, and it is no indication of what will happen with languages on the JVM in the future. It is just a rational decision based on what can be known right now.

Many from the Ruby and JRuby community have expressed concerns that Engine Yard is primarily a Rails company, and that Rails bugs will take priority over Java integration or other pieces of the JRuby story. This is simply not true. Read any interview with Charles or any of the official announcements. The JRuby focus from Engine Yard will definitely not have overriding Rails concerns.

Another worry I’ve heard is that Engine Yard now “owns” core developers for MRI, Rubinius and JRuby, and as such can use this power to control the future of Ruby. <insert evil laugh here>.

Yes. Engine Yard does have lots of power over the future of Ruby right now. Is that a bad thing? All the above projects are proper open source projects, and nothing EY can do will stop that. EY is a next generation company. They understand open source and they swear by it. Just look at how much internal infrastructure they have opened up and released for general consumption. There can be no doubt that EY believes in open source.

If you’re really worried though… This is your chance to influence things. Submit patches to MRI, Rubinius or JRuby. Contribute enough and you will become a core developer, and you will have as much power as Engine Yard or any of the other core developers. (Remember that only 3 of the 8ish JRuby core developers work for Engine Yard). Once again – if you’re worried, do something about it. Don’t spread FUD.

Personally, I think the future of Ruby is looking bright.

Porting Syck – A story of C, parser generators and general ugliness

As mentioned earlier, a few weeks back I decided to port Syck to Java, for the sake of JRuby. I will detail some interesting experiences in this blog post.

First some introductions. Syck is broadly divided into two pieces – the core library and the language adaptations. The language adaptations are the stuff that is specific to each language and provides the language level APIs. I won’t talk that much about this side of things – most of this post is about the core library.

Any YAML processor is divided into a parser and an emitter. The Syck emitter is pretty straightforward, and so is the Yecht port of it. The parser, on the other hand, needs to be able to do several different things. First of all, it needs to be able to handle the fact that YAML is context dependent, which means you can’t do it with a typical parser generator. So the way all YAML engines do it is by having a pretty smart scanner/tokenizer, and then a straightforward parser on top of this. The scanner takes care of the context dependent bits (which mainly concern indentation) and abstracts them away so the parser can just make sure to put documents together.
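
To make that split concrete, here is a minimal sketch in Java of a scanner that turns indentation changes into explicit tokens, so that whatever sits on top of it never has to look at whitespace. This is not the Syck or Yecht scanner: the class name IndentScanner and the token names INDENT, DEDENT and LINE are invented for illustration, and real YAML scanning has to handle far more cases (flow collections, block scalars, inconsistent indentation and so on).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not the Syck/Yecht scanner: converts leading
// spaces into INDENT/DEDENT tokens so a plain grammar can consume them.
final class IndentScanner {
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Integer> levels = new ArrayDeque<>();
        levels.push(0); // column of the outermost block

        for (String line : input.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // blank lines carry no indentation information
            }
            int column = 0;
            while (column < line.length() && line.charAt(column) == ' ') {
                column++;
            }
            if (column > levels.peek()) {
                // Deeper than the current block: open a new one.
                levels.push(column);
                tokens.add("INDENT");
            }
            while (column < levels.peek()) {
                // Shallower: close blocks until we are back at this column.
                // Inconsistent indentation is not reported in this sketch.
                levels.pop();
                tokens.add("DEDENT");
            }
            tokens.add("LINE(" + line.trim() + ")");
        }
        // Close any blocks that are still open at end of input.
        while (levels.peek() > 0) {
            levels.pop();
            tokens.add("DEDENT");
        }
        return tokens;
    }
}
```

Calling IndentScanner.scan("a:\n  b: 1\n  c: 2\nd: 3") yields LINE(a:), INDENT, LINE(b: 1), LINE(c: 2), DEDENT, LINE(d: 3). A parser that only sees this token stream can be written with an ordinary context-free grammar.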

Another piece that is generally necessary is something that checks what type a value has. Since YAML supports several different core types, and a YAML engine should always strive to read things without needing hints, the processor needs to know that the YAML scalar “foo” has tag “tag:yaml.org,2002:str” while “42” has tag “tag:yaml.org,2002:int”. Of course there are many more types, including several different versions of integers, floats and timestamps. All of this handling is incidental to the parsing of the document, though. A scalar is a scalar no matter what type it is. So the recognizing of implicits is generally not done in the token scanner or the parser, but in a separate scanner.
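
As a rough sketch of what such an implicit resolver does, the Java class below maps a plain scalar to a tag using a handful of regular expressions. It is an illustration only: the class name ImplicitResolver and the method tagFor are made up, the patterns are deliberately simplified, and Syck recognizes many more forms (hex and octal integers, timestamps, special floats and others). The tag strings are the full names from the YAML type repository.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of implicit typing: maps a plain scalar to a tag.
// Only a few simplified patterns are covered; this is not Syck's rule set.
final class ImplicitResolver {
    private static final Pattern INT = Pattern.compile("[-+]?[0-9]+");
    private static final Pattern FLOAT = Pattern.compile("[-+]?[0-9]+\\.[0-9]+");
    private static final Pattern BOOL = Pattern.compile("true|false");

    static String tagFor(String scalar) {
        if (scalar.equals("~")) {
            return "tag:yaml.org,2002:null";
        }
        if (INT.matcher(scalar).matches()) {
            return "tag:yaml.org,2002:int";
        }
        if (FLOAT.matcher(scalar).matches()) {
            return "tag:yaml.org,2002:float";
        }
        if (BOOL.matcher(scalar).matches()) {
            return "tag:yaml.org,2002:bool";
        }
        // Anything that matches no implicit pattern is a plain string.
        return "tag:yaml.org,2002:str";
    }
}
```

The point of keeping this separate is visible in the signature: it takes a finished scalar and returns a tag, and it needs no knowledge of indentation or document structure.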

Syck uses re2c for the token scanner and the implicit scanner, and it uses Bison for the parser. My first version of Yecht was as straightforward as I could make it. I ported the backend of re2c to generate Java, and then I used JACC instead of Bison. So far so good. But when it was done, this initial version was pretty slow. I benchmarked the core library by supplying a node handler that didn’t do anything, and then did the same thing for Syck. I also added a small piece that just ran the scanner. And what I saw got me a bit depressed. I had gotten Yecht to be totally compatible with Syck, but it was buttslow. For comparison I used a random big YAML document without any really advanced features, just a combination of mappings, sequences and scalars. For reference, the Syck numbers were these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  13959ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  16101ms

And JvYAMLb (the engine I was replacing in JRuby):

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   5325ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15794ms

And my first version of Yecht:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  93658ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took 117213ms

Ouch. The scanner is 7 times slower, and so is the parsing. And comparing with JvYAMLb, it looks even worse. What’s going on here?

As it happens, I didn’t tell you exactly how I ported the token scanner. The Syck implementation of this used about 10 different gotos to jump between different cases. Now, that’s a pretty fast operation in C. Using Java, I had to do something different. Since I wanted to make the first version of Yecht as close a port as possible, I decided to use a standard transformation to mimic the gotos. In Java you can do this by wrapping the area with a while(true) loop that has a label – and then you can use a variable that contains a number to indicate which goto point to go to next. The actual code lives in a switch statement inside of the while-loop. This code is a bit ugly, but if you define some constants you end up with code that looks reasonably close to the C code. Just replace “goto plain3;” with “gotoPoint = plain3; break gotoNext;”.
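
For readers who haven’t seen this transformation, here is a small self-contained sketch of it in Java. It is a toy that counts runs of digits, not the Yecht yylex; the class name GotoStyleScanner and the goto points START, IN_NUMBER and DONE are invented for the example. One assumption to flag: the label is placed on the switch inside the loop, so that “break gotoNext;” leaves the switch and the while(true) immediately re-dispatches on the new gotoPoint. The real code may have arranged the label differently.

```java
// Hypothetical illustration of emulating C gotos in Java with a loop,
// a state variable and a switch. Not taken from Yecht.
final class GotoStyleScanner {
    // Constants standing in for the C goto labels.
    private static final int START = 0;
    private static final int IN_NUMBER = 1;
    private static final int DONE = 2;

    // Counts runs of consecutive digits in the input.
    static int countNumbers(String input) {
        int pos = 0;
        int count = 0;
        int gotoPoint = START;

        while (true) {
            gotoNext:
            switch (gotoPoint) {
                case START:
                    if (pos >= input.length()) {
                        gotoPoint = DONE; break gotoNext;      // goto done;
                    }
                    if (Character.isDigit(input.charAt(pos))) {
                        count++;
                        gotoPoint = IN_NUMBER; break gotoNext; // goto in_number;
                    }
                    pos++;
                    gotoPoint = START; break gotoNext;         // goto start;
                case IN_NUMBER:
                    if (pos < input.length() && Character.isDigit(input.charAt(pos))) {
                        pos++;
                        gotoPoint = IN_NUMBER; break gotoNext; // goto in_number;
                    }
                    gotoPoint = START; break gotoNext;         // goto start;
                case DONE:
                    return count;
                default:
                    throw new IllegalStateException("unknown goto point " + gotoPoint);
            }
        }
    }
}
```

Even in this toy, every jump costs a trip around the loop and through the switch, and all the logic ends up in one method that grows with every goto point added.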

The first thing I did after finding everything was slow was to try to pinpoint the place where all the performance disappeared. This is actually surprisingly hard. But in this case it turned out to be yylex, which contains the full token scanning logic including my homegrown gotos. Then Charles reminded me about one of those lessons we have learned while developing JRuby – namely that Hotspot really doesn’t like large switch statements. So my first step was to try to untangle the logic of all the gotos.

That was actually not that hard, since the logic bottomed out quite quickly. I managed to replace the switch-based goto-logic with separate methods that called each other instead. That was the only thing I did, and suddenly the scanner wasn’t a problem anymore. These were the numbers after I removed that goto-logic:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12207ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  31280ms

Yes, you’re reading that right. The scanning process became 7 times faster by just removing that switch statement and making it into separate methods. Hotspot really doesn’t like large switch statements, and it does like smaller methods. In this case it also made it easier to find the methods that were hotspots in the scanner too.
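
The shape of that refactoring can be sketched as follows. This is a toy scanner that counts runs of digits, with each former goto point turned into its own small method; it is not the Yecht code, and the names MethodStyleScanner, start and inNumber are made up for the illustration.

```java
// Hypothetical illustration: goto points expressed as small methods that
// call each other, instead of cases in one large switch. Not from Yecht.
final class MethodStyleScanner {
    private final String input;
    private int pos;
    private int count;

    private MethodStyleScanner(String input) {
        this.input = input;
    }

    // Counts runs of consecutive digits in the input.
    static int countNumbers(String input) {
        MethodStyleScanner scanner = new MethodStyleScanner(input);
        scanner.start();
        return scanner.count;
    }

    // Former "start" goto point: skips non-digits, hands off digit runs.
    private void start() {
        while (pos < input.length()) {
            if (Character.isDigit(input.charAt(pos))) {
                count++;
                inNumber();
            } else {
                pos++;
            }
        }
    }

    // Former "in_number" goto point: consumes the rest of a digit run.
    private void inNumber() {
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
    }
}
```

Each method is now short enough to read on its own, and a profiler reports time per method rather than lumping everything into a single giant scanning routine.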

After this fix it was obvious that the parsing component was the main problem now. Since the approach of removing large switch statements had worked so well in the scanner, I decided to try the same approach in the parser. The parser generator I used – JACC – happens to be one of those generators that generates only code, no tables. Once upon a time this was probably the right behavior for Java, to get good performance, but it’s definitely not the right choice anymore. So I switched from JACC to Jay, and ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12960ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15411ms

Nice, huh? At this point I could have felt good about myself and stopped. But I decided to see if re2c/re2j was a bottleneck too. The two main methods in the token scanner that are used on basically all calls are “document()” and “plain()”. I rewrote these sections by hand. Re2j generates switch statements that in most cases take more than one jump to get to the right place. By doing some thinking on these algorithms I made the shortest path much shorter in these two methods. After fixing document() I got these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  11226ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  14059ms

And after fixing plain() I got it down to these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   9581ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12838ms

At this point I had a hunch, and decided to check the performance of the implicit-scanner. The first thing I did was write tests to check how fast it was, and compare it with JvYAMLb and Syck. My hunch was right: the re2j-based implicit-scanner was 10 times slower than the equivalent in Syck. The implicits are called during scanning and parsing when a scalar node is done, so I thought it might contribute to the performance. But instead of rewriting it by hand, I decided to move from re2j to Ragel. Since the implicit scanner is pretty limited, this was a move that worked there, but would never have worked for the token scanner. Ragel generates finite state machines in tables, and presto – it turned out to work like a charm. After rewriting the implicit scanner I ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   4804ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took   8424ms
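
To show what “finite state machines in tables” means in practice, here is a hand-written sketch of a table-driven recognizer for simple integers. It is not Ragel output and not the Yecht implicit scanner; the class name TableDrivenIntMatcher, the state names and the single pattern (an optional sign followed by digits) are assumptions chosen to keep the example short.

```java
// Hypothetical illustration of a table-driven state machine, the style of
// code a generator like Ragel can emit. Hand-written; not from Yecht.
final class TableDrivenIntMatcher {
    // Character classes (columns of the transition table).
    private static final int SIGN = 0;
    private static final int DIGIT = 1;
    private static final int OTHER = 2;

    // States (rows of the transition table). ERROR is a trap state.
    private static final int START = 0;
    private static final int SIGNED = 1;
    private static final int DIGITS = 2;
    private static final int ERROR = 3;

    private static final int[][] TRANSITIONS = {
        /* START  */ { SIGNED, DIGITS, ERROR },
        /* SIGNED */ { ERROR,  DIGITS, ERROR },
        /* DIGITS */ { ERROR,  DIGITS, ERROR },
        /* ERROR  */ { ERROR,  ERROR,  ERROR },
    };

    private static int classify(char c) {
        if (c == '+' || c == '-') {
            return SIGN;
        }
        if (c >= '0' && c <= '9') {
            return DIGIT;
        }
        return OTHER;
    }

    // Returns true if the scalar is an optional sign followed by digits.
    static boolean isInt(String scalar) {
        int state = START;
        for (int i = 0; i < scalar.length(); i++) {
            // One array lookup per character, no branching on the state.
            state = TRANSITIONS[state][classify(scalar.charAt(i))];
        }
        return state == DIGITS;
    }
}
```

The matching loop stays the same size no matter how many patterns are added; only the table grows. That is the opposite trade-off from generated switch statements, where the code itself grows with the grammar.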

And that’s where I decided to stop. I’m not sure what the moral of this tale is, except to generate as much as possible, be careful with switch statements on Hotspot, and rewrite by hand the pieces that really matter.

Also, use the right tool for the job. JACC was obviously not right for me, but Jay seems to be. Re2j is still right for some parts of the token scanner, while Ragel is right for the implicit-scanner. Of course, using the right tool for the job means knowing the alternatives. I knew about Ragel and I know the implementation of both re2j and Ragel, so I could make educated decisions based on their characteristics.

New JRuby YAML support with Yecht

A while back I finally got fed up with all our minor YAML incompatibilities. As I’ve been in charge of the YAML support in JRuby for most of the time, this is something I take personally. I’ve written several YAML processors now, and I decided it was time once and for all to make sure we were totally compatible with MRI.

As it happens, the incompatibilities in JRuby’s YAML support can be divided into two categories – the first category are those things that can’t easily be done with JvYAML since they depend on internals of Syck. More and more of these started cropping up, especially for customizing serialization and loading, but also in how the parsing behavior worked and so on.

The second category is a bit more annoying. These bugs are based on invalid YAML that MRI emits or parses even though it is invalid. Syck happens to be a bit loose and nice – and it’s also a YAML 1.0 processor. JvYAML started life as a YAML 1.1 processor, and it was pretty strict. During the last year I’ve crippled JvYAML, making it more 1.0 compatible and less strict to make it closer to Syck. But at the end of the day full Syck compatibility would never be possible from within JvYAMLb.

So I started hacking on Yecht. Two weeks later it is now merged into JRuby trunk. Yecht is a proper port of Syck that matches Syck semantics more or less to the letter – including bugs. Don’t believe me? Just try “, nil, nil).kind” on MRI and “, nil, nil).kind” on JRuby and see…

As it happens, the story of how I ported Syck is quite interesting, so I will write a separate post about that, focusing on some of the more impressive performance improvements I managed to squeeze out of the parser.

But the short story is this: JRuby’s YAML support is now better than ever, and much more compatible with how MRI does things. All open YAML bugs in JRuby’s bug tracker have been closed, and all tests run as they should.

New Hpricot for JRuby

One of the annoying things for JRuby has been the lack of recent support for Hpricot. Hopefully that will all be in the past now, or at least in a few days’ time. I spent some time hacking the latest version of Hpricot to work on JRuby, and as of now, my fork of it runs all tests like it should.

I first want to talk quickly about Nokogiri. Someone on Twitter mentioned that we should be using Nokogiri instead of Hpricot. That’s all good and well, except Nokogiri depends on libxml2 and libxslt2. I know I have done some crazy porting of C-libraries to Java earlier, but I won’t do those two. Ever. So the only chance Nokogiri will have is if someone reimplements the backend to use native Java XML libraries. That wouldn’t be extremely hard, but it would definitely be more time consuming than fixing Hpricot.

So, that’s why I’ve fixed Hpricot. I’ve sent a pull request to Why, and hopefully the Java stuff will all be in his repository soon. Until then you can build your own version of the gem by cloning, and installing that.

For those of you who have used the earlier JRuby compatible Hpricot versions, I dare say this version is much faster in many ways. I learned a bit about how to write this kind of port when doing Yecht, and I think that shows.

Second day of JavaOne

The second day of JavaOne ended up being not as draining as the first one, although I had lots of interesting times this day too. I’ve divided it into two blog posts – this is about what happened at JavaOne, and the next one will be about the Clojure meetup.

The first session of the day was Nick Sieger’s talk about using JRuby in production at Kenai. An interesting talk about some of the things that worked, and some of the things that didn’t work. A surprising number of decisions were made by fiat since they needed to use Sun products for many things.

After that Neal Ford gave a comparison between JRuby and Groovy. I don’t have much to say about this talk except that some things seemed to be a bit more complicated to achieve in Groovy than in Ruby.

As it turns out, the next talk was my final talk of the day. This was Bob Lee (crazy bob) talking about references and garbage collection on the JVM. A very good talk, and I learned about how the Google Collections MapMaker actually solves some of my Ioke problems. I ended up integrating it during the evening and it works great.

The second day had fewer talks for me – but I still had a very good time and even learned some stuff. Nice.