RubyFoo


I spent this Friday and Saturday in London at the RubyFoo conference, organized by Trifork. RubyFoo is a small pre-conference to the larger JAOO conference. As you might expect, it’s focused on Ruby, and it’s quite small. On the Friday we were about 50 people, and on the Saturday about 40. The small number of people, and the fact that all presentations were in the same track, made it much easier to network and communicate. I liked the focus this gave the conference, and it was also an excellent opportunity to meet new people and get new ideas.

On the Friday there were five presentations, and on the Saturday there was an open space. The five presentations were all focused on the area of communicative programming. I talked about JRuby and did several demonstrations of how JRuby can be used to call out to different languages. My examples included talking to Clojure, Erlang and Haskell.

After me, Aslak Hellesøy talked about Cucumber and how Cucumber supports lots of different programming languages. Very cool. Aslak always gives good presentations.

We then had lunch, and then Sam Aaron gave an interesting talk about communicative programming, and the essence of what we are doing. Very cerebral, and definitely something that sparked lots of thoughts in people’s minds.

Adam Wiggins gave a talk about Heroku. I haven’t actually tried Heroku yet, but it looks very cool.

Finally, Matz gave a talk about the different styles of programming in Ruby, tied in with his history of creating Ruby and what the inspirations were. Very nice.

On the Saturday my colleague Dan North facilitated the open spaces discussions. I gave a 30 minute talk about Ioke – people seemed to enjoy it. After that, Dan North, Aslak, a few others and I had a discussion about static versus dynamic typing.

After lunch I held a discussion about Ruby 1.9, getting some ideas why people weren’t using it, and what problems the people using it had encountered.

Finally, Aslak, Sam and I sat down to add Ioke support to Cucumber. This went really well – and I liked pairing with Aslak. Sadly I couldn’t stay until we were done, but Aslak and the others continued while I headed out to the airport.

All in all, RubyFoo was a great conference, and I hope they can keep the same size the next time. 50 people was really a great size, and I liked the discussions we had.



A new parser for Ioke


Last week I finally bit the bullet and rewrote the Ioke parser. I’m pretty happy with the end result, actually, but it did involve moving away from Antlr as the parser generator. In fact, the new parser is handwritten – and as such goes against my general opinion to generate everything possible. I would like to quickly take a look at the reasons for doing this and also what the new parser will give Ioke.

For reference, the way the parser used to work was that the Antlr-generated lexer and parser gave the Ioke runtime an Antlr tree structure. This tree structure was then walked and transformed into chained Messages, which is the AST that Ioke uses internally. Several other things were also done at this stage, including separating message chains on comma borders. Most significantly, the processing to put together interpolated strings and regular expressions happened at this stage. Sadly, the code to handle all that was complex, ugly, slow and frail. After this stage, operator shuffling happened. That part is still the same.

There were several problems I wanted to solve, but the main one was the ugliness of the algorithm. It wasn’t clear from the parser how an interpolated expression mapped into the AST, and the generated code added several complications that frankly weren’t necessary.

Ioke is a language with an extremely simple base syntax. It is only slightly more complicated than the typical Lisp parser, and there are almost no parser-level productions needed. So the new parser does away with the lexer/parser distinction and does everything in one pass. There is no need for lookahead at the token level, so this turns out to be a clear win. The code is actually much simpler now, and the Message AST is created inline in the new parser. When it comes to interpolation, instead of the semantic predicates and global stacks I had to use in the Antlr parser, I just do the obvious recursive interpolation. The code is simple to understand and quite efficient too.
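To make the recursive interpolation idea concrete, here is a minimal sketch in Java (hypothetical names and structure – this is not Ioke’s actual parser code): when the reader hits “#{” inside a string literal, it recurses into an expression reader and resumes the literal afterwards, with no semantic predicates or global stacks needed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of recursive interpolation parsing. A literal like
// "Hello #{name}!" is split into text parts and expression parts; the
// expression parts would be handled by recursing into the main parser.
public class InterpolationSketch {
    // Parses the inside of a string literal (after the opening quote),
    // returning the pieces in order. pos is a one-element "pointer".
    static List<String> parseLiteral(String src, int[] pos) {
        List<String> parts = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        while (pos[0] < src.length()) {
            char c = src.charAt(pos[0]);
            if (c == '"') { pos[0]++; break; }              // closing quote
            if (c == '#' && pos[0] + 1 < src.length()
                         && src.charAt(pos[0] + 1) == '{') {
                parts.add("text:" + text);
                text.setLength(0);
                pos[0] += 2;                                 // skip "#{"
                parts.add("expr:" + parseExpression(src, pos));
            } else {
                text.append(c);
                pos[0]++;
            }
        }
        parts.add("text:" + text);
        return parts;
    }

    // Stand-in for the real expression parser: reads to the matching "}".
    // In a real parser this is where the recursion back into the full
    // message parser happens.
    static String parseExpression(String src, int[] pos) {
        StringBuilder expr = new StringBuilder();
        int depth = 1;
        while (pos[0] < src.length()) {
            char c = src.charAt(pos[0]);
            if (c == '{') depth++;
            if (c == '}' && --depth == 0) { pos[0]++; break; }
            expr.append(c);
            pos[0]++;
        }
        return expr.toString();
    }

    public static void main(String[] args) {
        int[] pos = {0};
        System.out.println(parseLiteral("Hello #{name}!\"", pos));
        // [text:Hello , expr:name, text:!]
    }
}
```

The real parser builds Message nodes instead of strings, of course, but the shape of the recursion is the point: interpolation falls out of a single recursive call.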

At the end of the day, I did expect to see some performance improvements too. They turned out to be substantial. Parsing is about 2.5 times faster, and startup speed has improved by about 30%. The distribution size will be substantially smaller since I don’t need to ship the Antlr runtime libraries. And building the project is also much faster.

But the real gain is actually in maintainability of the code. It will be much easier for me to extend the parser now. I can do nice things to make the syntax more open ended and more powerful in ways that would be very inconvenient in Antlr. The error messages are much better since I have control over all the error states. In fact, there are only 13 distinct error messages in the new parser, and they are all very clear on what has gone wrong – I never did the work in the old parser to support that, but I get that almost for free in the new one.

Another thing I’ve been considering is to add reader macros to Ioke – and that would also have been quite painful with the Antlr parser generator. So all in all I’m very happy about the new parser, and I think it will definitely make it easier for the project going forward.

This blog post is in no way saying that Antlr is bad. I like Antlr a lot – it’s a great tool. It just wasn’t the right tool for Ioke’s syntax.



ThoughtWorks Seminar and Tutorial in Stockholm


September 29th, ThoughtWorks will hold a day of seminars and a tutorial in Stockholm, Sweden. The seminars are free. I will talk about alternative languages, Martin Fowler will talk about software design in the 21st century, and another ThoughtWorks speaker will talk about DSLs for functional testing.

The tutorial is a half day tutorial given by Martin Fowler and me. We will talk about domain specific languages.

If this sounds interesting, go in and find more information and register here. Hurry, though – places are limited!



Tutorial at JAOO about JRuby testing


Just thought I’d mention it here – I’m at JAOO this year and will give a tutorial about testing Java with JRuby. It will be a great tutorial and I hope to see many of you there.



Upcoming talks


There hasn’t been much interesting happening this summer, but the fall is shaping up to be pretty busy. I will be talking at several different conferences, and thought I’d mention when and where I will be appearing.

First, this week I’m presenting at JavaZone in Oslo. I will present at 11:45 tomorrow, talking about Ioke.

Next week is the JVM Language Summit in Santa Clara. It is shaping up to be a great collection of people with many interesting discussions and talks. Take a look at the details for the talks. The people there are some of the most experienced language developers and implementors in the world. It should be a blast. I will do a talk about Ioke, and also a workshop about the challenges of improving Ioke’s performance.

After that I will attend RubyFoo in London, Oct 2-3, where I will talk about JRuby. RubyFoo will feature Matz, Sam Aaron, Aslak Hellesøy, Adam Wiggins and me. It should be great fun!

At JAOO this year (Oct 4-9 in Aarhus, Denmark) I will do a tutorial about testing Java code with JRuby. This conference also looks like it will be great. Many interesting talks and speakers. And of course, JAOO is generally the best conference I’ve ever been to.

At Øredev in Malmö, Sweden (Nov 2-6), I will be talking about Ioke.

And finally, at QCon SF in San Francisco (Nov 16-20) I will be hosting a track on emerging languages. After JAOO, QCon is my favorite conference, so I think it will be very nice too.

So, several interesting conferences coming up. Hope to see many of you there!



Charles, Tom and Nick to EngineYard – and the future of JRuby


Most people have already heard the news that Charles, Tom and Nick are going to Engine Yard to work on JRuby. I’ve been asked for my opinion by a few people, and I’ve also seen some common reactions that I would like to comment on. Of course I only speak for myself – not for Charles, Tom or Nick, and definitely not for Sun, Oracle or Engine Yard.

Let’s get the congratulations in order first. This is great news for Charles, Tom and Nick, and I definitely wish them well at their new work. I totally understand their move and would have done the same thing if I had been in the same situation.

This is also good news for the JRuby project. The main concern for Charles and company has been to ensure that the JRuby project doesn’t suffer – that was the overriding factor in this decision. Of course, having Nick able to work on JRuby proper will also be great. Another full-time resource.

Now for some of the comments and worries. Tim Anderson writes in his blog about it: http://www.itjoblog.co.uk/2009/07/jruby.html. The problem with some of the conclusions in that post – especially that Oracle should have done a better job of reassuring Charles & co about the future of JRuby – is that they go against what is even possible for a company in this situation to do. I’ve heard this comment from several different places, so let me make this very plain. It would have been grossly illegal for any representative of Oracle to give ANY indication to Charles, Tom or Nick about what their intention for JRuby was. It will continue to be this way until the buyout is done. For all we know, Charles, Tom and Nick might have asked any Oracle person they could find what would happen, but they wouldn’t have been able to get an answer they could rely on. That’s how these things work.

Seeing that this uncertainty would be around for quite some time, and since this merger is pretty big, it was reasonable for the JRuby guys to assume that Oracle wouldn’t give any indication for quite some time. During that time the JRuby development would be in jeopardy. So they made a decision to ensure the safety of the project. (When I say safety of the project, I of course mean continued full-time resources for working on it.) From this perspective they didn’t really have any choice. This is no indication whatsoever of anything else. It is no indication of Oracle’s future Java strategy, and it is no indication of what will happen with languages on the JVM in the future. It is just a rational decision based on what can be known right now.

Many from the Ruby and JRuby communities have expressed concerns that Engine Yard is primarily a Rails company, and that Rails bugs will take priority over Java integration or other pieces of the JRuby story. This is simply not true. Read any interview with Charles or any of the official announcements. The JRuby focus at Engine Yard will definitely not be dominated by Rails concerns.

Another worry I’ve heard is that Engine Yard now “owns” core developers for MRI, Rubinius and JRuby, and as such can use this power to control the future of Ruby. <insert evil laugh here>.

Yes. Engine Yard does have lots of power over the future of Ruby right now. Is that a bad thing? All the above projects are proper open source projects, and nothing EY can do will stop that. EY is a next generation company. They understand open source and they swear by it. Just look at how much internal infrastructure they have opened up and released for general consumption. There can be no doubt that EY believes in open source.

If you’re really worried though… This is your chance to influence things. Submit patches to MRI, Rubinius or JRuby. Contribute enough and you will become a core developer, and you will have as much power as Engine Yard or any of the other core developers. (Remember that only 3 of the 8ish JRuby core developers work for Engine Yard). Once again – if you’re worried, do something about it. Don’t spread FUD.

Personally, I think the future of Ruby is looking bright.



Porting Syck – A story of C, parser generators and general ugliness


As mentioned earlier, a few weeks back I decided to port Syck to Java, for the sake of JRuby. I will detail some interesting experiences in this blog post.

First some introductions. Syck is broadly divided into two pieces – the core library and the language adaptations. The language adaptations are the stuff that is specific to each language and provide the language-level APIs. I won’t talk that much about this side of things – most of this post is about the core library.

Any YAML processor is divided into a parser and an emitter. The Syck emitter is pretty straightforward, and so is the Yecht port of it. The parser, on the other hand, needs to be able to do several different things. First of all, it needs to be able to handle the fact that YAML is context dependent, which means you can’t do it with a typical parser generator. So the way all YAML engines do it is by having a pretty smart scanner/tokenizer, and then a straightforward parser on top of this. The scanner takes care of the context-dependent bits (which are mainly regarding indentation) and abstracts this away so the parser can just make sure to put documents together.

Another piece that is generally necessary is something that determines what type a value is. Since YAML supports several different core types, and a YAML engine should always strive to read things without needing hints, the processor needs to know that the YAML scalar “foo” has tag “tag:yaml.org,2002:str” while “42” has tag “tag:yaml.org,2002:int”. Of course there are many more types, including several different versions of integers, floats and timestamps. All of this handling is incidental to the parsing of the document, though. A scalar is a scalar no matter what type it is. So the recognizing of implicits is generally not done in the token scanner or the parser, but in a separate scanner.
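As a rough illustration of what an implicit scanner does (invented code, regex-based only for brevity – Syck and Yecht generate this scanner rather than using java.util.regex), it is essentially a function from scalar text to tag:

```java
import java.util.regex.Pattern;

// Rough sketch of an implicit scanner: map a scalar's text to a YAML tag.
// The real thing recognizes many more forms (timestamps, sexagesimal
// integers, etc.); plain regexes keep this example short.
public class ImplicitSketch {
    static final Pattern NULL_P  = Pattern.compile("~|null|Null|NULL|");
    static final Pattern BOOL_P  = Pattern.compile("yes|no|true|false|Yes|No|True|False");
    static final Pattern INT_P   = Pattern.compile("[-+]?(0|[1-9][0-9]*)");
    static final Pattern FLOAT_P = Pattern.compile("[-+]?[0-9]+\\.[0-9]*");

    static String tagFor(String scalar) {
        if (NULL_P.matcher(scalar).matches())  return "tag:yaml.org,2002:null";
        if (BOOL_P.matcher(scalar).matches())  return "tag:yaml.org,2002:bool";
        if (INT_P.matcher(scalar).matches())   return "tag:yaml.org,2002:int";
        if (FLOAT_P.matcher(scalar).matches()) return "tag:yaml.org,2002:float";
        return "tag:yaml.org,2002:str";        // everything else is a string
    }

    public static void main(String[] args) {
        System.out.println(tagFor("foo"));  // tag:yaml.org,2002:str
        System.out.println(tagFor("42"));   // tag:yaml.org,2002:int
        System.out.println(tagFor("3.14")); // tag:yaml.org,2002:float
    }
}
```

Note that this runs once per scalar in the document, which is why its speed matters so much later in this post.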

Syck uses re2c for the token scanner and the implicit scanner, and it uses Bison for the parser. My first version of Yecht was as straightforward a port as I could make it. I ported the backend of re2c to generate Java, and then I used JACC instead of Bison. So far all good. But when it was done, this initial version was pretty slow. I benchmarked the core library by supplying a nodehandler that didn’t do anything, and then did the same thing for Syck. I also added a small benchmark that just ran the scanner. And what I saw got me a bit depressed. I had gotten Yecht to be totally compatible with Syck, but it was buttslow. For comparison I used a random big YAML document without any real advanced features – just a combination of mappings, sequences and scalars. For reference, the Syck numbers were these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  13959ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  16101ms

And JvYAMLb (the engine I was replacing in JRuby):

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   5325ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15794ms

And my first version of Yecht:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  93658ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took 117213ms

Ouch. The scanner is 7 times slower, and so is the parsing. And comparing with JvYAMLb, it looks even worse. What’s going on here?

As it happens, I didn’t tell you exactly how I ported the token scanner. The Syck implementation of it used about 10 different gotos to jump between different cases. Now, that’s a pretty fast operation in C. Using Java, I had to do something different. Since I wanted to make the first version of Yecht as close a port as possible, I decided to use a standard transformation to mimic the gotos. In Java you can do this by wrapping the area with a while(true) loop that has a label – and then you use a variable containing a number to indicate which goto point to go to next. The actual code lives in a switch statement inside the while-loop. This code is a bit ugly, but if you define some constants you end up with code that looks reasonably close to the C code. Just replace “goto plain3;” with “gotoPoint = plain3; break gotoNext;”.
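In code, the transformation looks roughly like this (a made-up toy scanner, not the actual Yecht sources) – and this is exactly the shape that Hotspot turned out to dislike:

```java
// Toy illustration of emulating C gotos in Java: a labeled while(true)
// around a switch, with an int variable standing in for the goto target.
public class GotoEmulation {
    static final int PLAIN = 0, PLAIN2 = 1, PLAIN3 = 2, DONE = 3;

    // Counts characters, hopping between "goto points" the way the
    // ported C scanner hopped between labels.
    static int scan(String input) {
        int consumed = 0, i = 0;
        int gotoPoint = PLAIN;
        gotoNext:
        while (true) {
            switch (gotoPoint) {
            case PLAIN:
                if (i >= input.length()) { gotoPoint = DONE; break; }
                gotoPoint = PLAIN2;           // "goto plain2;"
                break;
            case PLAIN2:
                consumed++; i++;
                gotoPoint = PLAIN3;           // "goto plain3;"
                break;
            case PLAIN3:
                gotoPoint = PLAIN;            // "goto plain;"
                break;
            case DONE:
                break gotoNext;               // leave the dispatch loop
            }
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(scan("abc")); // 3
    }
}
```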

The first thing I did after finding everything was slow was to try to pinpoint the place where all the performance disappeared. This is actually surprisingly hard. But in this case it turned out to be yylex, which contains the full token scanning logic including my homegrown gotos. Then Charles reminded me about one of those lessons we have learned while developing JRuby – namely that Hotspot really doesn’t like large switch statements. So my first step was to try to untangle the logic of all the gotos.

That was actually not that hard, since the logic bottomed out quite quickly. I managed to replace the switch-based goto-logic with separate methods that called each other instead. That was the only thing I did, and suddenly the scanner wasn’t a problem anymore. These were the numbers after I removed that goto-logic:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12207ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  31280ms

Yes, you’re reading that right. The scanning process became 7 times faster by just removing that switch statement and making it into separate methods. Hotspot really doesn’t like large switch statements, and it does like smaller methods. In this case it also made it easier to find the methods that were hotspots in the scanner too.
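For comparison, here is the same toy scanner with the dispatch loop untangled into small methods that call each other (again invented code, just showing the shape of the refactoring):

```java
// The same toy scanner as a set of small methods calling each other.
// Hotspot can profile and inline small methods individually, which is
// where the speedup in the real scanner came from.
public class MethodScanner {
    private final String input;
    private int i = 0, consumed = 0;

    MethodScanner(String input) { this.input = input; }

    int scan() {
        // The outer loop replaces the "goto plain;" back-edge, so the
        // calls between the methods below stay shallow.
        while (i < input.length()) plain();
        return consumed;
    }

    private void plain()  { plain2(); }           // was "goto plain2;"
    private void plain2() { consumed++; i++; plain3(); }
    private void plain3() { }                     // was "goto plain;" - now just returns

    public static void main(String[] args) {
        System.out.println(new MethodScanner("abc").scan()); // 3
    }
}
```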

After this fix it was obvious that the parsing component was the main problem now. Since the approach of removing large switch statements had worked so well in the scanner, I decided to try the same approach in the parser. The parser generator I used – JACC – happens to be one of those generators that generates only code, no tables. Once upon a time this was probably the right behavior for Java, to get good performance, but it’s definitely not the right choice anymore. So I switched from JACC to Jay, and ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12960ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15411ms

Nice, huh? At this point I could have felt good about myself and stopped. But I decided to see if re2c/re2j was a bottleneck too. The two main methods in the token scanner that are used on basically all calls are “document()” and “plain()”. I rewrote these sections by hand. Re2j generates switch statements that in most cases take more than one jump to get to the right place. By doing some thinking about these algorithms I made the shortest path much shorter in these two methods. After fixing document() I got these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  11226ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  14059ms

And after fixing plain() I got it down to these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   9581ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12838ms

At this point I had a hunch, and decided to check the performance of the implicit scanner. The first thing I did was write tests to check how fast it was, and compare it with JvYAMLb and Syck. My hunch was right: the re2j-based implicit scanner was 10 times slower than the equivalent in Syck. The implicits are checked during scanning and parsing whenever a scalar node is done, so I thought it might be contributing to the slowdown. But instead of rewriting it by hand, I decided to move from re2j to Ragel. Since the implicit scanner is pretty limited, this was a move that worked there, but it would never have worked for the token scanner. Ragel generates finite state machines in tables, and presto – it turned out to work like a charm. After rewriting the implicit scanner I ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   4804ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took   8424ms

And that’s where I decided to stop. I’m not sure what the moral of this tale is, except to generate as much as possible, be careful with switch-statements on hotspot, and rewrite by hand the pieces that really matter.

Also, use the right tool for the job. JACC was obviously not right for me, but Jay seems to be. Re2j is still right for some parts of the token scanner, while Ragel is right for the implicit-scanner. Of course, using the right tool for the job means knowing the alternatives. I knew about Ragel and I know the implementation of both re2j and Ragel, so I could make educated decisions based on their characteristics.



New JRuby YAML support with Yecht


A while back I finally got fed up with all our minor YAML incompatibilities. As I’ve been in charge of the YAML support in JRuby for most of the time, this is something I take personally. I’ve written several YAML processors now, and I decided it was time once and for all to make sure we were totally compatible with MRI.

As it happens, the incompatibilities in JRuby’s YAML support can be divided into two categories. The first category is those things that can’t easily be done with JvYAML since they depend on internals of Syck. More and more of these started cropping up, especially for customizing serialization and loading, but also in how the parsing behavior worked and so on.

The second category is a bit more annoying. These bugs stem from invalid YAML that MRI emits or parses anyway. Syck happens to be a bit loose and forgiving – and it’s also a YAML 1.0 processor. JvYAML started life as a YAML 1.1 processor, and it was pretty strict. During the last year I’ve crippled JvYAML, making it more 1.0 compatible and less strict to bring it closer to Syck. But at the end of the day full Syck compatibility would never be possible from within JvYAMLb.

So I started hacking on Yecht. Two weeks later it is now merged into JRuby trunk. Yecht is a proper port of Syck that matches Syck semantics more or less to the letter – including bugs. Don’t believe me? Just try “YAML::Syck::Map.new(nil, nil, nil).kind” on MRI and “YAML::Yecht::Map.new(nil, nil, nil).kind” on JRuby and see…

As it happens, the story of how I ported Syck is quite interesting, so I will write a separate post about that, focusing on some of the more impressive performance improvements I managed to squeeze out of the parser.

But the short story is this: JRuby’s YAML support is now better than ever, and much more compatible to how MRI does things. All open YAML bugs in JRuby’s bug tracker have been closed, and all tests run as they should.



New Hpricot for JRuby


One of the annoying things for JRuby has been the lack of recent support for Hpricot. Hopefully that will all be in the past now, or at least in a few days’ time. I spent some time hacking the latest version of Hpricot to work on JRuby, and as of now, my fork of it runs all tests like it should.

I first want to talk quickly about Nokogiri. Someone on Twitter mentioned that we should be using Nokogiri instead of Hpricot. That’s all good and well, except Nokogiri depends on libxml2 and libxslt2. I know I have done some crazy porting of C-libraries to Java earlier, but I won’t do those two. Ever. So the only chance Nokogiri will have is if someone reimplements the backend to use native Java XML libraries. That wouldn’t be extremely hard, but it would definitely be more time consuming than fixing Hpricot.

So, that’s why I’ve fixed Hpricot. I’ve sent a pull request to Why, and hopefully the Java stuff will all be in his repository soon. Until then you can build your own version of the gem by cloning http://github.com/olabini/hpricot, and installing that.

For those of you who have used the earlier JRuby-compatible Hpricot versions, I dare say this version is much faster in many ways. I learned a bit about how to write this kind of port when doing Yecht, and I think that shows.



Re2j – a small lexer generator for Java


There is a tool called re2c. It’s pretty neat. Basically it allows you to intersperse a regular expression based grammar in comments inside of C code, and those comments will be transformed into a basic lexer. There are a few things that make re2c different from other similar tools. The first one is that the supported features are pretty limited (which is good). The code generated is fast. The other good part is that you can have several sections in the same source file. The productions for any specific piece of code are constrained to the specific comment.

As it happens, why the lucky stiff used re2c when he made Syck (the C-based YAML processor used in Ruby and many other languages). So when I set out to port Syck to Java, the first problem was to figure out the best way to port the lexers that use re2c. I ended up using Ragel for the implicit scanner, and thought about doing the same for the token scanner, but Ragel is pretty painful to use for more than one main production in the same source file. The syntax is not exactly the same either, so it would add to the burden of porting the scanner if I decided to switch.

At the end of the day the most pragmatic choice was to port the output generator in re2c to generate Java instead. This turned out to be pretty easy, and the result is now used in Yecht, which was merged as the YAML processor for JRuby a few days ago.

You can find re2j in my github repository at http://github.com/olabini/re2j. This is still a C++ program, and it probably won’t compile very well on Windows. But it’s good enough for many small use cases. Everything works exactly as in re2c except for one small difference, namely that you can define a parameter called YYDATA that points to a byte or char buffer to read from. For an example usage, take a look at the token scanner: http://github.com/olabini/yecht/blob/master/src/main/org/yecht/TokenScanner.re.

I haven’t put any compiled binaries out anywhere, and at some point it might be nice to merge this with the proper re2c project so you can give a flag to generate Java instead of C, but for now this is all there is to the project.