JvYAMLb finally released as separate project

So I’ve finally made the time to extract JvYAMLb from JRuby. That means that JvYAML is mildly deprecated. Of course, since JvYAMLb only uses ByteLists it might not be the best solution for everyone.

If you’re interested in downloading the 0.1 release, you can do it at the Google Code download site http://code.google.com/p/jvyamlb/downloads/list.

Ragel performance

I did some performance testing on the old and new Resolver implementation. The testing have some stupid tests that exercise bad parts of both implementations (like longest match, where it can’t be decided what type something is until we have to backtrack about 20 characters). I placed these 24 strings in an array, and pounded on it with an instance of the ResolverImpl that is used in exactly the same way on all scalar values in an YAML document. The objective is to find out if the value is an implicit type or not. So basically, we give it a String, and get back a tag URI. So it’s not like I’m parsing a language or anything. I’m just doing some recognizing here.

The old implementation was based on a Map> where the first letter of the string to resolve was used as an index to find a list of patterns to try sequentially. This worked fine, and made it extensible. But not very fast. This is the baseline. For 24 different strings, iterated 100 000 times for 2 400 000 resolves it takes 7879ms. That’s OK, but not great.

Now, the new Ragel implementation is dead simple. It’s just a translation of the regexps in the aforementioned Pattern’s into a state machine. At EOF out actions (%/ for people in the Ragel knowhow), I execute an action that sets a local variable to a tag, and at the end of the resolve method returns that tag. Dead simple, and not exercising the full strength of Ragel, of course.
So, for the same number of resolves, this ResolverImpl takes 1288ms. That’s 611% improvement in speed. Ain’t it nice to have a friend such as Ragel? And the best part is, for harder tasks, these improvements would be even larger.
Finite State Machines are your friends. All your base are belongs to us.

Results of jvYAMLb

Well, the YAML-based loading is in JRuby trunk. On the way, some parts of the codebase got seriously simplified. Very nice. The final result, with regard to performance, is about 20-30% on speed. But the important gain is in memory usage. The new implementation takes only about one fourth of the memory the original used. So that’s great.

Regarding the Resolver, as I mentioned in the last post, it required a different approach, since regular JvYAML uses regular expressions to recognize implicit tags. Since that approach isn’t good with byte arrays, I decided to use Ragel to generate a recognizer. That approach was very successful. As soon as I got that working it was the obvious approach. Ragel is good. Ragel is great. Ragel is wonderful. I will use the same approach for regular JvYAML to get away from all those Java regexps.

So, next step will be to do the same conversion of the emitter. Of course, at that point performance isn’t that important. It’s more about memory usage and the need to get away from another external dependency in JRuby.

Announcing JvYAMLb, a fork

The conversion to using byte-arrays as the basis of our String work in JRuby has led me to realize that JvYAML just doesn’t cut it anymore. The performance wasn’t good to begin with, and it’s even worse having to convert EVERY SINGLE STRING read into bytes. That’s no good. As an example why something needs to be done I’m going to describe the transformations that happen to data in JRuby if executing this code:

YAML.load_file "gems.yml"

First, the file is opened, and wrapped inside a RandomAccessFile. Then data is read from it by YAML. Reading will proceed like this:
1. Bytes are read through the RAF, hopefully in chunks.
2. Those bytes are wrapped in a RubyString so they can be returned from the IO#read method.
3. An IOReader wraps that RubyIO object, gets the RubyString and converts it from bytes into a String, and this String gets converted into a char array.
4. That char array is returned to the YAML Scanner.
5. The chars from the char array is collected in a StringBuffer, and saved in various Strings as token values.
6. The parser, resolver and constructor work on these Strings in various ways.
7. The JRubyConstructor takes these Strings and creates RubyString objects from them and in the process converting the String back to a byte array.

Is there any doubt that this process is slow? Well, it hasn’t been that big of a problem until now, since we are doing so well on performance in other parts of the system.

So, the radical decision is to rewrite JvYAML, making it more SYCK-compliant, working with InputStreams and byte-arrays, and in the process get away from several of the steps above. So that’s what I’m going to do. I hereby create JvYAMLb. It will only be a part of the JRuby codebase, but it will be reasonably separate, so it can be extracted for other purposes. I will not stop work on regular JvYAML, but will maintain both projects.

Since the objective of this new project is blazing speed, I will post some numbers on this now and again. But first I will show you the speed of the regular system. JvYAML’s Scanner can scan an old gem source index (about 3.5MB) of 435654 tokens in about 1654ms. This is the baseline I’m going to use to test performance, and I’ll post more on this as soon as the byte-based Scanner is ready to try out.