Faster YAML with byte processing

As noted in my last post, I have started work on converting JvYAML into JvYAMLb. Right now I have finished the work on the Scanner and the Parser, and it’s looking quite good. The numbers I reported in the last post for regular JvYAML performance were wrong, though. We’re looking at about 7.8s to 10.0s for scanning that 3.5MB gemspec file (and that’s only the scanning, not file IO). But with the Scanner converted to use bytes and ByteList, the same processing takes 2.8s, roughly three times faster. That’s a substantial difference, and it doesn’t end there.
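To give a feel for what scanning on bytes means, here is a deliberately tiny sketch. It is not the JvYAMLb Scanner: the class name ByteScannerSketch and the separator rule (colon, space and newline) are assumptions made for illustration, and real YAML scanning is far more involved. The point is only that a token can be cut out of the input as a byte[] slice without ever building a String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, heavily simplified scanner: NOT the JvYAMLb implementation.
final class ByteScannerSketch {
    private final byte[] input;
    private int pos = 0;

    ByteScannerSketch(byte[] input) {
        this.input = input;
    }

    // Assumption for this sketch: only ':', ' ' and '\n' separate tokens.
    private static boolean isSeparator(byte b) {
        return b == ':' || b == ' ' || b == '\n';
    }

    // Returns the next run of non-separator bytes, or an empty array at end of input.
    // No String or char[] is created along the way.
    byte[] nextToken() {
        while (pos < input.length && isSeparator(input[pos])) {
            pos++;
        }
        int start = pos;
        while (pos < input.length && !isSeparator(input[pos])) {
            pos++;
        }
        return Arrays.copyOfRange(input, start, pos);
    }

    public static void main(String[] args) {
        byte[] doc = "name: rake\n".getBytes(StandardCharsets.US_ASCII);
        ByteScannerSketch scanner = new ByteScannerSketch(doc);
        byte[] token;
        while ((token = scanner.nextToken()).length > 0) {
            System.out.println(token.length + " bytes");
        }
    }
}
```

The slices are still copies here; a real scanner could go further and hand out offsets into the shared buffer instead.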

As I said, I also converted the Parser. It doesn’t do any String processing at all, so I didn’t expect any speedup or slowdown beyond what came from the Scanner. But… before, parsing the gemspec took 18.515s; after, it runs in 4s. That’s a dramatic speedup, and I don’t really know where it comes from, unless the earlier implementation generated so much more garbage, and used so much more memory, that it showed up in the timings. Anyway, this looks good for JRuby YAML processing, since I expect big reductions in call-path complexity and object generation once the YAML processor is byted all the way through.

But tomorrow it’s time to work on the Resolver, and that’s going to be hard. Optimally, it would be nice to have a byte-based Regexp engine. And maybe that would be something for JRuby too, you know? Our Regular Expressions must be dead slow now that they have to convert to strings all the time.
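Short of a real byte-based Regexp engine, one stopgap is to let java.util.regex read the bytes in place. The sketch below is an assumption-laden workaround, not what JvYAMLb or JRuby does: ByteCharSequence is a hypothetical class, it treats every byte as one character (so it is only sound for ASCII or ISO-8859-1 data), and the integer pattern is a simplified stand-in for the real YAML resolver rules.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical view of a byte[] slice as a CharSequence, so that
// java.util.regex can match against it without a copy into a String.
// Assumption: one byte == one character (ASCII / ISO-8859-1 only).
final class ByteCharSequence implements CharSequence {
    // Simplified stand-in for an implicit-integer resolver pattern.
    private static final Pattern SIMPLE_INT = Pattern.compile("[-+]?[0-9]+");

    private final byte[] bytes;
    private final int start;
    private final int end;

    ByteCharSequence(byte[] bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        // Mask so bytes above 0x7F do not turn into negative values.
        return (char) (bytes[start + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        return new ByteCharSequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        // Only here is a String actually materialised.
        return new String(bytes, start, end - start, StandardCharsets.ISO_8859_1);
    }

    static boolean looksLikeInt(byte[] scalar) {
        return SIMPLE_INT.matcher(new ByteCharSequence(scalar, 0, scalar.length)).matches();
    }

    public static void main(String[] args) {
        byte[] scalar = "12345".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeInt(scalar));
    }
}
```

This avoids the copy but still pays a method call per character, so it is a compromise rather than the byte-native engine wished for above.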

Announcing JvYAMLb, a fork

The conversion to using byte-arrays as the basis of our String work in JRuby has led me to realize that JvYAML just doesn’t cut it anymore. The performance wasn’t good to begin with, and it’s even worse having to convert EVERY SINGLE STRING read into bytes. That’s no good. As an example of why something needs to be done, I’m going to describe the transformations that happen to data in JRuby when executing this code:

YAML.load_file "gems.yml"

First, the file is opened, and wrapped inside a RandomAccessFile. Then data is read from it by YAML. Reading will proceed like this:
1. Bytes are read through the RAF, hopefully in chunks.
2. Those bytes are wrapped in a RubyString so they can be returned from the IO#read method.
3. An IOReader wraps that RubyIO object, gets the RubyString and converts it from bytes into a String, and this String gets converted into a char array.
4. That char array is returned to the YAML Scanner.
5. The chars from the char array are collected in a StringBuffer, and saved in various Strings as token values.
6. The parser, resolver and constructor work on these Strings in various ways.
7. The JRubyConstructor takes these Strings and creates RubyString objects from them, in the process converting each String back to a byte array.
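
The copying in steps 3 through 7 can be sketched in plain Java. This is an illustration only, not JRuby or JvYAML code: the class ConversionChain and both method names are hypothetical, and ISO-8859-1 is assumed as the charset simply because it maps bytes to chars one-to-one.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the data flow described in the list above.
final class ConversionChain {

    // Every statement below allocates and copies the whole payload again.
    static byte[] roundTrip(byte[] raw) {
        // Step 3: bytes -> String -> char[]
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        char[] chars = decoded.toCharArray();

        // Step 5: chars collected in a StringBuffer, then saved as a String
        StringBuffer collected = new StringBuffer();
        for (char c : chars) {
            collected.append(c);
        }
        String tokenValue = collected.toString();

        // Step 7: String -> bytes again for the RubyString
        return tokenValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The byte-based alternative: one slice of the input, one copy.
    static byte[] direct(byte[] raw, int start, int end) {
        return Arrays.copyOfRange(raw, start, end);
    }

    public static void main(String[] args) {
        byte[] raw = "name: rake".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(raw, roundTrip(raw)));
        System.out.println(direct(raw, 6, 10).length);
    }
}
```

Both paths end with the same bytes; the difference is how many intermediate objects are created and thrown away to get there.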

Is there any doubt that this process is slow? Well, it hasn’t been that big of a problem until now, since we are doing so well on performance in other parts of the system.

So, the radical decision is to rewrite JvYAML, making it more SYCK-compliant, working with InputStreams and byte-arrays, and in the process get away from several of the steps above. So that’s what I’m going to do. I hereby create JvYAMLb. It will only be a part of the JRuby codebase, but it will be reasonably separate, so it can be extracted for other purposes. I will not stop work on regular JvYAML, but will maintain both projects.

Since the objective of this new project is blazing speed, I will post some numbers on this now and again. But first I will show you the speed of the regular system. JvYAML’s Scanner can scan an old gem source index (about 3.5MB) of 435654 tokens in about 1654ms. This is the baseline I’m going to use to test performance, and I’ll post more on this as soon as the byte-based Scanner is ready to try out.

Bytes bites. Or maybe not.

Well, the byte arrays are in, for good and evil. We had to wrap them in a counterpart to StringBuffer, but backed by byte[] instead, since all that explicit allocation and deallocation was way unperformant.
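
For readers who have not seen such a thing, a byte[]-backed counterpart to StringBuffer could look roughly like the sketch below. It is not the class used in JRuby; the name GrowableBytes, the initial capacity of 16 and the doubling growth policy are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical growable byte buffer: a StringBuffer look-alike over byte[].
final class GrowableBytes {
    private byte[] data = new byte[16]; // assumed initial capacity
    private int length = 0;

    GrowableBytes append(byte b) {
        ensureCapacity(length + 1);
        data[length++] = b;
        return this;
    }

    GrowableBytes append(byte[] src, int offset, int count) {
        ensureCapacity(length + count);
        System.arraycopy(src, offset, data, length, count);
        length += count;
        return this;
    }

    int length() {
        return length;
    }

    // Returns a copy trimmed to the bytes actually written.
    byte[] toByteArray() {
        return Arrays.copyOf(data, length);
    }

    // Grow by at least doubling, so repeated appends stay cheap on average.
    private void ensureCapacity(int needed) {
        if (needed > data.length) {
            data = Arrays.copyOf(data, Math.max(needed, data.length * 2));
        }
    }

    public static void main(String[] args) {
        byte[] src = "gem source index".getBytes(StandardCharsets.US_ASCII);
        GrowableBytes buf = new GrowableBytes().append(src, 0, src.length).append((byte) '!');
        System.out.println(buf.length());
    }
}
```

Growing the backing array in large steps is what removes the constant explicit allocation: most appends just write into space that is already there.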

Of course, we aren’t seeing any performance benefits from this right now. The problem is that there are still many places that use IRubyObject#toString to get at the contents. That operation is very expensive right now, so gem installs are slower, for example. But we have good hopes of improving the situation, and many parts of the codebase have become much clearer without the need to do String-to-byte[] and byte[]-to-String all over the place.