JRuby Regular Expressions

The regular expression support in JRuby is about to be revamped. Here I will detail my plans for this work, and also some of the reasons behind it. This post is as much for people interested in JRuby as for the JRuby developers.

Stage 0: java.util.regex
JRuby has traditionally used java.util.regex. We stopped doing that on March 11, 2007. The main reason is the disconnect with MRI: some of the operators work very differently, there are some problems with UTF-8, and we can’t support SJIS or EUC, nor POSIX classes. java.util.regex also uses a recursive implementation, which means it can’t handle certain large inputs. Further, we would like to be able to modify the implementation to work with the same underlying data that backs RubyString, to increase performance.
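To make the recursion problem concrete, here is a minimal sketch in plain Java. The class and method names (RecursionDemo, tryMatch) are made up for illustration, and the pattern is just one commonly cited trigger: an alternation inside a repeated group. Whether the large input actually overflows depends on the JVM version and the thread stack size, so treat the second call as an experiment rather than a guaranteed failure.

```java
import java.util.regex.Pattern;

public class RecursionDemo {
    // Hypothetical helper: matches `length` copies of 'a' against (a|b)*
    // and reports "matched", "no match", or "stack overflow".
    static String tryMatch(int length) {
        StringBuilder input = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            input.append('a');
        }
        try {
            // Pattern.matches requires the whole input to match the pattern.
            return Pattern.matches("(a|b)*", input) ? "matched" : "no match";
        } catch (StackOverflowError e) {
            // The matcher recursed once per repetition and ran out of stack.
            return "stack overflow";
        }
    }

    public static void main(String[] args) {
        System.out.println("10 chars: " + tryMatch(10));
        // On many JVMs with default settings this input is large enough
        // to exhaust the stack; the result is environment dependent.
        System.out.println("1000000 chars: " + tryMatch(1_000_000));
    }
}
```

An engine that keeps its own explicit state instead of using the call stack is not limited by input length in this way.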

Stage 1: JRegex
Yesterday (March 11), I merged JRegex as the main regular expression engine for JRuby. There are two main reasons for this. First, we can change the implementation quite easily, and second, it uses an iterative algorithm, which means it doesn’t fail on the input that java.util.regex fails on. Since those failures caused problems with some Rails tests (and also in multipart handling in all libraries using cgi), I decided to merge JRegex as a stopgap until the next incarnation of regex support.

Stage 2: REJ
In about two weeks, I hope to be able to merge REJ with JRuby. At the point of merging, it should be a better choice than both JRegex and java.util.regex. REJ is a project I’ve started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we will be able to have the same quirks as MRI, even though people shouldn’t depend on such quirks. In the process of writing REJ, I will also create a large suite of test cases for regular expressions, based on Henry Spencer’s test files. I’ll probably submit something initial to a separate repository very soon. If I get my wish, REJ will be the regular expression engine for JRuby 1.0.

Stage 3: Oniguruma
After 1.0 has been released, I think it’s time to make the Regexp engine in JRuby really extensible, and provide an interface from Ruby to change which engine to use. After that is done, I would be very interested in doing a port of Oniguruma to Java, which would give us far better multilanguage support, and also some interesting features. I’m choosing not to do this right now because Oniguruma is just too large.
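As a rough illustration of what a swappable engine could look like on the Java side, here is a small sketch. Nothing in it is actual JRuby API: RegexEngine, EngineRegistry and the engine name "jur" are invented for this example, and the only real engine wired in is java.util.regex.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class EngineRegistry {
    // Hypothetical engine contract: true if `pattern` matches anywhere in `input`.
    interface RegexEngine {
        boolean matches(String pattern, String input);
    }

    private final Map<String, RegexEngine> engines = new HashMap<>();
    private RegexEngine current;

    // Makes an engine available under a name.
    void register(String name, RegexEngine engine) {
        engines.put(name, engine);
    }

    // Selects which registered engine handles subsequent matches.
    void use(String name) {
        RegexEngine engine = engines.get(name);
        if (engine == null) {
            throw new IllegalArgumentException("unknown engine: " + name);
        }
        current = engine;
    }

    // Delegates to the currently selected engine.
    boolean matches(String pattern, String input) {
        if (current == null) {
            throw new IllegalStateException("no engine selected");
        }
        return current.matches(pattern, input);
    }

    public static void main(String[] args) {
        EngineRegistry registry = new EngineRegistry();
        // java.util.regex wrapped as one possible engine; find() gives
        // "match anywhere" semantics, similar to Ruby's =~.
        registry.register("jur", (p, s) -> Pattern.compile(p).matcher(s).find());
        registry.use("jur");
        System.out.println(registry.matches("b+", "abbbc")); // prints true
    }
}
```

A Ruby-level switch would then only need to call something like use with the chosen engine name, and a port of another engine would plug in by implementing the same interface.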

Stage 4: (No official name yet)
Another engine that some in the JRuby/Ruby community have started working on will be based on Ragel for parsing, and on a modified Thompson NFA with Tcl-style backreferences for matching. It’s an interesting project, but it will take some time before it’s usable.
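For readers who haven't seen the Thompson approach, the core idea is to track every state the NFA could be in at once, advancing the whole set one input character at a time. The sketch below is my own toy illustration, not code from that project: the NFA for (a|b)*ab is built by hand, there is no pattern parser, and backreferences are not covered at all.

```java
public class NfaSketch {
    // Hand-built NFA for (a|b)*ab with states 0, 1, 2 (state 2 accepts).
    // A set of states is a bitmask: bit i set means state i is active.
    static int step(int states, char c) {
        int next = 0;
        if ((states & 0b001) != 0) {      // state 0 is active
            if (c == 'a') next |= 0b011;  // stay in 0, or guess this 'a' begins the final "ab"
            if (c == 'b') next |= 0b001;  // stay in 0
        }
        if ((states & 0b010) != 0 && c == 'b') {
            next |= 0b100;                // state 1 --b--> state 2
        }
        return next;                      // state 2 has no outgoing transitions
    }

    // Whole-input match: one pass over the input, no recursion, no backtracking.
    static boolean matches(String input) {
        int states = 0b001;               // start with only state 0 active
        for (int i = 0; i < input.length() && states != 0; i++) {
            states = step(states, input.charAt(i));
        }
        return (states & 0b100) != 0;     // accept if state 2 is active at the end
    }

    public static void main(String[] args) {
        System.out.println(matches("abab")); // prints true
        System.out.println(matches("aba"));  // prints false
    }
}
```

Because the work per character is bounded by the number of states, the running time is linear in the input length and the stack depth stays constant, which is exactly the property java.util.regex lacks.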
