Know your Regular Expression anchors


As everyone knows, regular expressions are incredibly important in many programming tasks, so it pays to know some of the particulars of the regexp syntax. One example that bit me a while back was a simple oversight: something I did know but hadn’t kept in mind while writing the bad code. Namely, the way the caret (^) works when used on a String with newlines in it. In Ruby, ^ matches at the start of every line, not just at the start of the string. To be fair, I’ve been using Java regexps for a while, and the problem doesn’t exist there: in Java, ^ only matches at the start of the input unless you ask for multiline behavior.

To illustrate the difference, here is a program you can run in either MRI or JRuby. If running in JRuby you’ll see that the Java version needs the flag MULTILINE to behave as Ruby does by default.

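str = "one\nover\nyou"

# ^ matches at the start of every line, so this prints [0, 1] and [4, 5]
puts "Match with ^"
str.gsub(/^o/) do |e|
  p $~.offset(0)
  e
end

# \A matches only at the start of the string, so this prints just [0, 1]
puts "Match with \\A"
str.gsub(/\Ao/) do |e|
  p $~.offset(0)
  e
end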


if defined?(JRUBY_VERSION)
  require 'java'
  regexp = java.util.regex.Pattern.compile("^o", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with ^"
  while matcher.find()
    p matcher
  end

  regexp = java.util.regex.Pattern.compile("\\Ao", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with \\A"
  while matcher.find()
    p matcher
  end
end

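So, what’s the lesson here? Don’t use caret (^) and dollar ($) if you actually want to match the beginning or the end of the string. Instead, use \A and \z. That’s what they’re there for. (\Z also anchors at the end, but it additionally matches just before a trailing newline, so use \z when you mean the very end.)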
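To make the difference concrete, here is a small sketch of the kind of input check where the wrong anchors go unnoticed. The constant names LINE_ANCHORED and STRING_ANCHORED are made up for this illustration, and Regexp#match? assumes Ruby 2.4 or later (use =~ on older versions).

```ruby
# Hypothetical names, for illustration only.
LINE_ANCHORED   = /^\d+$/    # ^ and $ match at every line boundary in Ruby
STRING_ANCHORED = /\A\d+\z/  # \A and \z match only at the very start and end

input = "123\nnot a number"

# The first line alone satisfies the line-anchored pattern.
p LINE_ANCHORED.match?(input)    # => true
# The whole string has to consist of digits here.
p STRING_ANCHORED.match?(input)  # => false

# \Z tolerates a single trailing newline; \z does not.
p(/\A\d+\Z/.match?("123\n"))     # => true
p(/\A\d+\z/.match?("123\n"))     # => false
```

If the pattern is guarding something like a numeric ID, the line-anchored version accepts input with extra lines smuggled in after the first one, which is rarely what you want.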



Back to JRuby regular expressions


It seems that this issue comes up every third month. After all the work we have done, we realize that regular expressions need some real work again. Our current solution works quite well. We have imported JRegex into JRuby and made a whole slew of modifications to it. It runs well and has no issues with regular expressions that recurse too deeply (Java’s engine uses a recursive algorithm, which makes it overflow the stack for certain inputs. Certain very common inputs, in say … Rails. *sigh*).

But JRegex is good. It’s not perfect though. It’s slightly slower than the Java engine, it doesn’t support everything in the Java engine, and conversely, it supports some things that Java doesn’t support. The major problem is that we don’t have MRI compliant multibyte support, and the implementation of our engine is wildly different compared to MRI’s engine, and Oniguruma.

At some point we will probably just bite the bullet and do a real port of Oniguruma. But until that time comes, I have extracted our current regular expression stuff and put everything behind a common interface. What that means is that with the current trunk, you can actually choose which regular expression engine you want to use. You can even write your own and plug it in. The interface is really small right now. At the moment we only have JRegex and Java, and the Java engine doesn’t pass all tests (I think; I haven’t tried, since that wasn’t the point of this exercise). Anyway, it means you can have Java regular expressions if you want them, right in your JRuby code. But only where you want them. So, you can control which engine is used globally by doing one of these two:

jruby -J-Djruby.regexp=java your_ruby_script.rb
jruby -J-Djruby.regexp=jregex your_ruby_script.rb

The last one is currently the default, so it’s not needed. In the future JRegex may no longer be the default, but this option should still be there. The nicer thing about this is that you can also use Java regexps inline, even if you want to use JRegex for most expressions:

begin
p(/\p{javaLowerCase}*/ =~ "abc")
p $&
rescue => e
p e
end

p(/\p{javaLowerCase}*/j =~ "abc")
p $&

Now, the first example will actually raise a RegexpError, because javaLowerCase is not a valid character class in JRegex. But note the small “j” I’ve added to the second Regexp literal! That expression works and will match exactly as you would expect.



JRuby Regular Expressions


The Regular Expression support in JRuby is about to be revamped. I will here detail my plans for this work, and also some of the reasons for it. This post is as much for people interested in JRuby, as for the JRuby developers.

Stage 0: java.util.regex
JRuby has traditionally used java.util.regex. We stopped doing that on March 11th, 2007. The main reason is the disconnect with MRI. Some of the operators work very differently, there are some problems with UTF-8, and we can’t support SJIS, EUC, or POSIX classes. java.util.regex also uses a recursive implementation, which means it can’t handle certain large inputs. Further, we would like to be able to modify the implementation to work with the same stuff that backs the RubyString, to increase performance.

Stage 1: JRegex
Yesterday (March 11th), I merged JRegex as the main regular expression engine for JRuby. The reasons for this are twofold. First, we can change the implementation quite easily, and second, it uses an iterative algorithm, which means it doesn’t fail on the input that java.util.regex does. Since those failures caused problems with some Rails tests (and also in multipart handling in all libraries using cgi), I decided to merge this as a stopgap until the next incarnation of regex support.

Stage 2: REJ
In about 2 weeks, I hope to be able to merge REJ with JRuby. At the point of merging, it should be a better replacement than both JRegex and java.util.regex. REJ is a project I’ve started, which will be a direct port of the MRI 1.8.6 regular expression engine. The important thing about this is that the semantics for JRuby will match MRI very closely. We will be able to match UTF-8, SJIS and EUC regular expressions, and we will be able to have the same quirks as MRI, even though people shouldn’t depend on such quirks. In the process of writing REJ, I will also create a large suite of test cases for regular expressions, based on Henry Spencer’s test files. I’ll probably submit something initial to a separate repository very soon. If I get my wish, REJ will be the regular expression engine for JRuby 1.0.

Stage 3: Ojiguruma
After 1.0 has been released, I think it’s time to make the Regexp engine in JRuby really extensible, and provide an interface from Ruby to change which engine to use. After that is done, I would be very interested in doing a port of Oniguruma to Java, which would give us far better multilanguage support, and also some interesting features. The reason I’m choosing to not do this right now is because Oniguruma is just too large.

Stage 4: (No official name yet)
Another engine that some in the JRuby/Ruby community have started working on will be based on Ragel for parsing, and on a modified version of Thompson NFA with TCL-style backreferences for matching. It’s an interesting project, but it will take some time before it’s usable.