Regular expressions show up in a great many programming tasks, so it pays to know the particulars of the regexp syntax. One detail that bit me a while back was a simple oversight: something I knew but hadn't kept in mind while writing the bad code. Namely, the way the caret (^) works in Ruby when the String being matched contains newlines: it matches at the start of every line, not just at the start of the String. To be fair, I've been using Java regexps for a while, and there the caret matches only at the very beginning of the input unless you ask for something else.
To illustrate the difference, here is a program you can run in either MRI or JRuby. If you run it in JRuby you'll also see that the Java version needs the MULTILINE flag to behave the way Ruby does by default.
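str = "one\nover\nyou"

# In Ruby, ^ matches at the start of every line.
puts "Match with ^"
str.gsub(/^o/) do |e|
  p $~.offset(0)   # prints [0, 1] and then [4, 5]
  e
end

# \A matches only at the start of the whole String.
puts "Match with \\A"
str.gsub(/\Ao/) do |e|
  p $~.offset(0)   # prints [0, 1] only
  e
end

if defined?(JRUBY_VERSION)
  require 'java'

  # Java needs MULTILINE before ^ matches at every line start.
  regexp = java.util.regex.Pattern.compile("^o", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with ^"
  while matcher.find()
    p matcher
  end

  # \A ignores MULTILINE and still matches only at the start of the input.
  regexp = java.util.regex.Pattern.compile("\\Ao", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with \\A"
  while matcher.find()
    p matcher
  end
end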
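So, what’s the lesson here? Don’t use caret (^) and dollar ($) if you actually want to match the beginning or the end of the string. In Ruby they match at the beginning and end of each line. Instead, use \A and \Z. That’s what they’re there for. One thing to keep in mind: \Z still matches just before a final trailing newline, so if you need the absolute end of the string, use the lowercase \z.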
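To make the lesson concrete, here is a minimal sketch of how the wrong anchors can let unexpected input through a validation check. The input strings and variable names are made up for illustration, and it assumes a Ruby recent enough to have Regexp#match? (2.4 or later).

```ruby
# Hypothetical input: a valid-looking first line followed by a second line.
input = "admin\n<script>"

# ^ and $ anchor at line boundaries, so the first line alone satisfies the pattern.
line_anchored = /^[a-z]+$/

# \A and \z anchor at the very start and the very end of the whole string.
string_anchored = /\A[a-z]+\z/

line_match   = line_anchored.match?(input)    # true: "admin" is a complete line
string_match = string_anchored.match?(input)  # false: the string does not end after "admin"

# \Z tolerates a single trailing newline; \z does not.
big_z_match   = /\A[a-z]+\Z/.match?("admin\n")  # true
small_z_match = /\A[a-z]+\z/.match?("admin\n")  # false

puts "^...$   on two-line input:   #{line_match}"
puts "\\A...\\z on two-line input:   #{string_match}"
puts "\\A...\\Z with trailing \\n:   #{big_z_match}"
puts "\\A...\\z with trailing \\n:   #{small_z_match}"
```

Whether you want \Z or \z depends on whether a trailing newline (say, from gets) should count as part of a valid value; for strict checks \z is the safer choice.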
2 Comments
Whoa, thanks for the heads up. I wrote in Python for years before learning Ruby, and just assumed they both used the same MULTILINE default.
It’s gonna be pretty hard to break the “\A == ^” mindset.
It’s very rare that a blog post makes me go back and scan every file I’ve written in the past two years. Time to get pickier with the unit tests.
October 3rd, 2007
+1 mike – I also never knew there was any difference. Patooey.
October 3rd, 2007