Know your Regular Expression anchors


As everyone knows, regular expressions are incredibly important in many programming tasks, so it pays to know some of the particulars of the regexp syntax. One example that bit me a while back was a simple oversight, something I did know but hadn’t kept in mind while writing the bad code: in Ruby, the caret (^) matches at the start of every line in a String that contains newlines, not just at the start of the String. To be fair, I’ve been using Java regexps for a while, and that problem doesn’t exist there by default: unless you ask for it, ^ in Java matches only at the start of the input.

To illustrate the difference, here is a program you can run in either MRI or JRuby. If running in JRuby you’ll see that the Java version needs the flag MULTILINE to behave as Ruby does by default.

str = "one\nover\nyou"
puts "Match with ^"
str.gsub(/^o/) do |e|
  p $~.offset(0)
  e
end

puts "Match with \\A"
str.gsub(/\Ao/) do |e|
  p $~.offset(0)
  e
end

if defined?(JRUBY_VERSION)
  require 'java'
  regexp = java.util.regex.Pattern.compile("^o", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with ^"
  while matcher.find()
    p matcher
  end

  regexp = java.util.regex.Pattern.compile("\\Ao", java.util.regex.Pattern::MULTILINE)
  matcher = regexp.matcher(str)
  puts "Java match with \\A"
  while matcher.find()
    p matcher
  end
end
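
If you only want to see the Ruby side of the difference at a glance, here is a minimal sketch that collects the start offset of every match. The helper name match_starts is my own invention for illustration and is not part of the program above.

```ruby
# Collect the start offset of every match of pattern in str.
# match_starts is a hypothetical helper used only for this illustration.
def match_starts(str, pattern)
  starts = []
  # scan with a block sets $~ (Regexp.last_match) for each match
  str.scan(pattern) { starts << Regexp.last_match.begin(0) }
  starts
end

str = "one\nover\nyou"

caret_starts  = match_starts(str, /^o/)   # ^ matches at the start of every line
anchor_starts = match_starts(str, /\Ao/)  # \A matches only at the start of the string

p caret_starts   # => [0, 4]
p anchor_starts  # => [0]
```

The caret finds the "o" in "one" and the "o" in "over", while \A only ever finds the first one.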

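So, what’s the lesson here? Don’t use caret (^) and dollar ($) if you actually want to match the beginning or the end of the string, because in Ruby they match at the beginning and end of every line. Instead, use \A and \z. Be careful with the uppercase \Z: it also matches just before a trailing newline at the end of the string, so \z is the one to reach for when you mean the absolute end. That’s what they’re there for.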
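
Where this tends to hurt is input validation. The following is a rough sketch rather than a recipe: the input value and the digits-only rule are made up for illustration, and the variable names are my own.

```ruby
# Hypothetical input that should have been digits only.
input = "42\nrm -rf /"

# Line anchors: satisfied by the first line alone.
line_hit = (input =~ /^\d+$/)

# String anchors: the whole string must be digits.
string_hit = (input =~ /\A\d+\z/)

p line_hit    # => 0   (matched, so the check passes by mistake)
p string_hit  # => nil (no match, so the check rejects the input)

# \Z tolerates a single trailing newline, \z does not.
upper_z = ("42\n" =~ /\A\d+\Z/)
lower_z = ("42\n" =~ /\A\d+\z/)

p upper_z  # => 0
p lower_z  # => nil
```

Whether a trailing newline should be accepted depends on your data, which is why the choice between \Z and \z is worth making deliberately.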


2 Comments

  1. Mike Owens

    Whoa, thanks for the heads up. I wrote in Python for years before learning Ruby, and just assumed they both used the same MULTILINE default.

    It’s gonna be pretty hard to break the “\A == ^” mindset.

    It’s very rare that a blog post makes me go back and scan every file I’ve written in the past two years. Time to get pickier with the unit tests.

    October 3rd, 2007

  2. Dr Nic

    +1 mike – I also never knew there was any difference. Patooey.

    October 3rd, 2007
