New Hpricot for JRuby


One of the annoying things about JRuby has been the lack of recent support for Hpricot. Hopefully that will all be in the past now, or at least in a few days’ time. I spent some time hacking the latest version of Hpricot to work on JRuby, and as of now, my fork of it passes all tests like it should.

I first want to talk quickly about Nokogiri. Someone on Twitter mentioned that we should be using Nokogiri instead of Hpricot. That’s all well and good, except Nokogiri depends on libxml2 and libxslt. I know I have done some crazy porting of C libraries to Java earlier, but I won’t do those two. Ever. So the only chance Nokogiri will have is if someone reimplements the backend to use native Java XML libraries. That wouldn’t be extremely hard, but it would definitely be more time-consuming than fixing Hpricot.

So, that’s why I’ve fixed Hpricot. I’ve sent a pull request to Why, and hopefully the Java stuff will all be in his repository soon. Until then you can build your own version of the gem by cloning http://github.com/olabini/hpricot and installing from that checkout.

For those of you who have used the earlier JRuby-compatible Hpricot versions, I dare say this version is much faster in many ways. I learned a bit about how to write these kinds of ports when doing Yecht, and I think that shows.



Hpricot goodness


This is just so cool, I cannot contain it. For those of you who haven’t heard about Hpricot, it is one of why the lucky stiff’s incredibly cool tools (which he probably will use to take over the world any day now…). It’s HTML parsing goodness, very flexible, with the goal of being able to parse (and fix) everything that Firefox handles.

“So what?” you’re probably asking… Well, Hpricot uses Ragel and some C code to achieve blinding speed. This means JRuby can’t run it. Or I should say couldn’t run it:


orpheus:~/workspace/jruby> jruby bin/gem install hpricot --source http://code.whytheluckystiff.net
Bulk updating Gem source index for: http://code.whytheluckystiff.net
Select which gem to install for your platform (java)
1. hpricot 0.5.110 (jruby)
2. hpricot 0.5.110 (mswin32)
3. hpricot 0.5.110 (ruby)
4. hpricot 0.5 (ruby)
5. hpricot 0.5 (mswin32)
6. hpricot 0.5.0 (ruby)
7. hpricot 0.5.0 (mswin32)
8. hpricot 0.4.99 (ruby)
9. hpricot 0.4.99 (mswin32)
10. hpricot 0.4.92 (ruby)
11. hpricot 0.4.92 (mswin32)
12. Skip this gem
13. Cancel installation
> 1
Successfully installed hpricot-0.5.110-jruby
Installing ri documentation for hpricot-0.5.110-jruby...
Installing RDoc documentation for hpricot-0.5.110-jruby...

That’s right, Hpricot is now more promiscuous than any other gem with native parts.
What can you do with it? Well, I’m just going to point you to _why’s own description of it. All he says at http://code.whytheluckystiff.net/hpricot/ will work fine in JRuby!

How did this come to be? Well, _why and I did some joint hacking, which was helped along by the fact that Adrian Thurston (the genius behind Ragel) recently added Java support to it. So, basically, most of the Ragel definition is exactly the same for both the C and the Java versions. The native code has been factored out, and both versions are buildable with rake from _why’s code repository.
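
To make the split a bit more concrete: one shared parser definition plus a thin native part per platform means something has to pick the right native part for the interpreter that is running. Here is a minimal sketch of what such a selection could look like in Ruby. The method name native_backend and the descriptions in the table are made up for illustration; they are not taken from Hpricot’s actual sources or build.

```ruby
# Hypothetical sketch: pick the native part that matches the running
# interpreter. The names below are illustrative, not Hpricot's real layout.
NATIVE_BACKENDS = {
  java: "Java scanner generated by Ragel",
  c:    "C scanner generated by Ragel"
}.freeze

# JRuby reports "java" as its RUBY_PLATFORM. Every other platform string
# is treated here as the C extension case (an assumption of this sketch).
def native_backend(platform = RUBY_PLATFORM)
  platform.include?("java") ? :java : :c
end

puts "#{RUBY_PLATFORM}: #{NATIVE_BACKENDS.fetch(native_backend)}"
```

The Ruby-level API stays the same either way; only the piece underneath changes, which fits with the gem installer above offering a separate jruby variant next to the ruby and mswin32 ones.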

This is important. Don’t think anything else. This strategy can, and will, be used for other gems with native parts. It’s just a question of time.