Language generation


This post is the first in a series of posts about PAIPr. Read here for more info about the concept.

Today I would like to start with taking a look at Chapter 2. You can find the code in lib/ch02 in the repository.

Chapter 2 introduces Common Lisp through the creation of a series of ways of doing generation of language sentences in English, based on simple grammars. It’s an interesting chapter to start with since the code is simple and makes it easy to compare the Ruby and Common Lisp versions.

The first piece to take a look at is the file common.rb, which contains two methods we’ll need later on:

require 'pp'

def one_of(set)
  [set.random_elt]
end

class Array
  def random_elt
    self[rand(self.length)]
  end
end

As you can see I’ve also required pp, to make it easier to print structures later on.

Both one_of and Array.random_elt are extremely simple methods, but it’s still nice to have the abstraction there. I’m retaining the naming from the book for these two methods.

The first real example defines a grammar by directly using methods. (From simple.rb):

require 'common'

def sentence; noun_phrase + verb_phrase; end
def noun_phrase; article + noun; end
def verb_phrase; verb + noun_phrase; end
def article; one_of %w(the a); end
def noun; one_of %w(man ball woman table); end
def verb; one_of %w(hit took saw liked); end

As you can see, all the methods just define their structure by combining the result of more basic methods. A noun phrase is an article, then a noun. An article is either ‘the’ or ‘a’, and a noun can be ‘man’, ‘ball’, ‘woman’ or ‘table’. If you run sentence a few times you will see that you sometimes get back quite sensible sentences, like [“a”, “ball”, “hit”, “the”, “table”]. But you will also get less interesting things, such as [“a”, “ball”, “hit”, “a”, “ball”]. At this stage the space for variation is quite limited, but you can still see a simplified structure of the English language in these methods.

To create an example that involves some more interesting structures, we can introduce adjectives and prepositions. Since these can be repeated zero times, or many times, we’ll use a production called PP* and Adj* (pp_star and adj_star in the code). This is from simple2.rb:

require 'simple'

def adj_star
  return [] if rand(2) == 0
  adj + adj_star
end

def pp_star
  return [] if rand(2) == 0
  pp + pp_star
end

def noun_phrase; article + adj_star + noun + pp_star; end
def pp; prep + noun_phrase; end
def adj; one_of %w(big little blue green adiabatic); end
def prep; one_of %w(to in by with on); end

Nothing really changes here, except that in both the optional productions we randomly return an empty array 50% of the time. They then call themselves recursively. The noun phrase production also changes a bit, and adj and prep gives us the two new terminals needed. If you try this one, you might get some more interesting results, such as: [“a”, “table”, “took”, “a”, “big”, “adiabatic”, “man”]. It’s still nonsensical of course. And it seems that this approach with randomness generates quite large output in some cases. To make it really nice there should probably be a diminishing bias in the adjectives and prepositions based on the length of the already generated string.

Another problem with this approach is that it’s kinda unwieldy. Using methods for the grammar is probably not the right choice long term. More specifically, we are tied to this implementation by having the grammar be represented as methods.

A viable alternative is to represent everything as a grammar definition – using a rule based solution. The first part of rule_based.rb looks like this:

require 'common'

# A grammar for a trivial subset of English
$simple_grammar = {
  :sentence => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :Noun]],
  :verb_phrase => [[:Verb, :noun_phrase]],
  :Article => %w(the a),
  :Noun => %w(man ball woman table),
  :Verb => %w(hit took saw liked)}

# The grammar used by generate. Initially this is $simple_grammar, but
# we can switch to other grammars
$grammar = $simple_grammar

Note that I’m using double arrays for the productions that aren’t terminal. There is a reason for this that will be more pronounced in the later grammars based on this. But right now it’s easy to see that a production is either a list of words, or a list of list of productions. Production names beginning with a capital is a terminal – this is a convention in most grammars. I didn’t use capital letters for the terminals when using methods because Ruby methods named like that causes additional trouble when calling them.

Now that we have the actual grammar we also need a helper method. PAIP defines rule-lhs, rule-rhs and rewrites, but the only one we actually need here is rewrites. (From rule_based.rb):

def rewrites(category)
  $grammar[category]
end

And actually, we could do away with it too, but it reads better than an index access.

The final thing we need is the method that actually creates a sentence from the grammar. It looks like this:

def generate(phrase)
  case phrase
  when Array
    phrase.inject([]) { |sum, elt|  sum + generate(elt) }
  when Symbol
    generate(rewrites(phrase).random_elt)
  else
    [phrase]
  end
end

If what we’re asked to generate is an array, we generate everything inside of that array, and combine them. If it’s a symbol we know it’s a production, so we get all the possible rewrites and take a random element from it. Currently every production have one rewrite, so the random_elt isn’t strictly necessary – but as you’ll see later it’s quite nice. And finally, if phrase is not an Array or Symbol, we just return the phrase as the generated element.

I especially like the use of inject as a more general version of (mappend #’generate phrase). Of course, for readability it would have been possible to implement mappend too:

def mappend(sym, list)
  list.inject([]) do |sum, elt|
    sum + self.send(sym, elt)
  end
end

But I choose to use inject directly instead, since it’s more idiomatic. Note that this version of mappend doesn’t work exactly the same as Common Lisp mappend, since it doesn’t allow a lambda.

Getting back to the generate method. If you were to run generate(:sentence), you would get the same kind of output as with the method based version – with the difference that changing the rules is much simpler now.

So for example, you can use this code from bigger_grammar.rb, which creates a larger grammar definition and then sets the default grammar to use it:

require 'rule_based'

$bigger_grammar = {
  :sentence => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :'Adj*', :Noun, :'PP*'], [:Name],
                   [:Pronoun]],
  :verb_phrase => [[:Verb, :noun_phrase, :'PP*']],
  :'PP*' => [[], [:PP, :'PP*']],
  :'Adj*' => [[], [:Adj, :'Adj*']],
  :PP => [[:Prep, :noun_phrase]],
  :Prep => %w(to in by with on),
  :Adj => %w(big little blue green adiabatic),
  :Article => %w(the a),
  :Name => %w(Pat Kim Lee Terry Robin),
  :Noun => %w(man ball woman table),
  :Verb => %w(hit took saw liked),
  :Pronoun => %w(he she it these those that)}

$grammar = $bigger_grammar

This grammar includes some more elements that make the output a bit better. For example, we have names here, and also pronouns. One of the reasons this grammar is easier to use is because we can define different versions of the productions. So for example, a noun phrase can be the same as we defined earlier, but it can also be a single name, or a single pronoun. We use this to handle the recursive PP* and Adj* productions. You can also see why the productions are defined with an array inside an array. This is to allow choices in this grammar.

A typical sentence from this grammar (calling generate(:sentence)) gives [“Terry”, “saw”, “that”], or [“Lee”, “took”, “the”, “blue”, “big”, “woman”].

So it’s easier to change these rules. Also believe that it’s easier to read, and understand the rules here. But one of the more important changes with the data driven approach is that you can use the same rules for different purposes. Say that we want to generate a sentence tree, which includes the name of the production used for that part of the tree. That’s as simple as defining a new generate method (in generate_tree.rb):

require 'bigger_grammar'

def generate_tree(phrase)
  case phrase
  when Array
    phrase.map { |elt| generate_tree(elt) }
  when Symbol
    [phrase] + generate_tree(rewrites(phrase).random_elt)
  else
    [phrase]
  end
end

This code follows the same pattern as generate, with a few small changes. You can see that instead of appending the results from the Array together, we instead just map every element. This is because we need more sub arrays to create a three. In the same manner when we get a symbol we prepend that to the array generated. And actually, at this point it’s kinda interesting to take a look at the Lisp version of this code:

(defun generate-tree (phrase)
  (cond ((listp phrase)
         (mapcar #'generate-tree phrase))
        ((rewrites phrase)
         (cons phrase
               (generate-tree (random-elt (rewrites phrase)))))
        t (list phrase)))

As you can see, the structure is mostly the same. I made a few different choices in representation, which means I’m checking if the phrase is a symbol instead of seeing if the rewrites for a symbol is non-nil. The call to mapcar is equivalent to the Ruby map call.

What does it generate then? Calling it with “pp generate_tree(:sentence)” I get something like this:

[:sentence,
 [:noun_phrase, [:Name, "Lee"]],
 [:verb_phrase,
  [:Verb, "saw"],
  [:noun_phrase,
   [:Article, "the"],
   [:"Adj*",
    [:Adj, "green"],
    [:"Adj*"]],
   [:Noun, "table"],
   [:"PP*"]],
  [:"PP*"]]]

which maps neatly back to our grammar. We can also generate all possible sentences for a grammar without recursion, using the same data driven approach.

The code for that can be found in generate_all.rb:

require 'rule_based'

def generate_all(phrase)
  case phrase
  when []
    [[]]
  when Array
    combine_all(generate_all(phrase[0]),
                generate_all(phrase[1..-1]))
  when Symbol
    rewrites(phrase).inject([]) { |sum, elt|  sum + generate_all(elt) }
  else
    [[phrase]]
  end
end

def combine_all(xlist, ylist)
  ylist.inject([]) do |sum, y|
    sum + xlist.map { |x| x+y }
  end
end

If you run generate(:sentence) you will get back a list of all 256 possible sentences from this simple grammar. In this case the algorithm is a bit more complicated. It’s also using the common Lisp idiom of working on the first element of a list and then recur on the rest of it. This makes it possible to combine everything together. I assume that it should be possible to devise something suitably clever based on the new Array#permutations or possible Enumerable#group_by or zip.

It’s interesting how well the usage of mappend and mapcar maps to uses of inject and map in this code.

Note that I’ve been using globals for the grammars in this implementation. An alternative that is probably better is to pass along an optional parameter to the methods. If no grammar is supplied, just use the default constant instead.

Anyway, the code for this chapter is in the repository. Play around with it and see if you can find anything interesting. This code is definitely an introduction to Lisp, more than a serious AI program – although it does show the kind of approaches that have been used for primitive code generation.

The next chapter will talk about the General Problem Solver. Until then.



Simplified finders in DataMapper


It seems I’m spending lots of time reading about DataMapper currently, while planning the first refactoring of Ribs. While reading I keep seeing Sam’s examples of simplified finders compared to the ActiveRecord versions. A typical example of the kind of thing I’m talking about is something like this with ActiveRecord:

Exibition.find(:all, :conditions => ["run_time > ? AND run_time < ?", 2, 5])

Which with DataMapper would be:

Exibition.all(:run_time.gt => 2, :run_time.lt => 5)

Oh, and yeah, it the typo is directly from the DataMapper documentation. So what’s wrong here? Well, in most cases it probably does exactly what you want it too. And that’s the problem in itself. If you use these simplified finders with more than one argument, you will get subpar SQL queries in some cases. Simply, if you don’t control the order of the clauses separated by AND, you might get queries that perform substantially worse than they should. Of course, that doesn’t happen often, but it’s important to keep it in mind.

And incidentally, I really do hate the methods added to Symbol.



Paradigms of Artificial Intelligence Programming (in Ruby)


Since I never can get enough projects, I’ve decided to start on something I’ve thought about doing for a long time. There are several reasons for doing it, but foremost among them is that I want to get back to playing with AI, and I want to have a project with many small pieces that I can do when I have some time over. If it ends up being educational or useful for others at the same time, well, I’m not complaining!

So what is it?

Well, first I’d like to introduce a book. It’s called Paradigms of Artificial Intelligence Programming or PAIP for short. It’s written by Peter Norvig, who’s also written several other books about AI. Currently he’s the Director of Research at Google. PAIP is probably both my favorite book about AI, and my favorite book about Common Lisp. It’s a really great book. Really. If you have any interest in one of those two subjects you should get hold of it. PAIP doesn’t cover cutting edge AI, but rather takes the historic view and looks at several examples from different eras, going from the first programs to some later, quite advanced things.

I’ve read it numerous times, went through the code and tweaked it and so on. It’s lots of fun. But that was a few years ago. So basically, what I want to do is to go through the book again. But this time I’m going to write all the programs in Ruby – converting them from Common Lisp and then maybe tweak them a bit to make them more idiomatic. And I’m planning to post it here. Or rather, I’m going to post the actual source code to http://github.com/olabini/paipr. I’m going to blog about all the code I write. You don’t necessarily have to have the book, since I’m going to surround the code with some descriptions and explanations.

Once again then, why should anyone care? Well, I don’t know. No one might care. But for me personally it’s going to be an interesting experience converting idiomatic Common Lisp into idiomatic Ruby. It’s going to be fun to revisit the old approaches to AI. And it might serve as a good, code heavy introduction to the subject for anyone interested in it.

I do have the permission from Peter Norvig to do this. The Ruby code I write is covered by the MIT license, while any Lisp code posted as part of this exercise is covered by the license here: http://norvig.com/license.html.

Also note that I sometimes won’t write the most obvious Ruby – it will be good to have a few links to the original Lisp code.



Evil hook methods?


I have come to realize that there are a few hook methods I really don’t like in Ruby. Or actually, it’s not the hook methods I have a problem with – it’s the way much code is written using them. The specific hooks that seems to cause the most trouble for me is included, extended, append_features and extend_features. Let me first reiterate – I don’t dislike the methods per se. The power they give the language is incredible and should not be underestimated. What I dislike is the way it makes things un-obvious when reading code that depends on them.

Let’s take a silly example:

module Ruby;end

module Slippers
  def self.included(other)
    other.send :include, Ruby
  end
end

class Judy
  include Slippers
end

p Judy.included_modules

Since all this code is in the same place, you can see what will happen when someone include Slippers. And really, in this case the side effect isn’t entirely dire. But imagine that this was part of a slightly larger code base. Like for example Rails. And the modules were spread over the code base. And the included hook did a few more things with your class. No way of knowing what – except reading the code – and the Ruby idiom is that include will add some methods and constants to your class and that is it. Anything else is going outside what the core message of that statement is.

One of the most common things you see with the included hook is something like this:

module Slippers
  module ClassMethods
  end

  def self.included(other)
    other.send :extend, ClassMethods
  end
end

class Judy
  include Slippers
end

This will add some class methods to the class that includes this module. DataMapper does this in the public API, for example. It’s very neat, you only have to include one thing and you get stuff on both your instances and your classes. Except that’s not what include does. Not really. So say you’re debugging someone’s code and happen upon an include statement. You generally don’t check what it’s doing until you’ve exhausted most other options.

So what’s wrong with having a public API like this?

module Slippers
  module ClassMethods;end
end

class Judy
  include Slippers
  extend Slippers::ClassMethods
end

where you explicitly include the Slipper module and then extend the class methods. This is more obvious code, it doesn’t hide anything behind your expectations, and it also might give me the possibility to choose. What if I want most of the DataMapper instance methods, but really don’t want to have finders on my class? Maybe I want to have a repository pattern? In that case I’ll have to explicitly remove all class methods introduced by that include, because there is no way of choosing if I want to have the class methods or not.

So that’s another benefit of dividing the extending out from the included hook. And finally, what about all those other things that people do in those hooks? Well, you don’t really need it. Make it part of the public API too! Instead of this:

module Slippers
  def self.included(other)
    do_funky_madness_on other
  end
end

class Judy
  include Slippers
end

make it explicit, like this:

module Slippers;end

class Judy
  include Slippers
  Slippers.do_funky_madness_on self
end

This is really just good design. It makes the functionality explicit, it makes it possible for the user to choose what he wants without doing monkey patching. And it makes the code easier to read. Yeah, I know, this will mean more lines of code. Booo hooo! I know that Ruby people are generally obsessed with making their libraries as easy to use as possible, but it feels like it sometimes goes totally absurd and people stop thinking about readability, maintainability and all those other things. And really, Ruby is such a good language that a few more lines of code like this still won’t make a huge impact on the total lines of code.

I’m not saying I haven’t done this, of course. But hopefully I’ll get better at it. And I’m not saying not to use these methods at all – I’m just saying that you should use them with caution and taste.



RSA parameters in OpenSSL, Ruby and Java


I would just like to publish this information somewhere, so that Google can help people find it easier than I did.  If you have ever wondered how the internal OpenSSL RSA parameters map to the Java parameters on RSAPrivateCrtKey, this little table will probably help you a bit. There are three different names in motion here. The first one is the internal field names in OpenSSL. These are also used as method names in Ruby. The second name is what gets presented when you use something like to_text on an RSA key. The third name is what it’s called in Java.

  • n == modulus == modulus
  • e == public exponent == publicExponent
  • d == private exponent == privateExponent
  • p == prime1 == primeP
  • q == prime2 == primeQ
  • dmp1 == exponent1 == primeExponentP
  • dmq1 == exponent2 == primeExponentQ
  • iqmp == coefficient == crtCoefficient


Ruby HTTPS web calls


As I noted in my entry on Ruby security, VERIFY_NONE is used all over the place. And what I realized when I tried to use VERIFY_PEER was that it really doesn’t work for net/https, and doesn’t seem to ever have worked for me. I got a bit mystified by this since I couldn’t find much mention about it online. And then Victor Grey came to the rescue in one of the comments. The solution is to not use net/https at all, but instead use the httpclient gem (formerly called http-access2). So do a ‘gem install httpclient’. Then you can use this code:

require 'rubygems'
require 'httpclient'

clnt = HTTPClient.new
puts clnt.get_content("https://www.random.org/")

This will just work. Under the covers, httpclient uses VERIFY_PEER as default. And you can see this by changing the hostname from www.random.org to random.org. That will generate a verification error directly. Awesome. So what’s the lesson? Never use net/https, folks!



Ruby Security quick guide


I’ve looked around a bit and it seems that there is no real good guide to security programming in Ruby. Neither is there any book available (although Maik Schmidts book Enterprise Recipes with Ruby and Rails will be the best reference once it arrives). The aim for this blog entry will be to note a few things you often would like to do, and how you can do it with Ruby. The focus will be mostly on the cryptographic APIs for Ruby, which doesn’t have much documentation either. In fact, the best documentation for this is probably the OpenSSL documentation, once you learn how to map it to the Ruby libraries.

You want to avoid clear text passwords in your database

One of the very good properties of handling passwords, is that you usually don’t need to actually know what they are. The only thing you need to be able to do is to reset them if someone forgets their password, and verify the correct combination of username and password. In fact, many practitioners feel better if they don’t need to have the responsibility of knowing the passwords of every user on their system, and conversely I feel much better knowing that my password is secure from even the administrators of the system. Not that I ever use the same password to two different systems, but someone else might… =)

So how do you solve this easily? Well, the easiest way – and also the most common way – is to use a digest. A digest is a mathematical function that takes as input a series of numbers of any length and returns a large number that represents the input text. There are three properties of digests that makes them useful: the first is that a small change in the input text generate a large effect in the outcome data, meaning that olabini and olbbini will have quite different digests.

The second is digests generally have a good distribution and small risk of collisions – meaning that it’s extremely unlikely that two different texts have the same output for a specific digest algorithm. The third property is that they are fundamentally one way mathematical operations – it’s very easy to go from plain text to digest, but extremely hard to go in the other direction.

There are currently several algorithms used for digests. MD5 and SHA-1 is by far the most common ones. If you can avoid MD5, do so, because there have recently been some successful attacks against it. The same might happen with SHA-1 at some point, but right now it’s the best algorithm for these purposes. Oh, and avoid doing a double digest – digesting the output of a digest algorithm – since this generally makes the plain text easier to crack, not harder.

Let’s see some code:

require 'openssl'

digest = OpenSSL::Digest::SHA1.hexdigest("My super secret pass")
p digest # => "dd5f30310682e5b41e122c637e8542b1b39466cf"

digest = OpenSSL::Digest::SHA1.hexdigest("my super secret pass")
p digest # => "f923786cc72ed61ae31325b6e8e285e6c35e6519"

d = OpenSSL::Digest::SHA1.new
d << "first part of pass phrase"
d << "second part of pass phrase"
p d.hexdigest # => "f13d7bdee0634c017babb8c72dcebe18f9e0598e"

I have used two different variations here. The first one, calling a method directly on the SHA1 class, is useful when you have a small string that you want to digest immediately. It’s also useful when you won’t need different algorithms or send the digest object around. The second method allows you to update the string data to digest several times and the finally generate the end digest from that. I have taken the liberty of using the methods called hexdigest for both cases – this is because they are more readable. If you replace hexdigest with digest, you will get back a Ruby string that contains characters from 0 to 255 instead, which means they doesn’t print well. But if you were to compare the result of hexdigest and digest, you will see that they return the same data, just in different formats.

So as you can see, this is extremely easy to incorporate in your code. To verify a password you just make a digest of the password the person trying to authenticate sent in to you, and compare that to what’s in your database. Of course, the approach is generalizable to other cases when you want to protect data in such a way that you can’t recover the data itself.

Finally, be careful with this approach. You should generally combine the password with something else, to get you good security. There are attacks based on something called rainbow tables that makes it easy to find the password from a digested password. This can be avoided using a salt or other kind of secret data added to the password.

You want to communicate with someone securely
When you want to so send messages back and forth between actors, you generally need a way to turn it unreadable and then turn it back into something readable again. The operation you need for this is called a cipher, but there are many kinds of ciphers, and not all of them are right. For example, rot-13 be considered a cipher, but there is no real security inherent in it.

In general, what you want for communication between two parties is a symmetric cipher using a secret key. A symmetric cipher means that you can use the same key to “lock” and “unlock” a message – or encrypt and decrypt it. There are other kinds of ciphers that are very useful, which we’ll see in the next example, but symmetric ciphers are the most commonly used ciphers since they are quite easy to use and also very efficient. For a symmetric cipher to be completely safe, the key should always be as long as the data that should be encrypted. Generally this doesn’t happen, since key distribution becomes a nightmare, but the current approaches are reasonably sure against most kinds of cracking. Coupled with asymmetric ciphers, they become extremely useful.

There are three things you need to use a symmetric cipher. The first one is the algorithm. There are loads of different kinds of symmetric ciphers around. Which kind you will use doesn’t matter that much as long as you choose something that is reasonably sure. One of the more widely used algorithms is called DES. In it’s original form DES should definitely not be used (since the key length is only 56 bits, it’s actually not that hard to crack it.). There is another form of DES called Triple DES, or DES3, which effectively gives you more security. DES3 might work in some circumstances, but I would recommend AES in almost all cases. AES come in three varieties called AES-128, AES-192 and AES-256. The difference between them is the key length. As you might guess, AES-128 needs a 128 bit long key, AES-192 a 192 bit long and AES-256 a 256 bit long. These key lengths all give reasonable security. The more security you need the longer key algorithm you can choose. The tradeoff is that the longer the key is, the more time it will take to encrypt and decrypt messages.

Once you have an algorithm, you need a key. This can come in two varieties – either you get a key from a humans, where that key is generally a password. Otherwise you might want to automatically generate a key. This should be done with a secure random number generator – NOT with rand().

Depending on which algorithm you choose, you might also need something called an IV (initialization vector). The algorithms that require an IV is called cyclic block ciphers (CBC) and will work on a small amount of bytes at the same time. In the case of AES-128, a block of 16 bytes are generated on every cycle of the algorithm. These 16 bytes will be based on the algorithm, the key, and the 16 bytes generated the last time. The problem is that the first time there were no bytes generated, which means these will have to be initialized another way and this is where the IV comes in. It’s just the first block that will be used for generating the first real block of data. The IV does not need to be secured and can be sent in the clear. In Ruby, if you don’t provide an IV when initializing your cipher, the default will be a part of the string “OpenSSL for Ruby rulez!”. Depending what length the IV should have, a substring will be used.

So, let’s take a look at some code. This code will just encrypt a message with a password and then decrypt it again:

require 'openssl'

cipher = OpenSSL::Cipher::AES128.new("CBC")
cipher.encrypt
cipher.key = "A key that is 16"
cipher.iv =  "An IV that is 16"

output = ""

output << cipher.update("One")
output << cipher.update(" and two")
output << cipher.final

p output # => "\023D\aL\375\314\277\264\256\245\225\a\360|\372+"

cipher = OpenSSL::Cipher::AES128.new("CBC")
cipher.decrypt
cipher.key = "A key that is 16"
cipher.iv =  "An IV that is 16"

output2 = ""

output2 << cipher.update(output)
output2 << cipher.final

p output2 # => "One and two"

We first create a cipher instance, giving it CBC as the type. This is the default but will generate a warning if you don’t supply. We tell the cipher to encrypt, then initialize it with key= and iv=. The key is 16 bytes because it’s a 128 bit cipher, and the IV is 16 bytes because that’s the block size of AES. Finally we call update on the data we would like to encrypt. We need to save the return value from the update call, since that’s part of the generated cipher text. This means that even encrypting really large texts can be very efficient, since you can do it in smaller pieces. Finally you need to call final to get the last generated cipher text. The only difference when decrypting is that we call decrypt on the object instead of encrypt. After that the same update and final calls are made. (Am I the only one who thinks that encrypt and decrypt should be called encrypt! and decrypt!)?

That’s how easy it is to work with ciphers in Ruby. There are some complications with regards to padding, but you generally don’t need to concern yourself with that in most applications.

You want nodes to be able to communicate with each other securely without the headache of managing loads of secret keys

OK, cool, symmetric ciphers are nice. But they have one weakness, which is the secret key. The problem shows up if you want to have a network of computers talking to each other securely. Of course, you can have a secret key for all of them, which means that all nodes in the network can read messages to other parties. Or you can have a different secure key for each combination of nodes. That’s ok if you have 3 nodes (when you will just need 3 secret keys). But if you have a network with a 100 nodes that all need to communicate with each other you will need 4950 keys. It will be hard to distribute these keys since they need to be kept secure, and generally just managing it all will be painful.

The solution to this problem is called asymmetric ciphers. The most common form of this is public key cryptography. The idea is that you need one key to encrypt something, and another key to decrypt it. If you have key1 and encrypt something with it, you can’t decrypt that something with key1. This curious property is extremely useful. In the version of public key cryptography you generate two keys, and then you publish one of the keys widely around. Since the public key can’t be used to decrypt content there is no security risk in not keeping it secret. The private key should obviously be kept private, since you can generate the public key from it. What does this mean in practice? Well, that in your 100 node network, each node can have a key pair. When node3 wants to send a message to node42, node3 will ask around for the public key of node42, encrypt his message with that and send it to node42. Finally, node42 will decrypt the message using his private key.

There is one really large downside with these ciphers. Namely, it is quite expensive operations, even compared to other ciphers. So the way you generally solve this is to generate a random key for a symmetric cipher, encrypt the message with that key and then encrypt the actual key using the asymmetric cipher. This is how https works, it’s how SSH works, it’s how SMIME and PGP works. It’s generally a good way of matching the strengths and weaknesses of ciphers with each other.

So how do you use an asymmetric cipher? First you need to generate the keys, and then you can use them. To generate the keys you can use something like this:

require 'openssl'

key = OpenSSL::PKey::RSA.generate(1024)
puts key.to_pem
puts key.public_key.to_pem

This will give you the output in form of one private key and one public key in PEM format. Once you have this saved away somewhere, you can start distributing the public key. As the name says, it’s public so there is no problem with distributing it. Say that someone has the public key and wants to send you a message. If the PEM-encoded public key is in a file called public_key.pem this code will generate a message that can only be decrypted with the private key and then read the private key and decrypted the message again.

require 'openssl'

key = OpenSSL::PKey::RSA.new(File.read("public_key.pem"))
res = key.public_encrypt("This is a secret message. WOHO")
p res

key2 = OpenSSL::PKey::RSA.new(File.read("private_key.pem"))
res2 = key2.private_decrypt(res)
p res2

Note that the methods we use are called public_encrypt and private_decrypt. Since the public and private prefix is there, there has to be a reason for it. And there is, as you’ll see soon.

You want to prove that it was you and only you who wrote a message

Being able to send things in private is all good and well, but if you distribute you public key wide you may never know who you get messages from. Or rather, you know that you get a message from their addresses (if you’re using mail) – but you have no way of ensuring that the other person is actually who they say they are. There is a way around this problem too, using asymmetric ciphers. Interestingly, if you encrypt something using your private key, anyone with your public key can decrypt it. THat doesn’t sound very smart from a security perspective, but it’s actually quite useful. Since you are the only one with your private key, if someone can use your public key to decrypt it, then you have to be the one who wrote the message. This have two important consequences. First, you can always trust that the person who wrote you a message was actually the one writing it. And second, that person can never retract a message after it’s been written, since it’s been signed.

So how do you do this in practice? It’s called cryptographic signatures, and OpenSSL supports it quite easily. If you have the keys we created earlier, you can do something like this using the low level operations available on the keys:

require 'openssl'

pub_key = OpenSSL::PKey::RSA.new(File.read("public_key.pem"))
priv_key = OpenSSL::PKey::RSA.new(File.read("private_key.pem"))

text = "This is the text I want to send"
signature = priv_key.private_encrypt(text)

if pub_key.public_decrypt(signature) == text
  puts "Signature match"
else
  puts "Signature didn't match!"
end

There happens to be a slight problem with this code. It works, but if you want to send the signature along the message will always be twice the size of the original text. Also, the larger the text to encrypt the longer it takes. The way this is solved in basically all cases is to first create a digest of the text and then sign that. The code to do that would look like this:

require 'openssl'

pub_key = OpenSSL::PKey::RSA.new(File.read("public_key.pem"))
priv_key = OpenSSL::PKey::RSA.new(File.read("private_key.pem"))

text = "This is the text I want to send"*200

signature = priv_key.sign(OpenSSL::Digest::SHA1.new,text)

if pub_key.verify(OpenSSL::Digest::SHA1.new, signature, text)
  puts "Signature verified"
else
  puts "Signature NOT verified"
end

As you can see we use the sign and verify methods on the keys. We also have to send in the digest to use.

You want to ensure that a message doesn’t get modified in transit

Another usage of signatures, that we actually get for free together with them, is that there is no way to tamper with the message in transit. So even if you send everything in clear text, if the signature verification succeeds you know that the message can’t have been tampered with. The reason is simply this: if the text was tampered with, the digest that gets generated would be different, meaning that the signature would not be equal to what’s expected. And if someone tampered with the signature, it wouldn’t match either. Theoritically you can tamper with both the text and the signature, but then you’d first have to crack the private key used, since otherwise you would never be able to generate a correct signature. All of this comes for free with the earlier code example.

You want to ensure that you can trust someones public key
Another thing that can get troublesome with public and private keys is the distribution problem. When you have lots of agents you need to make sure that the public key you get from someone is actually their public key and that they are the ones they say. In real life this is a social problem as much as a technological. But the way you generally make sure that you can trust that someones public key is actually theirs is that you use somethong called a certificate. A certificate more or less is a signature, where the signer is someone you trust, and the text that is signed is the actual public key under question. If you trust someone that issues certificates, and that someone has issued a certificate saying that a public key belongs to a specific entity, you can trust that public key and use it. What’s more interesting is that certificates can also be signed, which means that you can trust someone, that someone can sign another certificate authority, which signs another certificate authority, which finally signs a public key. If you trust the root CA (certificate authority) you should also be able to trust all the certificates and public keys in the chain. This is how the infrastructure surrounding https works. You have a list of implicitly trusted certificates in your browser, and if an https site can’t be verified with this trust store, you will get one of those popup boxes asking if you really want to trust this site or not.

I will not show any code for this, since X509 certificates are actually quite well documented for Ruby. They are also a solution that takes some more knowledge to handle correctly, so a real book would be recommended.

You want to use https correctly

The protocol https is very useful, since it allows you to apply the whole certificate and asymmetric cipher business to your http communication. There is a library called net/https which works really well if you use it right. The problem is that I see lots and lots of example code that really doesn’t use it correctly. The main problem is the setting of verification level. If you have the constant VERIFY_NONE anywhere in your code, you probably aren’t as secure as you’d like to think. The real problem is that I’ve never been able to get Ruby to verify a certificate with VERIFY_PEER. No matter what I do in the manner of adding stores and setting ca_file and ca_path, I haven’t ever been able to make this work. It’s quite strange and I can’t seem to find much mention of it online, since everyone just uses VERIFY_NONE. That’s not acceptable to me, since if no verification happens, there is no way to know if the other party is the right web site.

This is one area where JRuby’s OpenSSL seems to work better, in fact. It verifies those sites that should be verified and not the others.

If anyone feels like illuminating this for me, I would be grateful – the state of it right now is unacceptable if you need to do really secure https connections.

Keep security in mind

My final advice is twofold and have nothing to do with Ruby. First, always keep security in mind. Always think about how someone externally can exploit what you’re doing, intercept your connections, or just listen to them. Programmers generally need to be much more aware of this.

The last thing I’ll say is this: start reading Cryptogram by Bruce Schneier. While not completely technical anymore, it talks about lots of things surrounding security and also gives a very good feeling about how you should think and approach security. Subscribe here.



Where is the Net::SSH bug


Yesterday I spent several hours trying to find the problem with our implementation of OpenSSL Cipher, that caused the Net::SSH gem to fail miserable during negotiation and password verification. After various false leads I finally found the reason for the strange behavior. But I really can’t decide if it’s a bug, and if it’s a bug where the bug is. Is it in Ruby’s interface to OpenSSL, or is it in Net::SSH?

No matter what cipher suite you use for SSH, you generally end up using a block cipher, mostly something like CBC. That means an IV (initialization vector) is needed, together with a key. The relevant parts of OpenSSL used is the EVP_CipherInit, EVP_CipherUpdate and EVP_CipherFinal family of methods. Nothing really strange there. The Ruby interface matches these methods quite closely; every time you set a key, or an IV, or some other parameter, the CipherInit method is called with the relevant data. When CipherUpdate is called, the actual enciphering or deciphering starts happening, and CipherFinal takes care of the final block.

At the point EVP_CipherFinal is called, nothing more should be done using the specific Cipher context. Specifically, no more Update operations should be used. The man page has this to say about the Final-methods:

After this function is called the encryption operation is finished and no further calls to EVP_EncryptUpdate() should be made.

Now, what I found was that same documentation is not part of the Ruby interface. And Net::SSH is actually reusing the same Cipher object after final has been called on it. Specifically, it continues the conversation, calling update a few times and then final. The general flow for a specific Cipher object in Net::SSH is basically init->update->update->final->update->update->final.

So what is so bad about this then? Well, the question is really this: what IV will the operations after the first final call be using? The assumption I made is that obviously it will use the original IV set on the object. Something else would seem absurd. But indeed, the IV used is actually the last IV-length bytes of encrypted data returned. Is this an obvious or intended effect at some level? Probably not, since the OpenSSL documentation says you shouldn’t do it. The reason it works that way is because the temporary buffer used in the Cipher context isn’t cleared out at the end of the call to final.

In contrast, the Java Cipher object will call reset() as part of the call to doFinal(). Where reset() will actually reset the internal buffers to use the original IV. So the solution is simple for encryption. Just save away 8 or 16 bytes of the last generated crypto text and set that manually as the IV after the call to doFinal. And what about decryption? Well, here the IV needs to be the last crypto text sent in for deciphering, not the result of the last operation.

So Net::SSH seems to work fine with JRuby now. I’m about to release a new version of JRuby-OpenSSL including these and many other things.

But the question remains. Is it a bug? If it is, is it in the Ruby OpenSSL integration, or in the Net::SSH usages of Ciphers? If it’s in the Net::SSH code, why does it actually work correctly when communicating with an SSH server? Or is this behavior of using the last crypto text as IV something documented in the SSH spec?

Enlightenment would be welcome.



Java and mocking


I’ve just spent my first three days on a project in Leeds. It’s a pretty common Java project, RESTful services and some MVC screens. We have been using Mockito for testing which is a first for me. My immediate impression is quite good. It’s a nice tool and it allows some very clean testing of stuff that generally becomes quite messy. One of the things I like is how it uses generics and the static typing of Java to make it really easy to make mocks that are actually type checked; like this for example:

Iterator iter = mock(Iterator.class);stub(iter.hasNext()).toReturn(false);

// Call stuff that starts interaction
verify(iter).hasNext();

These are generally the only things you need to stub stuff out and verify that it was called. The things you don’t care about you don’t verify. This is pretty good for being Java, but there are some problems with it too. One of the first things I noticed I don’t like is that interactions that isn’t verified can’t be disallowed in an easy way. Optimally this would happen at the creation of the mock, instead of actually calling the verifyNoMoreInteractions() afterwards instead. It’s way to easy to forget. Another problem that quite often comes up is that you want to mock out or stub some methods but retain the original behavior of others. This doesn’t seem possible, and the alternative is to manually create a new subclass for this. Annoying.

Contrast this to testing the same interaction with Mocha, using JtestR, the difference isn’t that much, but there is some missing cruft:

iter = mock(Iterator)
iter.expects(:hasNext).returns(false)

# Call stuff that starts interaction

Ruby makes the checking of interactions happen automatically afterwards, and so you don’t have any types you don’t need to care about most stuff the way you do in Java. This also shows a few of the inconsistencies in Mockito, that is necessary because of the type system. For example, with the verify method you send the mock as argument and the return value of the verify-method is what you call the actual method on, to verify that it’s actually called. Verify is a generic method that returns the same type as the argument you give to it. But this doesn’t work for the stub method. Since it needs to return a value that you can call toReturn on, that means it can’t actually return the type of the mock, which in turn means that you need to call the method to stub before the actual stub call happens. This dichotomy gets me every time since it’s a core inconsistency in the way the library works.

Contrast that to how a Mockito like library might look for the same interaction:

iter = mock(Iterator)
stub(iter).hasNext.toReturn(false)

# Do stuff
verify(iter).hasNext

The lack of typing makes it possible to create a cleaner, more readable API. Of course, these interactions are all based on how the Java code looked. You could quite easily imagine a more free form DSL for mocking that is easier to read and write.

Conclusion? Mockito is nice, but Ruby mocking is definitely nicer. I’m wondering why the current mocking approaches doesn’t use the method call way of defining expectations and stubs though, since these are much easier to work with in Ruby.

Also, it was kinda annoying to upgrade from Mockito 1.3 to 1.4 and see half our tests starting to fail for unknown reasons. Upgrade cancelled.



Testing Regular Expressions


Something has been worrying me a bit lately. Being test infected and all, and working for ThoughtWorks, where testing is part of the life blood, I think more and more about these issues. And one thing I’ve started noticing is that regular expressions seems to be a total blind spot in many cases. I first started thinking about it when I changed a quite complicated regular expression in RSpec. Now RSpec has coverage tests as part of their build, and if the test coverage is less than a 100%, the build will fail. Now, since I had changed something to add new functionality, but hadn’t added any tests for it, I instinctively assumed that it would be caught be the coverage tool.

Guess what? It wasn’t. Of course, if I had changed the regexp to do something that the surrounding code couldn’t support, one of the tests for surrounding lines of code would have caught it, but I got no mention from the coverage tool that I needed more tests to fully handle the regular expressions. This is logical if you think about it. There is no way that a coverage tool could find all the regular expressions in your source code, and then make sure that all branches and alternatives of that particular regular expression was exercised. So that means that the coverage tool doesn’t do anything with them at all.

OK, I can live with that, but it’s still one of those points that would be very good to keep in mind. Every time you write a regular expression in your code, you need to take special care to actually exercise that part of the code with many inputs. What is many in this case? That’s another part of the problem – it depends on the regular expression. It depends on how complicated it is, how long it is, how many special operators are used, and so on. There is no real way around it. To test a regular expression, you really need to understand how they work. The corollary is obvious – to use a regular expression in your code, you need to know how to test it. Conclusion – you need to understand regular expressions.

In many code bases I haven’t seen any tests for regular expressions at all. In most cases these have been crafted by writing them outside the code, testing them by hand, and then putting them in the code. This is brittle to say the least. In the cases where there are tests, it’s much more common that they only test positives, and not negatives. And I’ve seldom heard of code bases with enough tests for regular expressions. One of the problems is that in a language like Ruby, they are so easy to use, so you stick them in all over the place. A standard refactoring could help here, by extracting all literal regular expressions to constants. But then the problem becomes another – as soon as you use regular expressions to extract values from a string, it’s a pain to not have the regular expression at the same place as the extracted groups are used. Example:

PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/
# 200 lines of code
if phone_number =~ PhoneRegexp
  puts "phone number is: #$1-#$2-#$3"
end

If the regular expression had been at the same place as the usage of the $1, $2 and $3 it would have been easy to tie them to the parts of the string. In this case it would be easy anyway, but in more complicated cases it’s more complicated. The solution to this is easy – the dollar numbers are evil: don’t use them. Instead use an idiom like this:

area, number, extension = PhoneRegexp.match(phone_number).captures

In Ruby 1.9 you will be able to use named captures, and that will make it even easier to make readable usage of the extracted parts of a string. But fact is, the difference between the usage point and the definition point can still cause trouble. A way of getting around this would be to take any complicated regular expression and putting it inside of a specific class for only that purpose. The class would then encapsulate the usage, and would also allow you to test the regular expression more or less in isolation. In the example above, maybe creating a PhoneNumberParser would be a good idea.

At the end of the day, regular expressions are an extremely complicated feature, and in general we don’t test the usage of them enough. So you should start. Begin by first creating both positive and negative tests for them. Figure out the boundaries, and see where they can go wrong. Know regular expressions well enough to know what happens in these strange circumstances. Think about unicode characters. Think about whitespace. Think about greedy and lazy matching. As an example of something that took a long time to cause trouble; what’s wrong with this regexp that tries to discern if a string is a select statement or not?

/^\s*\(*\s*SELECT\W+/i

And this example actually covers most of the ground, already. It checks case insensitive. It checks for white space before any optional parenthesis, and for any white space after. It makes sure that the word SELECT isn’t continued by checking for at least one non word character. So what’s wrong with it? Well… It’s the caret. Imagine if we had a string like this:

"INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

The regular expression will in fact match this, even though it’s not a select statement. Why? Well, it just so happens that the caret matches the beginning of lines, not the beginning of strings. The dollar sign works the same way, matching the end of lines. How do you solve it? Change the caret to \A and the dollar sign to \Z and it will work as expected. A similar problem can show up with the “.” to match any character. Depending on which language you are using, the dot might or might not match a newline. Always make sure you know which one you want, and what you don’t want.

Finally, these are just some thoughts I had while writing it. There is much more advice to give, but it can be condensed to this: understand regular expressions, and test them. The dot isn’t as simple as it seem. Regular expressions are a full blown language, even though it’s not turing complete (in most implementations). That means that you can’t test it completely, in the general case. This doesn’t mean you shouldn’t try to cover all eventualities.

How are you testing your regular expressions? How much?