Language generation

This post is the first in a series of posts about PAIPr. Read here for more info about the concept.

Today I would like to start with taking a look at Chapter 2. You can find the code in lib/ch02 in the repository.

Chapter 2 introduces Common Lisp by building a series of programs that generate English sentences from simple grammars. It’s an interesting chapter to start with since the code is simple, which makes it easy to compare the Ruby and Common Lisp versions.

The first piece to take a look at is the file common.rb, which contains two methods we’ll need later on:

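require 'pp'

# Pick one element of set at random and wrap it in an array, so that
# results can be concatenated with +
def one_of(set)
  [set.random_elt]
end

class Array
  # Return a randomly chosen element of this array
  def random_elt
    self[rand(length)]
  end
end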
As you can see I’ve also required pp, to make it easier to print structures later on.

Both one_of and Array.random_elt are extremely simple methods, but it’s still nice to have the abstraction there. I’m retaining the naming from the book for these two methods.

The first real example defines a grammar by directly using methods. (From simple.rb):

require 'common'

def sentence; noun_phrase + verb_phrase; end
def noun_phrase; article + noun; end
def verb_phrase; verb + noun_phrase; end
def article; one_of %w(the a); end
def noun; one_of %w(man ball woman table); end
def verb; one_of %w(hit took saw liked); end

As you can see, all the methods just define their structure by combining the result of more basic methods. A noun phrase is an article, then a noun. An article is either ‘the’ or ‘a’, and a noun can be ‘man’, ‘ball’, ‘woman’ or ‘table’. If you run sentence a few times you will see that you sometimes get back quite sensible sentences, like ["a", "ball", "hit", "the", "table"]. But you will also get less interesting things, such as ["a", "ball", "hit", "a", "ball"]. At this stage the space for variation is quite limited, but you can still see a simplified structure of the English language in these methods.

To create an example that involves some more interesting structures, we can introduce adjectives and prepositions. Since these can be repeated zero or more times, we’ll use productions called PP* and Adj* (pp_star and adj_star in the code). This is from simple2.rb:

require 'simple'

def adj_star
  return [] if rand(2) == 0
  adj + adj_star
end

def pp_star
  return [] if rand(2) == 0
  pp + pp_star
end

def noun_phrase; article + adj_star + noun + pp_star; end
def pp; prep + noun_phrase; end
def adj; one_of %w(big little blue green adiabatic); end
def prep; one_of %w(to in by with on); end

Nothing really changes here, except that each of the two optional productions returns an empty array 50% of the time, and otherwise produces one element and calls itself recursively. The noun phrase production also changes a bit, and adj and prep give us the two new terminals needed. If you try this one, you might get some more interesting results, such as ["a", "table", "took", "a", "big", "adiabatic", "man"]. It’s still nonsensical of course. This random approach also generates quite large output in some cases. To make it really nice there should probably be a diminishing bias in the adjectives and prepositions based on the length of the already generated string.
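One way such a diminishing bias could look is sketched below. This is not code from the repository: biased_adj_star, ADJECTIVES and the rng parameter are names made up for illustration, and the sketch simplifies the idea by biasing on recursion depth rather than on the length of the whole generated string.

```ruby
ADJECTIVES = %w(big little blue green adiabatic)

# Hypothetical variant of adj_star: the chance of adding one more adjective
# halves at every level of recursion (1/2, then 1/4, then 1/8, ...).
def biased_adj_star(depth = 0, rng = Random.new)
  # Stop with probability 1 - 0.5**(depth + 1)
  return [] if rng.rand >= 0.5**(depth + 1)
  [ADJECTIVES[rng.rand(ADJECTIVES.size)]] + biased_adj_star(depth + 1, rng)
end

rng = Random.new(2008)  # fixed seed, so repeated runs print the same lists
5.times { p biased_adj_star(0, rng) }
```

Long runs of adjectives are still possible with this scheme, just increasingly unlikely. The same trick would apply to pp_star.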

Another problem with this approach is that it’s kinda unwieldy. Using methods for the grammar is probably not the right choice long term. More specifically, we are tied to this implementation by having the grammar be represented as methods.

A viable alternative is to represent everything as a grammar definition – using a rule based solution. The first part of rule_based.rb looks like this:

require 'common'

# A grammar for a trivial subset of English
$simple_grammar = {
  :sentence => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :Noun]],
  :verb_phrase => [[:Verb, :noun_phrase]],
  :Article => %w(the a),
  :Noun => %w(man ball woman table),
  :Verb => %w(hit took saw liked)}

# The grammar used by generate. Initially this is $simple_grammar, but
# we can switch to other grammars
$grammar = $simple_grammar

Note that I’m using double arrays for the productions that aren’t terminal. There is a reason for this that will become clearer in the later grammars based on this one. For now it’s easy to see that a production is either a list of words or a list of lists of productions. Production names beginning with a capital letter are terminals, which is a convention in most grammars. I didn’t use capital letters for the terminals in the method-based version because Ruby methods named like that cause additional trouble when calling them.

Now that we have the actual grammar we also need a helper method. PAIP defines rule-lhs, rule-rhs and rewrites, but the only one we actually need here is rewrites. (From rule_based.rb):

def rewrites(category)
  $grammar[category]
end

And actually, we could do away with it too, but it reads better than an index access.

The final thing we need is the method that actually creates a sentence from the grammar. It looks like this:

def generate(phrase)
  case phrase
  when Array
    phrase.inject([]) { |sum, elt|  sum + generate(elt) }
  when Symbol
    generate(rewrites(phrase).random_elt)
  else
    [phrase]
  end
end

If what we’re asked to generate is an array, we generate everything inside that array and combine the results. If it’s a symbol we know it’s a production, so we get all the possible rewrites and take a random element from them. Currently every non-terminal production has exactly one rewrite, so for those the random_elt isn’t strictly necessary, but as you’ll see later it’s quite nice. And finally, if phrase is neither an Array nor a Symbol, we just return the phrase, wrapped in an array, as the generated element.
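To see the three branches in isolation, here is a small stand-alone sketch. It repeats the grammar and methods so it can be pasted into irb on its own. Two liberties are taken, both assumptions of this sketch rather than what the repository does: the grammar lives in a constant called GRAMMAR instead of the $grammar global, and the built-in Array#sample stands in for random_elt.

```ruby
# Stand-alone sketch of the rule-based generator (GRAMMAR is a made-up name).
GRAMMAR = {
  :sentence    => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :Noun]],
  :verb_phrase => [[:Verb, :noun_phrase]],
  :Article     => %w(the a),
  :Noun        => %w(man ball woman table),
  :Verb        => %w(hit took saw liked)
}

def rewrites(category)
  GRAMMAR[category]
end

def generate(phrase)
  case phrase
  when Array  then phrase.inject([]) { |sum, elt| sum + generate(elt) }
  when Symbol then generate(rewrites(phrase).sample)
  else [phrase]
  end
end

p generate("ball")             # a plain word comes back wrapped in an array
p generate(:Noun)              # a terminal: one randomly chosen noun
p generate([:Article, :Noun])  # an array: the results are concatenated
p generate(:sentence)          # always five words with this grammar
```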

I especially like the use of inject as a more general version of (mappend #'generate phrase). Of course, for readability it would have been possible to implement mappend too:

def mappend(sym, list)
  list.inject([]) do |sum, elt|
    sum + self.send(sym, elt)
  end
end

But I chose to use inject directly instead, since it’s more idiomatic. Note that this version of mappend doesn’t work exactly like the Common Lisp mappend, since it doesn’t accept a lambda.
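If you did want something closer to the Common Lisp mappend, one option is to accept a block instead of a method name. The sketch below is only an illustration, and mappend_with is a made-up name. In current Ruby versions the built-in Enumerable#flat_map does essentially the same job.

```ruby
# Hypothetical block-taking mappend: apply the block to every element
# and concatenate the resulting arrays.
def mappend_with(list)
  list.inject([]) { |sum, elt| sum + yield(elt) }
end

p mappend_with([1, 2, 3]) { |n| [n, n * 10] }
# => [1, 10, 2, 20, 3, 30]

# The built-in equivalent:
p [1, 2, 3].flat_map { |n| [n, n * 10] }
# => [1, 10, 2, 20, 3, 30]
```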

Getting back to the generate method. If you were to run generate(:sentence), you would get the same kind of output as with the method based version – with the difference that changing the rules is much simpler now.

So for example, you can use this code from bigger_grammar.rb, which creates a larger grammar definition and then sets the default grammar to use it:

require 'rule_based'

$bigger_grammar = {
  :sentence => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :'Adj*', :Noun, :'PP*'], [:Name], [:Pronoun]],
  :verb_phrase => [[:Verb, :noun_phrase, :'PP*']],
  :'PP*' => [[], [:PP, :'PP*']],
  :'Adj*' => [[], [:Adj, :'Adj*']],
  :PP => [[:Prep, :noun_phrase]],
  :Prep => %w(to in by with on),
  :Adj => %w(big little blue green adiabatic),
  :Article => %w(the a),
  :Name => %w(Pat Kim Lee Terry Robin),
  :Noun => %w(man ball woman table),
  :Verb => %w(hit took saw liked),
  :Pronoun => %w(he she it these those that)}

$grammar = $bigger_grammar

This grammar includes some more elements that make the output a bit better. For example, we have names here, and also pronouns. One of the reasons this grammar is easier to use is that we can define several alternative versions of a production. So for example, a noun phrase can be the same as we defined earlier, but it can also be a single name, or a single pronoun. We use the same mechanism to handle the recursive PP* and Adj* productions. You can also see why the productions are defined with an array inside an array: this is what allows choices in the grammar.

Calling generate(:sentence) with this grammar typically gives something like ["Terry", "saw", "that"] or ["Lee", "took", "the", "blue", "big", "woman"].

So it’s easier to change these rules. I also believe that the rules are easier to read and understand here. But one of the more important changes with the data driven approach is that you can use the same rules for different purposes. Say that we want to generate a sentence tree, which includes the name of the production used for that part of the tree. That’s as simple as defining a new generate method (in generate_tree.rb):

require 'bigger_grammar'

def generate_tree(phrase)
  case phrase
  when Array
    phrase.map { |elt| generate_tree(elt) }
  when Symbol
    [phrase] + generate_tree(rewrites(phrase).random_elt)
  else
    [phrase]
  end
end

This code follows the same pattern as generate, with a few small changes. You can see that instead of appending the results from the Array together, we just map every element. This is because we need more sub arrays to create a tree. In the same manner, when we get a symbol we prepend it to the generated array. And actually, at this point it’s kinda interesting to take a look at the Lisp version of this code:

(defun generate-tree (phrase)
  (cond ((listp phrase)
         (mapcar #'generate-tree phrase))
        ((rewrites phrase)
         (cons phrase
               (generate-tree (random-elt (rewrites phrase)))))
        (t (list phrase))))

As you can see, the structure is mostly the same. I made a few different choices in representation, which means I’m checking if the phrase is a symbol instead of checking whether the rewrites for a symbol are non-nil. The call to mapcar is equivalent to the Ruby map call.

What does it generate then? Calling it with “pp generate_tree(:sentence)” I get something like this:

[:sentence,
 [:noun_phrase, [:Name, "Lee"]],
 [:verb_phrase,
  [:Verb, "saw"],
  [:noun_phrase,
   [:Article, "the"],
   [:"Adj*", [:Adj, "green"], [:"Adj*"]],
   [:Noun, "table"],
   [:"PP*"]],
  [:"PP*"]]]

which maps neatly back to our grammar. We can also generate all possible sentences for a grammar that has no recursive rules, using the same data driven approach.

The code for that can be found in generate_all.rb:

require 'rule_based'

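def generate_all(phrase)
  case phrase
  when []
    [[]]
  when Array
    combine_all(generate_all(phrase[0]), generate_all(phrase[1..-1]))
  when Symbol
    rewrites(phrase).inject([]) { |sum, elt|  sum + generate_all(elt) }
  else
    [[phrase]]
  end
end

def combine_all(xlist, ylist)
  ylist.inject([]) do |sum, y|
    sum + xlist.map { |x| x+y }
  end
end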

If you run generate_all(:sentence) you will get back a list of all 256 possible sentences from this simple grammar. In this case the algorithm is a bit more complicated. It’s also using the common Lisp idiom of working on the first element of a list and then recurring on the rest of it. This makes it possible to combine everything together. I assume that it should be possible to devise something suitably clever based on the new Array#permutation, or possibly Enumerable#group_by or zip.
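One candidate for "something suitably clever" is Array#product, which builds every combination of elements drawn from several arrays. The sketch below is an assumption about how that could look, not code from the repository; TOY_GRAMMAR and generate_all_product are made-up names, and the grammar is repeated so the snippet runs on its own. The count is easy to check by hand: 2 articles times 4 nouns gives 8 noun phrases, 4 verbs times 8 noun phrases gives 32 verb phrases, and 8 times 32 gives 256 sentences.

```ruby
TOY_GRAMMAR = {
  :sentence    => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :Noun]],
  :verb_phrase => [[:Verb, :noun_phrase]],
  :Article     => %w(the a),
  :Noun        => %w(man ball woman table),
  :Verb        => %w(hit took saw liked)
}

# Hypothetical alternative to generate_all built on Array#product.
# Like the original, it only terminates for grammars without recursion.
def generate_all_product(phrase, grammar = TOY_GRAMMAR)
  case phrase
  when Array
    return [[]] if phrase.empty?
    # One list of alternatives (each a list of words) per element
    first, *rest = phrase.map { |elt| generate_all_product(elt, grammar) }
    # Every combination of alternatives, joined into one flat word list
    first.product(*rest).map { |combo| combo.flatten(1) }
  when Symbol
    grammar[phrase].flat_map { |rewrite| generate_all_product(rewrite, grammar) }
  else
    [[phrase]]
  end
end

all = generate_all_product(:sentence)
puts all.size  # 256
```

The sentences come out in a different order than with combine_all, but the set is the same.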

It’s interesting how well the usage of mappend and mapcar maps to uses of inject and map in this code.

Note that I’ve been using globals for the grammars in this implementation. An alternative that is probably better is to pass along an optional parameter to the methods. If no grammar is supplied, just use the default constant instead.
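A minimal sketch of that alternative could look like the following. The names SIMPLE_GRAMMAR and generate_with, the tiny example grammar and the use of Array#sample in place of random_elt are all assumptions made for this illustration.

```ruby
SIMPLE_GRAMMAR = {
  :sentence    => [[:noun_phrase, :verb_phrase]],
  :noun_phrase => [[:Article, :Noun]],
  :verb_phrase => [[:Verb, :noun_phrase]],
  :Article     => %w(the a),
  :Noun        => %w(man ball woman table),
  :Verb        => %w(hit took saw liked)
}

# Hypothetical variant of generate that takes the grammar as an optional
# argument and falls back to a default constant.
def generate_with(phrase, grammar = SIMPLE_GRAMMAR)
  case phrase
  when Array
    phrase.inject([]) { |sum, elt| sum + generate_with(elt, grammar) }
  when Symbol
    generate_with(grammar[phrase].sample, grammar)
  else
    [phrase]
  end
end

# A made-up two-word grammar, only to show the parameter in use
tiny = { :sentence => [[:Name, :Verb]], :Name => %w(Pat Kim), :Verb => %w(slept) }

p generate_with(:sentence)        # uses SIMPLE_GRAMMAR
p generate_with(:sentence, tiny)  # e.g. ["Kim", "slept"]
```

The grammar has to be threaded through every recursive call, which is the price for getting rid of the global.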

Anyway, the code for this chapter is in the repository. Play around with it and see if you can find anything interesting. This code is definitely an introduction to Lisp, more than a serious AI program – although it does show the kind of approaches that have been used for primitive code generation.

The next chapter will talk about the General Problem Solver. Until then.

Comments

  1. James

    Hey Ola,

    Just wanted to let you know that the code snippet for $bigger_grammar appears to chop off the :noun_phrase line. I’m using FF3 on windows if that’s of any consequence.

    Really like the concept behind these posts and will be keeping an eye out for the next one.


    September 9th, 2008

  2. James,

    Thanks for that. I should have seen it when proofing it. It’s fixed now.

    September 9th, 2008

  3. Michel Demazure



    Typo : “then recur”, instead of “the recur”.


    September 10th, 2008

  4. Interesting post, thanks. I’m still learning Ruby so it’s good to have some practical examples!

    September 12th, 2008
