A while back Peter Norvig (of AI, Lisp and Google fame) published a small entry on how spelling correctors work. He included some Python code to illustrate the concept, and this code has ended up being a frequently used example when showcasing a programming language.
It would be ridiculous to suggest that I generally like to follow tradition, but in this case I think I will. This post will take a look at the version implemented in Ioke, and use that to highlight some of the interesting aspects of Ioke. The code itself is quite simple, and doesn’t use any of the object-oriented features of Ioke. Neither does it use macros, although some of the features used are built on macros, so I will get to explain a bit of that.
For those that haven’t seen my Ioke posts before, you can find out more at http://ioke.org. The features used in this example are not yet released, so to follow along you’ll have to download and build Ioke yourself. Ioke S should be out in about 1-2 weeks though, and at that point this blog post will describe released software.
First, for reference, here is Norvig’s original corrector. You can also find his corpora there: http://norvig.com/spell-correct.html. Go read it! It’s a very good article.
This code is available in the Ioke repository in examples/spelling/spelling.ik. I’m adding a few line breaks here to make it fit the blog width – except for that everything is the same.
Let’s begin with the method “words”:
words = method(text, #/[a-z]+/ allMatches(text lower))
This method takes a text argument and calls the method “lower” on it, which returns a new text that is the original converted to lower case. A regular expression that matches one or more alphabetical characters is then used, and the method allMatches is called on it. This method returns a list containing the text of every match.
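For comparison, Norvig’s Python original of this method is essentially:

```python
import re

def words(text):
    # Lowercase the text, then return every run of alphabetic characters.
    return re.findall('[a-z]+', text.lower())
```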
The next method is called “train”:
train = method(features,
  features fold({} withDefault(1), model, f,
    model[f] ++
    model))
The argument “features” should be a list of texts. It then calls fold on this list (you might know fold as reduce or inject; those names would have been fine too). The first argument to fold is the start value. This should be a dict with a default value of 1. The second argument is the name that will be used to refer to the accumulator, and the third argument is the name to use for the feature. Finally, the last argument is the actual code to execute. This code just takes the feature (which is a text), indexes into the dict and increments the number there. It then returns the dict, since that will be the model argument on the next iteration.
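Norvig’s Python version of train does the same counting; a defaultdict plays the role of the dict with a default value of 1:

```python
import collections

def train(features):
    # Count occurrences of each word; unseen words default to 1,
    # mirroring the Ioke dict created with withDefault(1).
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
```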
The next piece of code uses the former methods:
NWORDS = train(words(FileSystem readFully("small.txt")))
alphabet = "abcdefghijklmnopqrstuvwxyz" chars
As you can see we define a variable called NWORDS, which contains the result of first reading a text file, then extracting all the words from that text, and finally using that to train on. The next assignment gets a list of all the characters (as text) by calling “chars” on a text. I could have just written ["a", "b", "c", …] etc, but I’m a bit lazy.
OK, now we come to the meat of the code. For an explanation of why this code does what it does, refer to Norvig’s article:
edits1 = method(word,
  s = for(i <- 0..(word length + 1), [word[0...i], word[i..-1]])
  set(
    *for(ab <- s, ab[0] + ab[1][1..-1]),                                    ; deletes
    *for(ab <- s[0..-2], ab[0] + ab[1][1..1] + ab[1][0..0] + ab[1][2..-1]), ; transposes
    *for(ab <- s, c <- alphabet, ab[0] + c + ab[1][1..-1]),                 ; replaces
    *for(ab <- s, c <- alphabet, ab[0] + c + ab[1])))                       ; inserts
The mechanics of it are as follows. We create a method assigned to the name edits1. This method takes one argument called “word”. We then create a local variable called “s”. This contains the result of executing a for comprehension. There are several things going on here. The first part of the comprehension gives a generator (that’s the part with the <-). The thing on the right is what to iterate over, and the thing on the left is the name to give it on each iteration. Basically, this comprehension goes from 0 to the length of the word plus 1. (The two dots denote an inclusive Range.) The second argument to “for” is what to actually return. In this case we create a new array with two elements. The three dots create an exclusive Range. Ending a Range in -1 means that it will extract the text all the way to the end.
The rest of the code in this method is four different comprehensions. The results of these comprehensions are splatted, or spread out, as arguments to the “set” method. The * is the symbol that splats things. Basically, it means that instead of four lists, set will get all the elements of all the lists as separate arguments. Finally, set will create a set from these arguments and return it.
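For comparison, Norvig’s Python version of this method builds the same four candidate groups (alphabet here is the same list of characters defined earlier):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All (head, tail) splits of the word, mirroring the "s" variable above.
    s = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in s if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in s if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in s for c in alphabet if b]
    inserts    = [a + c + b for a, b in s for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
```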
Whew. That was a mouthful. The next method is easy in comparison. More of the same, really:
knownEdits2 = method(word, for:set(e1 <- edits1(word), e2 <- edits1(e1), NWORDS key?(e2), e2))
Here we use another variation of a comprehension, namely a set comprehension. A regular comprehension returns a list. A set comprehension returns a set instead. This comprehension will only return words that are available as keys in NWORDS.
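Python draws the same distinction: a list comprehension keeps duplicates, while a set comprehension collapses them. A tiny illustration:

```python
words_list = ['the', 'cat', 'the', 'dog']

# List comprehension: one result per input, duplicates preserved.
lengths_list = [len(w) for w in words_list]

# Set comprehension: duplicates collapse into a single element.
lengths_set = {len(w) for w in words_list}
```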
known = method(words, for:set(w <- words, NWORDS key?(w), w))
This method uses a set comprehension to find all words in “words” that are keys in NWORDS. At this point you might wonder what a comprehension actually is. And it’s quite easy. Basically, a comprehension is a piece of nice syntax around a combination of calls to “filter”, “map” and “flatMap”. In the case of a set comprehension, the calls go to “filter”, “map:set” and “flatMap:set” instead. The whole implementation of comprehensions is available in Ioke, in the file src/builtin/F10_comprehensions.ik. Beware though: it uses some fairly advanced macro magic.
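To make that desugaring concrete, here is a rough Python sketch (toy NWORDS and hypothetical helper names, just for illustration) of what a comprehension expands to:

```python
from itertools import chain

NWORDS = {'the': 5, 'cat': 2}  # toy model, for illustration only

def flat_map(f, xs):
    # flatMap: map each element to an iterable, then flatten one level.
    return chain.from_iterable(map(f, xs))

# for:set(w <- words, NWORDS key?(w), w) is, conceptually,
# a filter followed by a map, collected into a set:
def known(words):
    return set(map(lambda w: w, filter(lambda w: w in NWORDS, words)))

# A nested comprehension like for(x <- xs, y <- ys, [x, y]) becomes
# a flatMap over the outer generator with a map over the inner one:
def pairs(xs, ys):
    return list(flat_map(lambda x: map(lambda y: [x, y], ys), xs))
```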
OK, back to spelling. Let’s look at the last method:
correct = method(word,
  candidates = known([word]) ifEmpty(
    known(edits1(word)) ifEmpty(
      knownEdits2(word) ifEmpty(
        [word])))
  candidates max(x, NWORDS[x]))
The correct method takes a word to correct and returns the best possible match. It first checks whether the word is already known. If the result of that is empty, it tries to see if any edits of the word are known, and if that doesn’t work, if any edits of edits are known. Finally it just falls back to the original word. If more than one candidate spelling is known, the max method is used to determine which one occurred most often in the corpus.
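The same selection logic can be sketched in Python. Norvig’s original uses an or-chain, which works because empty sets are falsy; the extra stub arguments standing in for the edits1/knownEdits2 candidate sets here are hypothetical, just to keep the sketch self-contained:

```python
NWORDS = {'spelling': 3, 'spewing': 1}  # toy corpus counts, for illustration

def known(words):
    # Keep only the candidates that occur in the corpus.
    return {w for w in words if w in NWORDS}

def correct(word, edit_candidates, edit2_candidates):
    # Each known(...) returns a set; an empty set is falsy in Python,
    # so "or" falls through exactly like the nested ifEmpty calls.
    candidates = (known([word]) or known(edit_candidates) or
                  known(edit2_candidates) or [word])
    # Pick the candidate seen most often in the corpus.
    return max(candidates, key=lambda w: NWORDS.get(w, 1))
```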
The ifEmpty macro is a bit interesting. What it does is pretty simple, and you could have written it yourself. But it just happens to be part of the core library. The implementation looks like this:
List ifEmpty = dmacro(
  "if this list is empty, returns the result of evaluating the argument, otherwise returns the list",

  [then]
  if(empty?, call argAt(0), self))
This is a dmacro, which basically just means that the handling of arguments is taken care of. The arguments for a destructured macro can be seen in the square brackets. Using a dmacro instead of a raw macro means that we will get a good error message if the wrong number of arguments is provided. The implementation checks if its list is empty. If it is, it returns the value of the first argument; otherwise it doesn’t evaluate anything and returns itself.
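Python has no unevaluated arguments, but the lazy behaviour of ifEmpty can be approximated by passing a thunk (a zero-argument function). This analogue is hypothetical, not part of either corrector:

```python
def if_empty(lst, then):
    # Like Ioke's List ifEmpty: evaluate the alternative only when
    # the list is actually empty; otherwise return the list untouched.
    return then() if not lst else lst
```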
So, you have now seen a 16 line spelling corrector in Ioke (it’s 16 without the extra line breaks I added for the blog).

Hey Ola,
thanks for the interesting article. It’s certainly clarified a few more things for me, thanks.
If you have time, I do have a couple of questions though. Firstly, it seems that for comprehensions are pretty similar to calling each on a list. For example, I imagine the following lines are similar:
for(i <- 0..10+1, i println) ; for comprehension
(0..10+1) each(println) ; each on a range
(0..10+1) each(i, i println) ; each with explicit variable name
If this is true (that they are similar), do you have any guidelines on when to use each form, and if they’re not similar, I’d love to know what the differences are.
Just fumbling around, it seems that the differences arise when you apply multiple comprehensions in the same for clause:
for(i <- [1,2,3], j <- [-1, -2, -3, -4, -5], "i:#{i}, j:#{j}" println)
January 7th, 2009
Also, although I do believe in the potential expressive power of Ioke, it seems that in this case the Python version is a little more readable. (Perhaps this is due to a remarkably direct translation).
As an example, consider the following two statements:
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] ## Python
candidates = known([word]) ifEmpty(known(edits1(word)) ifEmpty(knownEdits2(word) ifEmpty([word]))) ;; Ioke
January 7th, 2009
Hi Ola!
I’ve got a few questions:
1. From what I remember about writing the corrector in Common Lisp, the hardest problem was, surprisingly, properly processing *big.txt*. Did you try to do it with this example? Or maybe you would rely completely on Java for this task?
2. I totally couldn’t understand how dmacro works (maybe part of the code was cropped?)
3. (Elaborating on the previous comment) In which aspects would you call the Ioke program better than the Python equivalent?
January 8th, 2009
Sam:
Yes, you’re totally right, using for comprehensions in this way is pretty similar to using the enumerable things. As I wrote in the article, for comprehensions are actually translated directly into a combination of map, filter and flatMap. For the simple case of just doing for(i <- 1..10, i println) there is really no point in using a comprehension. Specifically, a comprehension should be used either when you care about the result or when the explicit looping would be hard to read. This goes for nesting: 1..10 flatMap(x, 1..5 map(y, [x,y])) is probably not as readable as for(x <- 1..10, y <- 1..5, [x,y]). Comprehensions serve as a higher abstraction, basically, one that is sometimes more readable. It’s also a style thing. Also, filtering is built in. But you can do it all with the primitives; it might not be as readable though.
January 8th, 2009
Sam:
Yes, you’re right, in this case the Python code is very clean. One reason for that is that it relies on empty things being false. That is, an empty string is false, an empty list is false, etc. I generally don’t like that at all in a programming language, but in this specific instance the or-operator makes it cleaner to read.
… Of course, that’s not something that couldn’t be changed. And in fact, in the latest version of the repository I’ve added relaxed or and relaxed and operators, that behave the same as their Python counterparts with regard to empty things. So the new spelling corrector has this line instead:
candidates = known([word]) ?| known(edits1(word)) ?| knownEdits2(word) ?| [word]
where ?| is the relaxed or.
January 8th, 2009
vsevolod:
Well, I’ve tried big.txt and it works, although it’s a bit slow. The code is the same, except for the filename.
Dmacros are a bit hard to understand. I’m planning a new blog post that describes the reasoning behind them and how they are used. Basically, dmacro is a macro that generates another macro. The generated macro includes pattern matching and verification that the macro got the correct number of arguments, and then assigns these to the names inside the square brackets.
Oh, “better” is a VERY loaded word. I’m not sure I would ever use it. I do know that Ioke allows me to condense and DRY up functionality in a way you cannot in Python. Take for example the for comprehensions. These are (including the syntax) implemented entirely in Ioke; nothing extra needed. Of course, that is not impressive to a Lisp guy… =)
January 8th, 2009
“The features used in this example is not yet released…”, “A regular expression that matches one or more alphabetical characters are used…”
I have a Swedish colleague who has been making exactly the same kind of errors for 15 years. So I assume it’s something in your language that puts you on the wrong foot.
This is in no way a criticism (the errors are totally irrelevant), but more of an observation: it surprises me that somebody so gifted and experienced in crafting “computer languages” writes relatively trivial English grammar buglets. I guess it means that we probably over-estimate the language aspect of programming languages… our brain doesn’t spend a lot of time on the language aspect of programming languages and sees through it to tackle the logical puzzles we’re trying to solve… which activates other areas of the brain… English grammar rules are obviously not stored/activated in the same way as the rules you have stored for Ruby, Lisp, Java and… Ioke.
Again, this is not a criticism (I’m not a native speaker myself and make my share of mistakes) and I take my hat off to you for the many miles you’re ahead of me in your craft. And I do hope that you’ll keep on blogging in English!
January 8th, 2009
Jurgen:
You are totally right, I’ve always had a problem with that specific piece of English. I assume you’re referring to is vs are? For some reason I never get those right. Not sure if it is a Swedish thing or not, but I keep on doing it… =)
January 8th, 2009
Ola, Jurgen:
I’ve mentioned this before. I find that often when there is a grammatical error it disrupts the flow of my reading such that I have to go back and read the whole sentence word-for-word. So it’s a little distracting, but no big deal really.
However, would it be possible to hook up a grammar checker to your blog editor?
I don’t know if there would be anything particular to Swedish that would make us prone to making these kinds of mistakes. We don’t conjugate verbs to agree with nouns or pronouns, so perhaps we lack the innate language circuitry to get it right.
As consolation, I’ve noted that native English speakers often get similar things wrong. For instance, “There exist many ways to proceed from here” is commonly mis-conjugated.
January 9th, 2009