Friday, March 26, 2010

How to Write a Spelling Corrector

Several years ago, I came across an article by Peter Norvig about how to write a simple spelling corrector. It occurred to me recently that it was missing an implementation in Factor, so I thought I'd contribute one.

Note: I try to keep the structure of this implementation close to his original Python, so that it can be more easily understood, and more easily tested in parts.

The premise behind his spelling correction algorithm is that when a word is mistyped, it is "close" to the word actually intended. For each incorrect word, the algorithm looks for all the correctly spelled words that can be made by one or two "fixes" (e.g., deleting an extra letter, swapping two adjacent letters, or inserting a missing letter). This gives you the list of possible corrections.

Now, it wouldn't be Peter Norvig if it didn't have probabilities in it. And sure enough, he introduces a very simple solution for finding the "most likely correct word": pick, from the list of possible corrections, the word that is most commonly used in the given language.
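As I recall, the selection in his Python version comes down to just a couple of lines (where known filters candidates against the NWORDS frequency table built from the training text, and known_edits2 does the same for two-step edits):

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

Note that his version falls back to the word itself when nothing is known; ours below will return f instead.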

We will first define a vocabulary to hold our words, and list the other vocabularies that ours will depend upon:

USING: ascii arrays assocs combinators fry kernel
memoize make math math.order math.ranges io.encodings.ascii
io.files sequences sorting splitting strings ;

IN: spelling

To begin our implementation, we need to generate a list of words and their frequencies of occurrence. Norvig compiled a file called big.txt that contains about a million words from various public domain sources, which we will also use. We need to:

  1. Read the contents of the big.txt file.
  2. Split the text into words, using lowercase for case-insensitivity.
  3. Count the occurrences of words in the text.

This is accomplished through the following words:

: words ( string -- words )
    >lower [ letter? not ] split-when [ empty? not ] filter ;

: train ( words -- counts )
    H{ } clone [ '[ _ inc-at ] each ] keep ;

MEMO: NWORDS ( -- counts )
    "vocab:spelling/big.txt" ascii file-contents words train ;

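A quick check in the listener shows these words in action (the order of entries in the printed hashtable may vary):

( scratchpad ) "The quick. The end." words .
{ "the" "quick" "the" "end" }

( scratchpad ) "The quick. The end." words train .
H{ { "the" 2 } { "quick" 1 } { "end" 1 } }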
Next, we need a word to calculate the possible edits that are one step removed from the original, by:

  • Deleting an extra character.
  • Transposing two adjacent characters.
  • Replacing an incorrect character with a correct one.
  • Inserting a missing character.

Note: this is one place where Python's use of list comprehensions seems to create a much more concise and readable implementation.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts = [a + c + b for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)
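For example, every single-character deletion and adjacent transposition of the original word shows up in the resulting set:

>>> 'bc' in edits1('abc')   # one deletion away
True
>>> 'acb' in edits1('abc')  # one transposition away
True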

One equivalent implementation in Factor works like this:

CONSTANT: ALPHABET "abcdefghijklmnopqrstuvwxyz"

: splits ( word -- sequence )
    [
        dup length [0,b] [ dupd cut 2array , ] each
    ] { } make nip ;

: deletes ( sequence -- sequence' )
    [
        [ first ] [ second 1 short tail ] bi append
    ] map ;

: transposes ( sequence -- sequence' )
    [ second length 1 > ] filter [
        {
            [ first ]
            [ second 1 swap ?nth [ 1string ] [ "" ] if* ]
            [ second 0 swap ?nth [ 1string ] [ "" ] if* ]
            [ second 2 short tail ]
        } cleave append append append
    ] map ;

: replaces ( sequence -- sequence' )
    [ second length 0 > ] filter [
        first2 '[ 1string _ swap _ 1 short tail 3append ]
        ALPHABET swap { } map-as
    ] map concat ;

: inserts ( sequence -- sequence' )
    [
        '[ 1string _ swap join ] ALPHABET swap { } map-as
    ] map concat ;

: edits1 ( word -- edits )
    splits {
        [ deletes ]
        [ transposes ]
        [ replaces ]
        [ inserts ]
    } cleave append append append ;
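Trying the pieces in the listener makes it easier to see what each step produces:

( scratchpad ) "ab" splits .
{ { "" "ab" } { "a" "b" } { "ab" "" } }

( scratchpad ) "ab" splits transposes .
{ "ba" }

( scratchpad ) "ab" edits1 length .
134

Note that, unlike the Python version, this edits1 is not deduplicated into a set, so the count includes repeats; that's harmless, since we only ever take the most frequent known candidate.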

We can then make a word to calculate the strings that are two steps removed:

: edits2 ( word -- edits )
    edits1 [ edits1 ] map concat ;
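This is where the number of candidates explodes. By my count, a word of length n yields 54n + 26 single edits from this implementation (duplicates included), so even our two-letter example produces tens of thousands of double edits:

( scratchpad ) "ab" edits2 length .
22060

Almost none of these are real words, of course.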

And another word for filtering a list of words to only those that are known:

: known-words ( words -- words' )
    NWORDS '[ _ at ] filter ; inline
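Since "spelling" occurs in big.txt (as the example at the end will show), filtering behaves as expected:

( scratchpad ) { "speling" "spelling" } known-words .
{ "spelling" }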

Putting all of this together, we can calculate a list of possible corrections that are the fewest steps away, sorted by frequency of occurrence:

: corrections ( word -- words )
    dup 1array known-words
    [ dup edits1 known-words ] when-empty
    [ dup edits2 known-words ] when-empty
    [ [ NWORDS at 1 or ] bi@ >=< ] sort nip ;

And then a word to choose the most likely correction, or f if no corrections are available:

: correct ( word -- word'/f )
    corrections dup length 0 > [ first ] [ drop f ] if ;

Using it is pretty straightforward:

( scratchpad ) "speling" correct .
"spelling"

The whole solution comes in at 58 lines of code, although it could probably be reduced in various ways.

There are always ways to improve algorithms, and this is no exception. Peter Norvig talks about some of these improvements, including training on a larger body of text, using sentence context to choose from the list of possible corrections, and focusing on performance.

Comments:

  1. I cleaned up your code and added it to extra. Check it out:

    extra/spelling/spelling.factor

  2. I'm getting tempted to start learning Factor... But I already have a running project for learning FORTH... Dilemma ahead ;)