Friday, March 26, 2010

How to Write a Spelling Corrector

Several years ago, I came across an article about how to write a simple spelling corrector by Peter Norvig. It occurred to me recently, that he was missing an implementation in Factor, so I thought to contribute one.

Note: I try to keep the structure of this implementation close to his original Python, so that it can be more easily understood, and more easily tested in parts.

The premise behind his spelling correction algorithm is that when a word is mistyped, it is "close" to the actual word intended. For each incorrect word, the algorithm looks for all the correctly spelled words that can be made by one or two "fixes" (e.g., deleting an extra letter, swapping two letters, inserting a missing letter, etc.). This gives you the list of possible corrections.

Now, it wouldn't be Peter Norvig if it didn't have probabilities in it. And sure enough, he introduces a very simple solution for the "most likely correct word", which is to pick from the list of possible corrections, the word that is most commonly used in the given language.

We will first define a vocabulary to place our words, and list the other vocabularies that ours will depend upon:

USING: ascii arrays assocs combinators fry kernel
memoize make math math.order math.ranges io.encodings.ascii
io.files sequences sorting splitting strings ;

IN: spelling

To begin our implementation, we need to generate a list of words and frequencies of occurrence. He compiled a file called big.txt that contains about a million words from various public domain sources, which we will use also.

  1. Read the contents of the big.txt file.
  2. Split the text into words, using lowercase for case-insensitivity.
  3. Count the occurrences of words in the text.

This is accomplished through the following words:

: words ( string -- words )
    >lower [ letter? not ] split-when [ empty? not ] filter ;

: train ( words -- counts )
    H{ } clone [ '[ _ inc-at ] each ] keep ;

MEMO: NWORDS ( -- counts )
    "vocab:spelling/big.txt" ascii file-contents words train ;

Next, we need a word to calculate the possible edits that are one step removed by either:

  • Deleting a necessary character.
  • Transposing two characters.
  • Replacing an incorrect character with a correct one.
  • Insering a missing character.

Note: this is one place where Python's use of list comprehensions seem to create a much more concise and readable implementation.

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts = [a + c + b for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

One equivalent implementation in Factor works like this:

CONSTANT: ALPHABET "abcdefghijklmnopqrstuvwxyz"

: splits ( word -- sequence )
    [
        dup length [0,b] [ dupd cut 2array , ] each
    ] { } make nip ;

: deletes ( sequence -- sequence' )
    [
        [ first ] [ second 1 short tail ] bi append
    ] map ;

: transposes ( sequence -- sequence' )
    [ second length 1 > ] filter [
        {
            [ first ]
            [ second 1 swap ?nth [ 1string ] [ "" ] if* ]
            [ second 0 swap ?nth [ 1string ] [ "" ] if* ]
            [ second 2 short tail ]
        } cleave append append append
    ] map ;

: replaces ( sequence -- sequence' )
    [ second length 0 > ] filter [
        first2 '[ 1string _ swap _ 1 short tail 3append ]
        ALPHABET swap { } map-as
    ] map concat ;

: inserts ( sequence -- sequence' )
    [
        '[ 1string _ swap join ] ALPHABET swap { } map-as
    ] map concat ;

: edits1 ( word -- edits )
    splits {
        [ deletes ]
        [ transposes ]
        [ replaces ]
        [ inserts ]
    } cleave append append append ;

We can then make a word to calculate strings that are of two steps removed:

: edits2 ( word -- edits )
    edits1 [ edits1 ] map concat ;

And another word for filtering a list of words to only those that are known:

: known-words ( words -- words' )
    NWORDS '[ _ at ] filter ; inline

Putting all of this together, we can calculate a list of possible word corrections of the fewest steps away, sorted by the frequency of occurrence:

: corrections ( word -- words )
    dup 1array known-words
    [ dup edits1 known-words ] when-empty
    [ dup edits2 known-words ] when-empty
    [ [ NWORDS at 1 or ] bi@ >=< ] sort nip ;

And then a word to choose the most likely correction, or f if no corrections are available:

: correct ( word -- word'/f )
    corrections dup length 0 > [ first ] [ drop f ] if ;

Using it is pretty straightforward:

( scratchpad ) "speling" correct .
"spelling"

The whole solution comes in at 58 lines of code, although could probably be reduced in various ways.

There are always ways to improve algorithms, and this is no exception. Peter Norvig talks about some of those including a larger set of training text, using sentence context to choose from the possible list of word corrections, and focusing on performance.

Monday, March 8, 2010

Working with CGI: Part 4

In parts 1, 2, and 3, we learned about parsing CGI requests using Factor.

I wanted to take a moment and talk about using templates to develop web sites. Many web developers will suggest that using templates is a good way to separate presentation and content, making it easier for teams to work together to build a website, as well as other benefits.

There is no magic to most template systems. Fundamentally, these systems take some text (usually in a hybrid "template language"), compile it into code, using the compiled template to output a dynamic page.

In Factor, the html.templates vocabulary allows access to several HTML template implementations including Chloe and FHTML. Some differences between the two include:

  • requiring XML (Chloe) or HTML (FHTML)
  • embedding data (Chloe) or Factor code (FHTML)

Depending on your use-case, you might have a preference for one versus the other. If neither suite your purpose, you could even implement your own with a modest amount of work.

Embedding a template into a Factor CGI script is pretty easy. Here is an example of using the FHTML vocabulary to parse a template and then call it to render a page.

#! /path/to/factor

USE: io

"Content-type: text/html\n\n" print

USE: html.templates.fhtml

"""
<% USING: calendar formatting math math.parser io ; %>

<html>
    <head><title>Simple Embedded Factor Example</title></head>
    <body>
        The time is <% now "%c" strftime write %>
        <br>
        <% 5 [ %><p>I like repetition</p><% ] times %>
    </body>
</html>

""" parse-template call( -- )

When run from Factor, or accessed as a CGI script, this will print a page something like the following:

The time is Mon Mar 08 23:12:47 2010.

I like repetition

I like repetition

I like repetition

I like repetition

I like repetition

Many large web applications are built with some kind of template engine, frequently using smart caching of compiled templates for performance, and usually storing the template files separately from the CGI script that loads and calls them.

But, it all basically starts with this.