Sunday, August 28, 2011

Thesaurus

Steve Hanov blogged about building a thesaurus using a "zero load time" file formats. Below, we translate his implementation into Factor.

You can download the 11 MB thesaurus data file we will be using (containing over 100,000 words and their lists of related words). It is implemented as a single file with a custom binary file format that looks like this:

[ header ]
4 bytes: number of words

[ index section ]
# The words are listed in alphabetical order, so you
# can look one up using binary search.
for each word:
    4 byte pointer to word record

[ word section ]
for each word:
   null terminated text
   4 bytes: number of related words
   for each link:
       pointer to linked word record

Build It

The data file consists of 4 byte "pointers" and null-terminated strings. We can build words to read an integer or a string from a particular location in the file:
: read-int ( ptr -- n )
    seek-absolute seek-input 4 read le> ;

: read-string ( ptr -- string )
    seek-absolute seek-input "\0" read-until drop >string ;
The number of words in the thesaurus is at the beginning of the file:
: #words ( -- n ) 0 read-int ;
The position of each word is found by reading the "nth" index:
: word-position ( n -- ptr ) 4 * 4 + read-int ;
The "nth" word is the string found at the specified word position:
: nth-word ( n -- word ) word-position read-string ;
Now for the fun part. Knowing that the index is sorted, we can build a word that performs a binary search for a particular word using the index.
:: find-word ( word -- n )
    #words :> high! -1 :> low! f :> candidate!
    [ high low - 1 > ] [
        high low + 2 /i :> probe
        probe nth-word candidate!
        candidate word <=> {
            { +eq+ [ probe high! probe low! ] }
            { +lt+ [ probe low! ] }
            [ drop probe high! ]
        } case
    ] while candidate word = [ high ] [ f ] if ;
Once we found the word that we are looking for, we can read its related words.
:: find-related ( word -- words )
    word find-word [
        word-position word length + 1 + :> ptr
        ptr read-int :> #related
        ptr #related [1,b] 4 v*n n+v
        [ read-int read-string ] map
    ] [ { } ] if* ;
Putting this all together, we can construct a file reader from the thesaurus file, a convenience word to run a quotation with the thesaurus as its input stream, and our "related words" function.
: <thesaurus-reader> ( -- reader )
    "vocab:thesaurus/thesaurus.dat" binary <file-reader> ;

: with-thesaurus ( quot -- )
    [ <thesaurus-reader> ] dip with-input-stream ; inline

: related-words ( word -- words )
    [ find-related ] with-thesaurus ;

Try It

If it is all working properly, you should be able to lookup the words that are related to any word that is in our thesaurus file.
( scratchpad ) "food" related-words .
{
    "aliment"
    "bread"
    "chow"
    "comestibles"
    "commons"
    "eatables"
    "eats"
    "edibles"
    "feed"
    "foodstuff"
    "foodstuffs"
    "grub"
    "meat"
    "nourishment"
    "nurture"
    "nutriment"
    "pabulum"
    "pap"
    "provender"
    "provisions"
    "rations"
    "scoff"
    "subsistence"
    "sustenance"
    "tuck"
    "viands"
    "victuals"
}
As for performance, it takes just over one millisecond on my laptop to lookup a single word. Not too shabby! The code for this is on my Github.

Wednesday, August 24, 2011

Successor

A few days ago, I wrote about translating the humanize function from the slang.js "string utility" library into Factor. While looking through the other functions defined in that library, I came across the successor function.

The "successor" of a string is defined to be a kind of "alphanum increment". It's easiest to show a few examples of how it works:

successor("a")         == "b"
successor("1")         == "2"
successor("abcd")      == "abce"
successor("THX1138")   == "THX1139"
successor("<<koala>>") == "<<koalb>>"
successor("1999zzz")   == "2000aaa"
successor("ZZZ9999")   == "AAAA0000"

We are going to implement this in Factor, using the slang.js documentation as a guide:

"Returns the successor to str. The successor is calculated by incrementing characters starting from the rightmost alphanumeric (or the rightmost character if there are no alphanumerics) in the string. Incrementing a digit always results in another digit, and incrementing a letter results in another letter of the same case.

If the increment generates a carry, the character to the left of it is incremented. This process repeats until there is no carry, adding an additional character if necessary."

To start, we should handle the "carry" logic. There are two kinds of carries: digits and letters. Both involve checking if a character has exceeded a range (resetting it to the beginning of the range if it has). We can build a word that does this, returning the new character value as well as a boolean flag indicating if a carry occurred:
: carry ( elt last first -- ? elt' )
    '[ _ > dup _ ] keep ? ;
Using this to carry digits is pretty easy (using the 0 to 9 range):
: next-digit ( ch -- ? ch' )
    1 + CHAR: 9 CHAR: 0 carry ;
To carry letters, we need to make sure that the carry preserves the original case (uppercase or lowercase) of the letter:
: next-letter ( ch -- ? ch' )
    [ ch>lower 1 + CHAR: z CHAR: a carry ] [ LETTER? ] bi
    [ ch>upper ] when ;
And, finally, to generalize this to all characters, we check if it is a digit or a letter and dispatch to the proper function, or pass the character through if it is neither:
: next-char ( ch -- ? ch' )
    {
        { [ dup digit?  ] [ next-digit  ] }
        { [ dup Letter? ] [ next-letter ] }
        [ t swap ]
    } cond ;
This leaves the core of the algorithm, which starts at the end of the string, incrementing each character (continuing if the carry flag is true), and then handling the case where we need to carry the first character:
: (successor) ( str -- str' )
    dup length t [ over 0 > dupd and ] [
        drop 1 - dup pick [ next-char ] change-nth
    ] while nip [ dup first prefix ] when ;

: successor ( str -- str' )
    dup empty? [ (successor) ] unless ;
The code for this is on my Github, and compares favorably at 30 lines of code versus the original 50. The (successor) function uses a fair amount of stack shuffling, can you improve it?

Saturday, August 20, 2011

Unique

A few days ago, I noticed this example from the Racket website, for reporting "each unique line from stdin":
;; Report each unique line from stdin
(let ([saw (make-hash)])
  (for ([line (in-lines)])
    (unless (hash-ref saw line #f)
      (displayln line))
    (hash-set! saw line #t)))
We can implement the same functionality in Factor, reading each unique line from an input stream:
: unique-lines ( -- )
    lines members [ print ] each ;
The lines word acts on the "current input stream", so we can use a file reader as an input stream to print out all unique lines in a file:
: unique-file ( path -- )
    utf8 [ unique-lines ] with-file-reader ;
If we wanted to make this print and flush each unique line of input as it is read, we could have used the each-line word to implement it in a line-by-line fashion:
: unique-lines ( -- )
    HS{ } clone [
        dup pick in? [ drop ] [
            [ over adjoin ]
            [ print flush ] bi
        ] if
    ] each-line drop ;

Sunday, August 14, 2011

Human Numbers

I noticed a project on Github called slang.js. It includes a number of "string utility" functions that might be useful to Javascript developers. One that struck me as interesting was the humanize function for turning numbers into "humanized" strings such as "1st, 2nd, 3rd or 4th".
Note: this function is sometimes called "ordinalize" - for example, in the django.contrib.humanize module or the inflector python project.
We are going to translate this original version into Factor:
function humanize(number) {
    if(number % 100 >= 11 && number % 100 <= 13)
        return number + "th";

    switch(number % 10) {
        case 1: return number + "st";
        case 2: return number + "nd";
        case 3: return number + "rd";
    }

    return number + "th";
}
If we keep the same structure (although, without the advantage that early returns can provide), it looks like this:
: humanize ( n -- str )
    dup 100 mod 11 13 between? [ "th" ] [
        dup 10 mod {
            { 1 [ "st" ] }
            { 2 [ "nd" ] }
            { 3 [ "rd" ] }
            [ drop "th" ]
        } case
    ] if [ number>string ] [ append ] bi* ;
And, then build some tests to make sure it works.
[ "1st" ] [ 1 humanize ] unit-test
[ "2nd" ] [ 2 humanize ] unit-test
[ "3rd" ] [ 3 humanize ] unit-test
[ "4th" ] [ 4 humanize ] unit-test
[ "11th" ] [ 11 humanize ] unit-test
[ "12th" ] [ 12 humanize ] unit-test
[ "13th" ] [ 13 humanize ] unit-test
[ "21st" ] [ 21 humanize ] unit-test

Wednesday, August 10, 2011

Printf

The venerable printf function is available in most languages. Used for "formatted printing", it allows you to convert most basic data types to a string. Several years ago, I contributed an implementation for Factor that currently lives in the formatting vocabulary. Using it looks a bit like this:
( scratchpad ) 12 "There are %d monkeys" printf
There are 12 monkeys

Implementation

One of the neat things about this version, is that the format string is parsed and code to format the arguments is generated at compile-time. Below, I've created a simplified version of printf to show how this works.
USING: io io.streams.string kernel macros make math math.parser
peg.ebnf present quotations sequences strings ;
We use the peg.ebnf vocabulary to parse the format string into a sequence of quotations (either strings or format instructions). Each quotation uses the make vocabulary to add these strings to a sequence (to be written out):
EBNF: parse-printf

fmt-%      = "%"   => [[ [ "%" ] ]]
fmt-c      = "c"   => [[ [ 1string ] ]]
fmt-s      = "s"   => [[ [ present ] ]]
fmt-d      = "d"   => [[ [ >integer number>string ] ]]
fmt-f      = "f"   => [[ [ >float number>string ] ]]
fmt-x      = "x"   => [[ [ >hex ] ]]
unknown    = (.)*  => [[ >string throw ]]

strings    = fmt-c|fmt-s
numbers    = fmt-d|fmt-f|fmt-x

formats    = "%"~ (strings|numbers|fmt-%|unknown)

plain-text = (!("%").)+
                   => [[ >string 1quotation ]]

text       = (formats|plain-text)*
                   => [[ [ \ , suffix ] map ]]

;EBNF
You can see the EBNF output by trying it in the listener:
( scratchpad ) "There are %d monkeys" parse-printf .
V{
    [ "There are " , ]
    [ >integer number>string , ]
    [ " monkeys" , ]
}
The "printf" macro takes the parsed output, reverses it (so the elements to be formatted can be passed on the stack in their natural order), applies each format quotation to the elements on the stack, and then writes them back in the original order.
MACRO: printf ( format-string -- )
    parse-printf reverse  [ ] concat-as [
        { } make reverse [ write ] each
    ] curry ;
You can use expand-macros to see the code the macro generates:
( scratchpad ) [ "There are %d monkeys" printf ] expand-macros .
[
    [ " monkeys" , >integer number>string , "There are " , ]
    { } make reverse [ write ] each
]
Implementing sprintf is easy using string streams to capture the output into a string object:
: sprintf ( format-string -- result )
    [ printf ] with-string-writer ; inline

Tests

We can write some unit tests to show that it works:
[ "" ] [ "" sprintf ] unit-test
[ "asdf" ] [ "asdf" sprintf ] unit-test
[ "10" ] [ 10 "%d" sprintf ] unit-test
[ "-10" ] [ -10 "%d" sprintf ] unit-test
[ "ff" ] [ HEX: ff "%x" sprintf ] unit-test
[ "Hello, World!" ] [ "Hello, World!" "%s" sprintf ] unit-test
[ "printf test" ] [ "printf test" sprintf ] unit-test
[ "char a = 'a'" ] [ CHAR: a "char %c = 'a'" sprintf ] unit-test
[ "0 message(s)" ] [ 0 "message" "%d %s(s)" sprintf ] unit-test
[ "10%" ] [ 10 "%d%%" sprintf ] unit-test
[ "[monkey]" ] [ "monkey" "[%s]" sprintf ] unit-test

This implementation doesn't support various format parameters such as width, alignment, padding characters, uppercase/lowercase, decimal digits, or scientific notation. Nor does it support formatting sequences and assocs, like the official version. However, adding those features is straightforward once you understand the basic mechanics.

The code for this is on my Github.

Saturday, August 6, 2011

FizzBuzz

The "new classic" programming test seems to be the FizzBuzz problem. I think it was first proposed in a blog post from 2007.

Now that it has garnered so much awareness (even videos on YouTube!), it's probably not a good interview question anymore. However, now that implementations have been written in many languages, it can be used to learn and compare the syntax of new languages.

Implementation

We can see how it works in Python:
>>> for i in range(1,101):
        if not i % 15:
            print "FizzBuzz"
        elif not i % 3:
            print "Fizz"
        elif not i % 5:
            print "Buzz"
        else:
            print i
A similar version in Clojure:
=>(defn multiple? [n div]
    (= 0 (mod n div)))

=>(doseq [i (range 1 101)]
     (cond (and (multiple? i 3)(multiple? i 5))
             (println "FizzBuzz")
           (multiple? i 3)
             (println "Fizz")
           (multiple? i 5)
             (println "Buzz")
       :else (println i)))
And, finally, a version in Factor:
( scratchpad ) 100 [1,b] [
                   {
                       { [ dup 15 divisor? ] [ drop "FizzBuzz" ] }
                       { [ dup 3  divisor? ] [ drop "Fizz"     ] }
                       { [ dup 5  divisor? ] [ drop "Buzz"     ] }
                       [ present ]
                   } cond print
               ] each

Improvements

Let's see if we can improve the Factor version a bit. First, we can "factor out" the FizzBuzz logic into its own function (showing how it would look if you were to code this directly into the listener):
( scratchpad ) : fizzbuzz ( n -- )
                   {
                       { [ dup 15 divisor? ] [ drop "FizzBuzz" ] }
                       { [ dup 3  divisor? ] [ drop "Fizz"     ] }
                       { [ dup 5  divisor? ] [ drop "Buzz"     ] }
                       [ present ]
                   } cond print ;

( scratchpad ) 100 [1,b] [ fizzbuzz ] each
To avoid all the dup and drop words, we could build a variation of cond that acts a bit like a case. The "cond-case" word was suggested on the Factor mailing list, and this is one variant of it:
MACRO: cond-case ( assoc -- )
    [
        dup callable? not [
            [ first [ dup ] prepose ]
            [ second [ drop ] prepose ] bi 2array
        ] when
    ] map [ cond ] curry ;
Using cond-case, we can improve the original version:
: fizzbuzz ( n -- )
    {
        { [ 15 divisor? ] [ "FizzBuzz" ] }
        { [ 3  divisor? ] [ "Fizz"     ] }
        { [ 5  divisor? ] [ "Buzz"     ] }
        [ present ]
    } cond-case print ;
If we realize that the check for divisible by 15 is the same as checking for divisible by 3 and divisible by 5, we can implement it slightly differently, without a cond or case:
: fizz ( n -- str/f )
    3 divisor? "Fizz" and ;

: buzz ( n -- str/f )
    5 divisor? "Buzz" and ;

: fizzbuzz ( n -- )
    dup [ fizz ] [ buzz ] bi "" append-as
    [ present ] [ nip ] if-empty print ;
Is it better? Can you think of any way to make it simpler? Perhaps by using or inventing some higher-level concepts like we did with cond-case?