Steve Hanov blogged about building a thesaurus using a "zero load time" file formats. Below, we translate his implementation into Factor.
You can download the 11 MB thesaurus data file we will be using (containing over 100,000 words and their lists of related words). It is implemented as a single file with a custom binary file format that looks like this:
[ header ] 4 bytes: number of words [ index section ] # The words are listed in alphabetical order, so you # can look one up using binary search. for each word: 4 byte pointer to word record [ word section ] for each word: null terminated text 4 bytes: number of related words for each link: pointer to linked word record
Build It
The data file consists of 4 byte "pointers" and null-terminated strings. We can build words to read an integer or a string from a particular location in the file:The number of words in the thesaurus is at the beginning of the file:: read-int ( ptr -- n ) seek-absolute seek-input 4 read le> ; : read-string ( ptr -- string ) seek-absolute seek-input "\0" read-until drop >string ;
The position of each word is found by reading the "nth" index:: #words ( -- n ) 0 read-int ;
The "nth" word is the string found at the specified word position:: word-position ( n -- ptr ) 4 * 4 + read-int ;
Now for the fun part. Knowing that the index is sorted, we can build a word that performs a binary search for a particular word using the index.: nth-word ( n -- word ) word-position read-string ;
Once we found the word that we are looking for, we can read its related words.:: find-word ( word -- n ) #words :> high! -1 :> low! f :> candidate! [ high low - 1 > ] [ high low + 2 /i :> probe probe nth-word candidate! candidate word <=> { { +eq+ [ probe high! probe low! ] } { +lt+ [ probe low! ] } [ drop probe high! ] } case ] while candidate word = [ high ] [ f ] if ;
Putting this all together, we can construct a file reader from the thesaurus file, a convenience word to run a quotation with the thesaurus as its input stream, and our "related words" function.:: find-related ( word -- words ) word find-word [ word-position word length + 1 + :> ptr ptr read-int :> #related ptr #related [1,b] 4 v*n n+v [ read-int read-string ] map ] [ { } ] if* ;
: <thesaurus-reader> ( -- reader ) "vocab:thesaurus/thesaurus.dat" binary <file-reader> ; : with-thesaurus ( quot -- ) [ <thesaurus-reader> ] dip with-input-stream ; inline : related-words ( word -- words ) [ find-related ] with-thesaurus ;
Try It
If it is all working properly, you should be able to lookup the words that are related to any word that is in our thesaurus file.As for performance, it takes just over one millisecond on my laptop to lookup a single word. Not too shabby! The code for this is on my Github.( scratchpad ) "food" related-words . { "aliment" "bread" "chow" "comestibles" "commons" "eatables" "eats" "edibles" "feed" "foodstuff" "foodstuffs" "grub" "meat" "nourishment" "nurture" "nutriment" "pabulum" "pap" "provender" "provisions" "rations" "scoff" "subsistence" "sustenance" "tuck" "viands" "victuals" }
Seems like a nice solution. FWIW, my old Aiksaurus program had a somewhat similar approach, navigating its two data files using fseek to avoid loading everything into memory. Of course, its data files were smaller than what you're dealing with here (about 500 KB together), but at the time this seemed like too much overhead. Ah, how things have come along. :)
ReplyDelete