A few days ago, someone implemented a concise TF-IDF ("term frequency - inverse document frequency") search engine in Scala. That was followed by a similarly concise version in Clojure. I thought I would contribute a simple (and hopefully also concise) implementation using Factor.
USING: accessors arrays assocs combinators.short-circuit fry io.encodings.utf8 io.files kernel math math.functions math.statistics memoize sequences sets sorting splitting unicode.case unicode.categories ; IN: tf-idf
Tokenize
Since our inputs are going to be text files, our program will need to:
- Read the file into a string.
- Split the string into words.
- Eliminate common (or "stop") words.
First, let's build a word to split our text on any character that is not a letter or a number.
: split-words ( string -- words ) [ { [ Letter? ] [ digit? ] } 1|| not ] split-when harvest ;
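For example, splitting a short string in the listener keeps only the runs of letters and digits (the input here is just an illustrative string):

IN: scratchpad "Hello, World... 42!" split-words .
{ "Hello" "World" "42" }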
We load a list of "stop words" that are to be ignored. These are typically common words such as "and, in, or, of, the, is".
MEMO: stopwords ( -- words ) "vocab:tf-idf/stopwords.txt" utf8 file-lines fast-set ;
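Assuming a common word like "the" appears in stopwords.txt, we can check membership in the set like so:

IN: scratchpad "the" stopwords in? .
t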
We can then tokenize a piece of text, removing all of the "stop words".
: tokenize ( string -- words ) >lower split-words [ stopwords in? not ] filter ;
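Assuming words like "the" and "over" are in the stop words list, tokenizing a sentence might look like this:

IN: scratchpad "The quick brown fox jumps over the lazy dog" tokenize .
{ "quick" "brown" "fox" "jumps" "lazy" "dog" }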
And, finally, we can create a word to tokenize a series of files into an assoc mapping path to words.
: tokenize-files ( paths -- assoc ) [ dup utf8 file-contents tokenize ] H{ } map>assoc ;
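Suppose, purely for illustration, that a.txt contains "Apples and bananas, apples!" and b.txt contains "Bananas and cherries." Assuming "and" is a stop word, the result is roughly (key order may vary):

IN: scratchpad { "a.txt" "b.txt" } tokenize-files .
H{
    { "a.txt" { "apples" "bananas" "apples" } }
    { "b.txt" { "bananas" "cherries" } }
}

The examples below reuse these two hypothetical files.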
Index
To implement our search engine, we need to build an index that maps each word to a list of (path, count) pairs.
: index1 ( path words -- path index ) histogram [ pick swap 2array ] assoc-map ;
: index-all ( assoc -- index ) [ index1 ] assoc-map values assoc-merge ;
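Feeding the two hypothetical files from above through index-all collects, for each word, its (path, count) pairs (the exact sequence types and ordering may vary):

IN: scratchpad { "a.txt" "b.txt" } tokenize-files index-all .
H{
    { "apples" V{ { "a.txt" 2 } } }
    { "bananas" V{ { "a.txt" 1 } { "b.txt" 1 } } }
    { "cherries" V{ { "b.txt" 1 } } }
}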
The tokenized files and our index will form the basis for a "database":
TUPLE: db docs index ;
: <db> ( paths -- db ) tokenize-files dup index-all db boa ;
TF-IDF
The "inverse document frequency" is calculated by the log of the total number of documents divided by number of documents where a particular word appears.
: idf ( term db -- idf ) [ nip docs>> ] [ index>> at ] 2bi [ assoc-size ] bi@ / log ;
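With the two hypothetical files from above, "cherries" appears in one of the two documents, so its IDF is log(2/1) ≈ 0.693 (Factor's log is the natural logarithm), while "bananas" appears in both documents and gets an IDF of log(2/2) = 0, contributing nothing to a score.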
Using this, we can compute "TF-IDF" scores by multiplying the number of times a term appears in each document by the previously calculated "inverse document frequency":
: tf-idf ( term db -- scores ) [ index>> at ] [ idf ] 2bi '[ _ * ] assoc-map ;
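Continuing the hypothetical example, "cherries" occurs once in b.txt and nowhere else, so its scores assoc would be roughly { { "b.txt" 0.693 } }, while every entry for "bananas" would be zero.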
Search
Our search engine is now just a matter of:
- Splitting an input query into terms.
- Calculating the "TF-IDF" score for each term.
- Normalizing the scores across documents.
- Sorting the scores from highest to lowest.
: scores ( query db -- scores ) [ split-words ] dip '[ _ tf-idf ] map assoc-merge ;
: (normalize) ( path db -- value ) [ docs>> at ] keep '[ _ idf 2 ^ ] map-sum sqrt ;
: normalize ( scores db -- scores' ) '[ sum over _ (normalize) / ] assoc-map ;
: search ( query db -- scores ) [ scores ] keep normalize sort-values reverse ;
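Putting it all together with the two hypothetical files (scores rounded; the exact floats may differ slightly):

IN: scratchpad "cherries bananas" { "a.txt" "b.txt" } <db> search .
{ { "b.txt" 1.0 } { "a.txt" 0.0 } }

b.txt ranks highest because it is the only document containing "cherries", while "bananas" appears in every document, so its IDF of zero does not affect the ranking.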
Notes
The implementation above uses an assoc-merge utility word that performs a "union", collecting a list of all of the values for each key:

: (assoc-merge) ( assoc1 assoc2 -- assoc1 ) over [ push-at ] with-assoc assoc-each ;
: assoc-merge ( seq -- merge ) H{ } clone [ (assoc-merge) ] reduce ;

Also, searching for words that don't exist produces a "divide by zero" error. Making it more robust requires some changes to either catch and ignore that error or to "add 1" to the denominator of the "IDF" formula.
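As a rough sketch of that second option (not part of the original implementation; the name smoothed-idf is made up here, and it assumes an unknown term looks up to an empty assoc), adding 1 to the per-term document count keeps the division finite:

! Hypothetical variant: add 1 to the per-term document count so unknown terms get a finite IDF.
: smoothed-idf ( term db -- idf ) [ nip docs>> assoc-size ] [ index>> at assoc-size 1 + ] 2bi / log ;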
The code for this is on my GitHub.