Wednesday, June 29, 2011

TF-IDF

A few days ago, someone implemented a concise TF-IDF ("term frequency - inverse document frequency") search engine in Scala. That was followed by a similarly concise version in Clojure. I thought I'd contribute a simple (and hopefully also concise) implementation using Factor.

USING: accessors arrays assocs combinators.short-circuit fry
io.encodings.utf8 io.files kernel math math.functions
math.statistics memoize sequences sets sorting splitting
unicode.case unicode.categories ;

IN: tf-idf

Tokenize

Since our inputs are going to be text files, our program will need to:

  1. Read the file into a string.
  2. Split the string into words.
  3. Eliminate common (or "stop") words.

First, let's build a word to split our text on any character that is neither a letter nor a number.

: split-words ( string -- words )
    [ { [ Letter? ] [ digit? ] } 1|| not ] split-when harvest ;
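
For example (the input is split, but not yet lowercased):

"Hello, world... 42!" split-words .
{ "Hello" "world" "42" }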

We load a list of "stop words" that are to be ignored. These are typically commonly occurring words such as "and, in, or, of, the, is".

MEMO: stopwords ( -- words )
    "vocab:tf-idf/stopwords.txt" utf8 file-lines fast-set ;

We can then tokenize a piece of text, removing all of the "stop words".

: tokenize ( string -- words )
    >lower split-words [ stopwords in? not ] filter ;
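
For example, assuming "the" and "over" appear in the stop word list:

"The quick brown fox jumps over the lazy dog" tokenize .
{ "quick" "brown" "fox" "jumps" "lazy" "dog" }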

And, finally, we can create a word to tokenize a series of files into an assoc mapping each path to its words.

: tokenize-files ( paths -- assoc )
    [ dup utf8 file-contents tokenize ] H{ } map>assoc ;
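
For example, given two hypothetical files whose contents tokenize as shown:

{ "a.txt" "b.txt" } tokenize-files .
H{
    { "a.txt" { "apple" "banana" "apple" } }
    { "b.txt" { "banana" "cherry" } }
}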

Index

To implement our search engine, we need to build an index that maps each word to a list of (path, count) pairs.

: index1 ( path words -- path index )
    histogram [ pick swap 2array ] assoc-map ;

: index-all ( assoc -- index )
    [ index1 ] assoc-map values assoc-merge ;
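
Here index1 builds the per-document index and index-all merges them. Continuing with the hypothetical files from above (hashtable ordering may vary):

H{
    { "a.txt" { "apple" "banana" "apple" } }
    { "b.txt" { "banana" "cherry" } }
} index-all .
H{
    { "apple" V{ { "a.txt" 2 } } }
    { "banana" V{ { "a.txt" 1 } { "b.txt" 1 } } }
    { "cherry" V{ { "b.txt" 1 } } }
}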

The tokenized files and our index will form the basis for a "database":

TUPLE: db docs index ;

: <db> ( paths -- db )
    tokenize-files dup index-all db boa ;

TF-IDF

The "inverse document frequency" is calculated by the log of the total number of documents divided by number of documents where a particular word appears.

: idf ( term db -- idf )
    [ nip docs>> ] [ index>> at ] 2bi [ assoc-size ] bi@ / log ;

Using this, we can compute "TF-IDF" scores by multiplying the number of times a term appears in each document by the previously calculated "inverse document frequency":

: tf-idf ( term db -- scores )
    [ index>> at ] [ idf ] 2bi '[ _ * ] assoc-map ;
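
Assuming db holds the result of { "a.txt" "b.txt" } <db> over the hypothetical files above, "apple" occurs twice in one of the two documents, so its score there is 2 * log(2/1) ≈ 1.386:

"apple" db tf-idf .
V{ { "a.txt" 1.3862943611198906 } }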

Search

Our search engine is now just a matter of:

  • Splitting an input query into terms.
  • Calculating the "TF-IDF" score for each term.
  • Normalizing the scores across documents.
  • Sorting the scores from highest to lowest.

: scores ( query db -- scores )
    [ split-words ] dip '[ _ tf-idf ] map assoc-merge ;

: (normalize) ( path db -- value )
    [ docs>> at ] keep '[ _ idf 2 ^ ] map-sum sqrt ;

: normalize ( scores db -- scores' )
    '[ sum over _ (normalize) / ] assoc-map ;

: search ( query db -- scores )
    [ scores ] keep normalize sort-values reverse ;
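
Putting it all together with the same hypothetical database (score shown rounded):

"apple" db search .
{ { "a.txt" 1.414 } }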

Notes

The implementation above uses an assoc-merge utility word that merges a sequence of assocs, collecting for each key a list of all its values.

: (assoc-merge) ( assoc1 assoc2 -- assoc1 )
    over [ push-at ] with-assoc assoc-each ;

: assoc-merge ( seq -- merge )
    H{ } clone [ (assoc-merge) ] reduce ;
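
For example (hashtable ordering may vary):

{ H{ { "a" 1 } } H{ { "a" 2 } { "b" 3 } } } assoc-merge .
H{ { "a" V{ 1 2 } } { "b" V{ 3 } } }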

Also, searching for words that don't exist in the corpus produces a "divide by zero" error. Making it more robust requires either catching and ignoring that error or "adding 1" to the denominator of the "IDF" formula, as sketched below.
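
A minimal sketch of the "add 1" variant (a tweak to the idf word above, not part of the original code) simply bumps the denominator:

! "Add 1" smoothing: a term that appears in no documents now
! yields log(N/1) instead of a divide by zero error.
: idf ( term db -- idf )
    [ nip docs>> ] [ index>> at ] 2bi
    [ assoc-size ] bi@ 1 + / log ;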

The code for this is on my GitHub.

1 comment:

  1. Nice! Surprisingly, I find your Factor solution more concise than the Scala/Clojure ones :)
