Wednesday, December 30, 2009

Counting Word Frequencies

One of my favorite memes right now is for people to write small programs in various languages to compare and contrast and learn from one another.

Recently, someone blogged an example of counting word frequencies in a collection of newsgroup articles. The initial implementation was written in Ruby and Scala, with someone else implementing a Clojure solution. These solutions are compared for lines of code as well as time to run.

The basic idea is to iterate over a bunch of files, where each represents a newsgroup posting and is organized by newsgroup into directories. Split each file into words, and then increment a count per word found (comparing case-insensitively).

First, we need a test that is equivalent to the "\w" regular expression. In ASCII, this is essentially a-zA-Z0-9_. We use a short-circuit combinator to break early if one of the tests succeeds.

: \w? ( ch -- ? )
    { [ Letter? ] [ digit? ] [ CHAR: _ = ] } 1|| ; inline

We can then build a word to split a sequence of characters.

: split-words ( seq -- seq' )
    [ \w? not ] split-when harvest ;

This now leaves the main task of counting and aggregating the word counts. The Ruby and Scala examples do this sequentially, but the Clojure example tries to do things in parallel. We are going to keep it simple, and do things sequentially.

: count-words ( path -- assoc )
    f recursive-directory-files H{ } clone [
            ascii file-contents >lower 
            split-words [ _ inc-at ] each
        ] each
    ] keep ;

We need a word to generate the desired output (tab-delimited words and counts).

: print-words ( seq -- )
    [ first2 "%s\t%d\n" printf ] each ;

And another word to do the actual writing of output to files.

: write-count ( assoc -- )
    >alist [
        [ "/tmp/counts-decreasing-factor" ascii ] dip
        '[ _ sort-values reverse print-words ] with-file-writer
    ] [
        [ "/tmp/counts-alphabetical-factor" ascii ] dip
        '[ _ sort-keys print-words ] with-file-writer
    ] bi ;

Unfortunately, performance isn't quite what I was hoping. I tested this on a 2.8 GHz MacBook Pro. Ruby (using 1.8.7) runs in roughly 41 seconds, Factor runs in 20 seconds, and Python (using 2.6.1) runs in 13 seconds.

I was sort of hoping Factor would come in under the typical scripting languages, and I'd love to get feedback on how to improve it.

For reference, the Python version that I wrote is:

import os
import re
import time
from collections import defaultdict
from operator import itemgetter

root = '/tmp/20_newsgroups'
#root = '/tmp/mini_newsgroups'

t0 = time.time()

counts = defaultdict(int)

for dirpath, dirname, filenames in os.walk(root):
    for filename in filenames:
        f = open(os.path.join(dirpath, filename))
        for word in re.findall('\w+',
            counts[word.lower()] += 1

print "Writing counts in decreasing order"
f = open('counts-decreasing-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(1), reverse=True):
    print >> f, '%s\t%d' % (k, v)

print "Writing counts in decreasing order"
f = open('counts-alphabetical-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(0)):
    print >> f, '%s\t%d' % (k, v)

print 'Finished in %s seconds' % (time.time() - t0)


middayc said...

Interesting. Did you maybe figure out what's the main perf. bottleneck. I saw the page where solutions in 7 languages are compared and python is the fastest ( also compared to java/scala) which seems odd. I hope Slava will tell his thoughts on this :)

mrjbq7 said...

I suspect the poor performance is largely related to copying memory too frequently, particularly for operations that could act in-place, but instead act on a copy of the data.

I'm not quite sure what profiling tools exist in Factor besides inspecting the code that is generated, and timing it.