Wednesday, December 30, 2009

Counting Word Frequencies

One of my favorite memes right now is for people to write small programs in various languages to compare and contrast and learn from one another.

Recently, someone blogged an example of counting word frequencies in a collection of newsgroup articles. The initial implementation was written in Ruby and Scala, with someone else implementing a Clojure solution. These solutions are compared for lines of code as well as time to run.

The basic idea is to iterate over a bunch of files, where each represents a newsgroup posting and is organized by newsgroup into directories. Split each file into words, and then increment a count per word found (comparing case-insensitively).

First, we need a test that is equivalent to the "\w" regular expression. In ASCII, this is essentially a-zA-Z0-9_. We use a short-circuit combinator to break early if one of the tests succeeds.

: \w? ( ch -- ? )
    { [ Letter? ] [ digit? ] [ CHAR: _ = ] } 1|| ; inline

We can then build a word to split a sequence of characters.

: split-words ( seq -- seq' )
    [ \w? not ] split-when harvest ;

This now leaves the main task of counting and aggregating the word counts. The Ruby and Scala examples do this sequentially, but the Clojure example tries to do things in parallel. We are going to keep it simple, and do things sequentially.

: count-words ( path -- assoc )
    f recursive-directory-files H{ } clone [
        '[
            ascii file-contents >lower 
            split-words [ _ inc-at ] each
        ] each
    ] keep ;

We need a word to generate the desired output (tab-delimited words and counts).

: print-words ( seq -- )
    [ first2 "%s\t%d\n" printf ] each ;

And another word to do the actual writing of output to files.

: write-count ( assoc -- )
    >alist [
        [ "/tmp/counts-decreasing-factor" ascii ] dip
        '[ _ sort-values reverse print-words ] with-file-writer
    ] [
        [ "/tmp/counts-alphabetical-factor" ascii ] dip
        '[ _ sort-keys print-words ] with-file-writer
    ] bi ;

Unfortunately, performance isn't quite what I was hoping. I tested this on a 2.8 GHz MacBook Pro. Ruby (using 1.8.7) runs in roughly 41 seconds, Factor runs in 20 seconds, and Python (using 2.6.1) runs in 13 seconds.

I was sort of hoping Factor would come in under the typical scripting languages, and I'd love to get feedback on how to improve it.

For reference, the Python version that I wrote is:

import os
import re
import time
from collections import defaultdict
from operator import itemgetter

root = '/tmp/20_newsgroups'
#root = '/tmp/mini_newsgroups'

t0 = time.time()

counts = defaultdict(int)

for dirpath, dirname, filenames in os.walk(root):
    for filename in filenames:
        f = open(os.path.join(dirpath, filename))
        for word in re.findall('\w+', f.read()):
            counts[word.lower()] += 1
        f.close()

print "Writing counts in decreasing order"
f = open('counts-decreasing-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(1), reverse=True):
    print >> f, '%s\t%d' % (k, v)
f.close()

print "Writing counts in decreasing order"
f = open('counts-alphabetical-python', 'w')
for k, v in sorted(counts.items(), key=itemgetter(0)):
    print >> f, '%s\t%d' % (k, v)
f.close()

print 'Finished in %s seconds' % (time.time() - t0)

Tuesday, December 29, 2009

Recursively Listing Files

Update: It was pointed out to me that the recursive-directory-files word in io.directories.search solves this problem. Good to know, and the below article can be thought of as a learning exercise! :)

Factor has several vocabularies for interacting with the filesystem, and quite a lot of work has been done on making these useful. Some interesting blog posts from early 2009 include:

One feature that I haven't found yet, is a word to simply (and recursively) list files within a directory. This is often useful with the intent of processing some (or all) of the files in some way.

This feature exists in other language standard libraries with different names. In Python, this is called os.walk(). In Perl, this is called File::Find. Usually, even if it doesn't exist, it can be built relatively easily.

I'm going to show you how to create a word to do that with Factor.

Many words within the standard library are implemented by a public word that defers to a private word for part of the functionality. In this case, I want to separate the "setup" functionality of the word, from the "work" functionality.

Let's define a word list-files that will recursively list all the files within a directory. First, we need to create a growable sequence (in this case a vector) to hold the result, and then we want to call a word that will be used to do the actual work.

: list-files ( path -- seq )
    [ V{ } clone ] dip (list-files) ;

For each path that we process, we will want to handle differently depending on what type of file the path points to:

  1. If a symbolic link, we want to read the link and recurse into it.
  2. If a directory, we want to recurse into each of the files within it.
  3. If a regular file, we want to add it to the list of files.

Using some of the words in io.directories, io.files, io.files.info, io.files.links, and io.files.types, we can build such a function:

: (list-files) ( seq path -- seq )
    normalize-path dup link-info type>> {
        { +symbolic-link+ [ read-link (list-files) ] }
        { +directory+ [
            [ directory-files ] keep
            '[ normalize-path _ prepend (list-files) ] each ] }
        { +regular-file+ [ over push ] }
        [ "unsupported" throw ]
    } case ;

We can setup a directory structure like so:

$ tree -f /tmp/foo
/tmp/foo
|-- /tmp/foo/bar
`-- /tmp/foo/baz
    `-- /tmp/foo/baz/foo

1 directory, 2 files

And then use our new function from Factor:

( scratchpad ) "/tmp/foo" list-files .
V{ "/tmp/foo/bar" "/tmp/foo/baz/foo" }

This could be improved further by handling file permissions issues, infinite recursion, and lazily generating the list of files (for better performance with large directory trees).

Sunday, December 27, 2009

Generating Text in Factor

Some months back, I came across a few blog posts about generating text using algorithms. A simple algorithm was implemented in Clojure and Haskell.

Basically, the idea is to:

  1. Read in a text document.
  2. Count the frequency of word pairs in the document.
  3. Pick a starting word.
  4. Generate random text, using the word pair probabilities.

I wanted to see what a simple Factor implementation would look like, and thought I would share one below.

First, create a list of words from some lines of text. In Factor, it is easy to write a processing word that can then be used from standard input, files, or other input streams.

: split-words ( -- sequence )
    V{ } clone [
        "\r\n\t.,\" " split [ over push ] each
    ] each-line harvest ;

Next, create a sequence of all word pairs (including the pair linking the tail to the head of the list) from the input sequence.

: word-pairs ( sequence -- sequence' )
    dup 1 head-slice append
    dup 1 tail-slice zip ;

Next, we need a map of word pairs, for random sampling by frequency of occurrence. I have chosen to use words in the text as keys, and have the values be a sequence of all words that ever followed it in the text.

: word-map ( sequence -- assoc )
    [ H{ } clone ] dip word-pairs [
        [ second ] [ first ] bi pick push-at
    ] each ;

This makes sampling words with the proper probability quite simple:

: next-word ( word assoc -- word' )
    at random ;

We can now generate paragraphs of random text with the "feel" of the original document.

: wordgen-from ( start n -- string )
    [ [ 1vector ] keep split-words word-map ] dip
    [ [ next-word dup pick push ] keep ] times
    2drop " " join ;

And for convenience, as in the original blog posts, we can start generating from a common English word.

: wordgen ( n -- string )
    "the" swap wordgen-from ;

Putting all of this together, we can make 250 words from Dracula:

( scratchpad ) "/tmp/345.txt" ascii [ 250 wordgen ] with-file-reader .
"the bank notes of electronic works based on the same time to you are fading and it came I too much Without taking a quick he had struck by our eyes and she was not asleep twice when she sank down again There were striving to think yet when the Huns This to get to her all at her on it really Lucy's father Goodbye my own time when she look at the facts before them away and his diary which is out Look! Look! Look! The bright as ever since then we had said Dr Van Helsing will be a chair with you said Yes! And it was a pistol shot up to write for a woman can love and added Friend John and pain nor a long enough You were so seeing it high tide altogether I awoke She is confined within a room lit by the old sweet that the temptation of lion-like disdain His cries are great love must go there would tear my eyes the colour of him and quiet; but still Renfield sitting on one ready Madam Mina Harker had retired early Lucy and there ran downstairs then we get out his fair accuracy when I found that my dear do our visit to oppose him can look had a measure of the other since I secured and seemed to tell upon your all our furs and the face I put in his eyes never be This afternoon she was that such horrors when the"

Not quite Bram Stoker, but just enough to taste the flavor. More complex algorithms exist and can be used for fun and profit. Similar ideas can even be applied to other areas, such as music.