Tuesday, July 13, 2010

Side by Side: Chi-Square Functions

I noticed a blog post from the end of last year that compared Python, Common Lisp, and Clojure implementations of a probability function used in bayesian spam filters.

The main point of the post seems to be exploring the syntax used by the programming languages as well as some commentary on using library functions for simplification. (In this case "simple" appears to mean both understandable and low token count (an attribute descriptively valued in Paul Graham's Arc Challenge).

The original article that inspired the author's post gives a Python function for calculating the probability of observing a value at least as extreme in a chi-square distribution with a given degrees of freedom:

def chi2P(chi, df):
    assert df & 1 == 0
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df//2):
        term *= m / i
        sum += term
    return min(sum, 1.0)

Using a few vocabularies:

USING: kernel math math.functions math.order math.ranges
sequences ;

We can quickly translate the original function into Factor:

: chi2P ( chi df -- result )
    [ even? [ "odd degrees of freedom" throw ] unless ] keep
    [ 2.0 / ] [ 2 /i ] bi* [1,b) [ dupd / ] map
    [ neg exp dup ] dip [ * [ + ] keep ] each drop 1.0 min ;

Written as a single word, and with some of the stack shuffling, its a little hard to grok the function without mentally stepping through the code. Instead, we can make a few small changes:

  • Switch to using vector operations.
  • Extract the "guts" of the chi2P word into an "internal" word.
  • Extract the assertion that degrees of freedom should be even into a new word.

The new code looks like:

: df-check ( df -- )
    even? [ "odd degrees of freedom" throw ] unless ;

: (chi2P) ( chi/2 df/2 -- P )
    [1,b) dupd n/v swap neg exp [ v*n sum ] keep + ;

: chi2P ( chi df -- P )
    dup df-check [ 2.0 / ] [ 2 /i ] bi* (chi2P) 1.0 min ;

To my eye, that looks "better". Code can often be improved by smaller functions, better word or argument names, and some thought to the visual structure of the syntax. This was just a simple refactoring and (as with most things in life) it could be improved further.

No comments: