Sunday, July 25, 2010

New Combinatoric Functions

Combinatorics can provide some useful functions when working with sequences. In Factor, these are mostly defined in the math.combinatorics vocabulary.

USE: math.combinatorics

Inspired by some functions from clojure.contrib, I recently contributed two additional combinatoric words to the Factor project (although not with the same lazy semantics that the Clojure version has).

all-subsets:

The first word, all-subsets, returns all subsets of a given sequence. This can be calculated by iteratively taking n combinations of items from the sequence, where n goes from 0 (giving the empty set) to the length of the sequence (giving the full sequence itself).

First, we observe how this works by experimenting with the all-combinations word:

( scratchpad ) { 1 2 } 0 all-combinations .
{ { } }

( scratchpad ) { 1 2 } 1 all-combinations .
{ { 1 } { 2 } }

( scratchpad ) { 1 2 } 2 all-combinations .
{ { 1 2 } }

By running it with various n, we have produced all of the subsets of the { 1 2 } sequence. Using a [0,b] range (from 0 to the length of the sequence), we make a sequence of subsets:

: all-subsets ( seq -- subsets )
    dup length [0,b] [
        [ dupd all-combinations [ , ] each ] each
    ] { } make nip ;

The all-subsets word can then be demonstrated by:

( scratchpad ) { 1 2 3 } all-subsets .
{ { } { 1 } { 2 } { 3 } { 1 2 } { 1 3 } { 2 3 } { 1 2 3 } }
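
As a quick check, a sequence of length n should give us 2^n subsets:

( scratchpad ) { 1 2 3 } all-subsets length .
8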

selections:

Another useful function, selections, returns all the ways of taking n elements from a sequence, where the same element may be taken more than once.

First, we observe that there are two base cases:

  1. If we want all ways of taking 0 elements from the sequence, we have only { } (the empty sequence).
  2. If we want all ways of taking 1 element from the sequence, we essentially have a sequence for each element in the input sequence.

If we want to take more elements from the sequence, we need to apply the cartesian-product word (which returns all possible pairs of elements from two sequences) n-1 times. For example, to see all possible selections of 2 elements from a sequence, we run cartesian-product once:

( scratchpad ) { 1 2 3 } dup cartesian-product concat .
{
    { 1 1 }
    { 1 2 }
    { 1 3 }
    { 2 1 }
    { 2 2 }
    { 2 3 }
    { 3 1 }
    { 3 2 }
    { 3 3 }
}

Using these observations, we can build the selections word:

: (selections) ( seq n -- selections )
    dupd [ dup 1 > ] [
        swap pick cartesian-product [
            [ [ dup length 1 > [ flatten ] when , ] each ] each
        ] { } make swap 1 -
    ] while drop nip ;

: selections ( seq n -- selections )
    {
        { 0 [ drop { } ] }
        { 1 [ 1array ] }
        [ (selections) ]
    } case ;

This can be demonstrated by:

( scratchpad ) { 1 2 } 2 selections .
{ { 1 1 } { 1 2 } { 2 1 } { 2 2 } }

Note: we have defined this to take element order into account, so { 1 2 } and { 2 1 } are different possible results. Also, it could be argued that the result for { 1 2 3 } 1 selections should be { 1 2 3 } [ 1array ] map -- perhaps it should change to that in the future.
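
To make that concrete, here is what this first definition returns for the n=1 case, next to the suggested alternative:

( scratchpad ) { 1 2 3 } 1 selections .
{ { 1 2 3 } }

( scratchpad ) { 1 2 3 } [ 1array ] map .
{ { 1 } { 2 } { 3 } }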

This was committed to the main repository recently.

Update: A comment by Jon Harper showed me a way to improve all-subsets. Based on that, I also made some changes to selections (perhaps it could be improved even more):

: all-subsets ( seq -- subsets )
    dup length [0,b] [ all-combinations ] with map concat ;

: (selections) ( seq n -- selections )
    [ [ 1array ] map dup ] [ 1 - ] bi* [
        cartesian-product concat [ { } concat-as ] map
    ] with times ;

: selections ( seq n -- selections )
    dup 0 > [ (selections) ] [ 2drop { } ] if ;
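
As a quick sanity check, the updated definitions should produce the same results as before:

( scratchpad ) { 1 2 3 } all-subsets .
{ { } { 1 } { 2 } { 3 } { 1 2 } { 1 3 } { 2 3 } { 1 2 3 } }

( scratchpad ) { 1 2 } 2 selections .
{ { 1 1 } { 1 2 } { 2 1 } { 2 2 } }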

Friday, July 16, 2010

Parsing Configuration Files

The other day I needed a parser for INI-style configuration files. When I couldn't find a convenient Factor vocabulary to do this, I decided to write one.

A basic configuration file could look like this:

[owner]
name=John Doe
e-mail=john.doe@example.com

[database]
host=127.0.0.1  # change to production when ready
port=1234
username=test
password="a really long string"

These configurations are essentially groups of name/value pairs, and can be naturally expressed as an assoc. We will be implementing a simple API for reading and writing:

: read-ini ( -- assoc )

: write-ini ( assoc -- )

: string>ini ( str -- assoc )

: ini>string ( assoc -- str )

This implementation uses these vocabularies:

USING: arrays assocs combinators formatting hashtables io
io.streams.string kernel make math sequences strings
strings.parser ;

Some utility words are used to trim whitespace from tokens, extract the name from section headers (e.g., "[database]"), and remove comments from lines:

: unspace ( str -- str' )
    [ " \t\n\r" member? ] trim ;

: unwrap ( str -- str' )
    1 swap [ length 1 - ] keep subseq ;

: uncomment ( str -- str' )
    CHAR: # over index [ head ] when* ;

There are a variety of parsing strategies we could use here. To keep things simple, we will be parsing the configuration file line-by-line. Also, we will make the assumption that each line contains either a "[section]" or a "name=value" (but not both).

We know a line is a section if it starts with '[' and ends with ']':

: section? ( line -- ? )
    [ first CHAR: [ = ] [ last CHAR: ] = ] bi and ;

The current section is parsed and stored as a two-element array containing the name of the section and a vector of name/value pairs:

: [section] ( line -- section )
    unwrap unspace V{ } clone 2array ;

Each name/value is parsed and added to the vector of name/value pairs in the current section:

: name=value ( section line -- section' )
    CHAR: = over index cut rest [ unspace ] bi@
    2array over second push ;

We will be using the make words. When we encounter a new section, or the end of the file, we will append the current section to the sequence of sections being built by make:

: section, ( section/f -- )
    [ first2 >hashtable 2array , ] when* ;

: parse-line ( section line -- section' )
    uncomment unspace [
        dup section?
        [ swap section, [section] ] [ name=value ] if
    ] unless-empty ;

: read-ini ( -- assoc )
    [
        f [ parse-line ] each-line section,
    ] { } make >hashtable ;

Implementing write-ini is pretty easy. It's just a matter of iterating over all values in the specified assoc, and printing them out with some minor structure:

: write-ini ( assoc -- )
    [
        [ "[%s]\n" printf ] dip
        [ "%s=%s\n" printf ] assoc-each
        nl
    ] assoc-each ;

The string>ini and ini>string words are easy too. Both the read-ini and write-ini words operate on input and output streams, so we can use string streams:

: string>ini ( str -- assoc )
    [ read-ini ] with-string-reader ;

: ini>string ( assoc -- str )
    [ write-ini ] with-string-writer ;
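
As a quick check, parsing a small configuration string gives us nested hashtables (a hypothetical session; the exact prettyprinting may differ):

( scratchpad ) "[owner]\nname=John Doe" string>ini .
H{ { "owner" H{ { "name" "John Doe" } } } }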

This was a really simple implementation. In addition to the basics, I wanted to be able to support:

  • Embedded escape characters (e.g., "\t", "\n", etc.).
  • Line continuations (e.g., multi-line values).
  • Java .properties files.
  • Liberal parsing of minor formatting errors.
  • Both '#' and ';' comment characters.
  • Quoted strings (e.g., name="value").

You can find all that and more (along with tests and some documentation) on my Github. I hope to contribute it to the main repository soon.

Wednesday, July 14, 2010

Flipping text upside-down

Over the last few years, there has been a recurring meme to flip text upside-down. You can find posts about it, online text-flipping services, and programming libraries (e.g., for Perl and Emacs).

All of these implementations seem to work the same way: find Unicode characters that resemble upside-down versions of the Latin alphabet and create a mapping that is used to turn text "upside-down". Since Factor has strong support for Unicode, I thought it could use a library to flip strings.

What it should look like when we are done:

( scratchpad ) "abcdefghijklmnopqrstuvwxyz1234567890" flip-text .
"068Ɫ95ᔭƐᄅ⇂zʎxʍʌnʇsɹbdouɯʃʞɾᴉɥᵷɟǝpɔqɐ"

As is typical, we list our dependencies and create a namespace to hold the new functions:

USING: assocs kernel sequences ;

IN: flip-text

Next, we will create the character mapping (simply hard-coding the character lookups):

CONSTANT: CHARS H{
    { CHAR: a   HEX: 0250 }
    { CHAR: b   CHAR: q   }
    { CHAR: c   HEX: 0254 }
    { CHAR: d   CHAR: p   }
    { CHAR: e   HEX: 01DD }
    { CHAR: f   HEX: 025F }
    ...

And then, since it is useful to make the flip-text word reversible (i.e., applying it twice returns the original value), we will update the mapping with the reverse entries:

CHARS [ CHARS set-at ] assoc-each

Mapping a single character is pretty straightforward (making sure to pass the original character through if a mapping isn't found):

: ch>flip ( ch -- ch' )
    dup CHARS at [ nip ] when* ;
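
For example, using the partial table above (CHAR: a is 97 and HEX: 0250 is 592):

( scratchpad ) CHAR: a ch>flip .
592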

And then flipping a string of text is just a matter of flipping each character and reversing the string:

: flip-text ( str -- str' )
    [ ch>flip ] map reverse ;
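
Because the reverse entries were added to the table, flipping twice should round-trip back to the original string (the tail end of the earlier example confirms the mappings for a, b, and c):

( scratchpad ) "abc" flip-text .
"ɔqɐ"

( scratchpad ) "abc" flip-text flip-text .
"abc"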

The complete implementation for this is on my Github account.

Tuesday, July 13, 2010

Side by Side: Chi-Square Functions

I noticed a blog post from the end of last year that compared Python, Common Lisp, and Clojure implementations of a probability function used in Bayesian spam filters.

The main point of the post seems to be exploring the syntax used by each programming language, along with some commentary on using library functions for simplification. In this case, "simple" appears to mean both understandable and low in token count (an attribute notably valued in Paul Graham's Arc Challenge).

The original article that inspired the author's post gives a Python function for calculating the probability of observing a value at least as extreme in a chi-square distribution with a given number of degrees of freedom:

def chi2P(chi, df):
    assert df & 1 == 0
    m = chi / 2.0
    sum = term = math.exp(-m)
    for i in range(1, df//2):
        term *= m / i
        sum += term
    return min(sum, 1.0)
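
For reference, this is computing the upper-tail probability of a chi-square distribution with an even number of degrees of freedom, which has a simple closed form (using the same names as the code, with m = chi / 2):

$$P(\chi, \mathit{df}) = e^{-m} \sum_{i=0}^{\mathit{df}/2 - 1} \frac{m^i}{i!}$$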

Using a few vocabularies:

USING: kernel math math.functions math.order math.ranges
math.statistics math.vectors sequences ;

We can quickly translate the original function into Factor:

: chi2P ( chi df -- result )
    [ even? [ "odd degrees of freedom" throw ] unless ] keep
    [ 2.0 / ] [ 2 /i ] bi* [1,b) [ dupd / ] map
    [ neg exp dup ] dip [ * [ + ] keep ] each drop 1.0 min ;

Written as a single word, and with all of the stack shuffling, it's a little hard to grok the function without mentally stepping through the code. Instead, we can make a few small changes:

  • Switch to using vector operations.
  • Extract the "guts" of the chi2P word into an "internal" word.
  • Extract the assertion that degrees of freedom should be even into a new word.

The new code looks like:

: df-check ( df -- )
    even? [ "odd degrees of freedom" throw ] unless ;

: (chi2P) ( chi/2 df/2 -- P )
    [1,b) dupd n/v cum-product swap neg exp [ v*n sum ] keep + ;

: chi2P ( chi df -- P )
    dup df-check [ 2.0 / ] [ 2 /i ] bi* (chi2P) 1.0 min ;

To my eye, that looks "better". Code can often be improved by smaller functions, better word or argument names, and some thought to the visual structure of the syntax. This was just a simple refactoring and (as with most things in life) it could be improved further.

Monday, July 12, 2010

Curious? Factor to the rescue!

A lot of dynamic language users have a strong interest in math (particularly those who use functional programming languages). Some months back I came across a fun blog post about using Clojure to approximate Euler's number, based on this observation:

"Here is an example of e turning up unexpectedly. Select a random number between 0 and 1. Now select another and add it to the first. Keep doing this, piling on random numbers. How many random numbers, on average, do you need to make the total greater than 1? Answer: 2.71828..."

I wanted to contrast his solution with one using Factor. We'll use these vocabularies:

USING: kernel math random sequences ;

We can use the random vocabulary to create a random float between 0 and 1, counting how many numbers we need to add before the value is greater than 1:

: numbers-added ( -- n )
    0 0 [ dup 1 < ] [
        [ 1 + ] dip 0.0 1.0 uniform-random-float +
    ] while drop ;

Taking the average of a sequence of numbers is pretty easy:

: average ( seq -- n )
    [ sum ] [ length ] bi / ;

Similar to the original blog post, we'll want to estimate the value of e by running numbers-added a certain number of times and then taking the average of the resulting sequence:

: approximate-e ( n -- approx )
    [ numbers-added ] replicate average ;

Running this from one thousand to one million times gives us an increasingly accurate approximation of e:

( scratchpad ) { 1000 10000 100000 1000000 }
               [ approximate-e >float ] map .
{ 2.694 2.7186 2.72149 2.716958 }

( scratchpad ) USING: math.constants ; e .
2.718281828459045

This could be improved by virtualizing the sequences and then performing an incremental average as we generate each of n approximations. That would likely result in both memory and speed improvements (although I haven't tried it yet).
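
A minimal sketch of that idea might look something like this (a hypothetical approximate-e-incremental word that keeps a running sum instead of building the intermediate sequence, dividing once at the end):

: approximate-e-incremental ( n -- approx )
    ! accumulate n runs of numbers-added, then divide by n
    dup [ 0 ] dip [ numbers-added + ] times swap / ;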

The code for this is on my Github.

Tuesday, July 6, 2010

Where am I? The 10:10 code...

A couple months ago, I came across a blog post discussing an encoding of geographic coordinates that might be "easier" to remember or exchange than latitude/longitude pairs.

The encoding, known as "10:10", is able to identify any location on the planet to a precision of 10 meters using a 10 character code. Without going into the merits of the scheme, I will just note that there is some good discussion in the comments section about ideas for its improvement.

Under the assumption that most programming blog posts are invitations to port things into Factor, I decided to implement the 10:10 code. Unfortunately, the provided Javascript code uses relatively meaningless variable names, so my implementation suffers via inheritance. Essentially, the code boils down to a few steps:

  1. Calculate a single p number representing a latitude/longitude pair.
  2. Calculate a tt number representing a version of p modified by a checksum.
  3. Generate a 10:10 code by iteratively indexing into the alphabet.

We will separate each of these steps into words that can be tested individually and then combined to compute the 10:10 codes.

We will need some vocabularies and a namespace:

USING: kernel math math.functions math.ranges memoize sequences
strings ;

IN: ten-ten

The 10:10 code uses a 29 character alphabet, so we will need to define it:

CONSTANT: ALPHABET "ABCDEFGHJKMNPQRVWXY0123456789"

MEMO: BASE ( -- base ) ALPHABET length ;

First, we compute the p value:

: p ( lat lon -- p )
    [ 90 + ] [ 180 + ] bi*
    [ 10000 * floor >fixnum ] bi@
    [ 3600000 * ] dip + ;
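
For example, with the Eurotunnel coordinates used at the end of this post (latitude 51.09559 and longitude 1.12207), the scaled values work out to 1410955 and 1811220, so (barring floating-point edge cases in the rounding) p should be 1410955 * 3600000 + 1811220:

( scratchpad ) 51.09559 1.12207 p .
5079439811220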

Then, we calculate and add the checksum to create a tt value:

: tt ( p -- tt )
    [ BASE * ] keep 10 [1,b) [
        [ BASE /mod ] dip *
    ] map nip sum BASE mod + floor ;

Next, we convert the tt value to a string (inserting spaces for legibility):

: tt>string ( tt -- str )
    10 [ BASE /mod ALPHABET nth ] replicate 
    nip reverse >string
    [ CHAR: \s 3 ] dip insert-nth
    [ CHAR: \s 7 ] dip insert-nth ;

And, finally, putting it all together:

: ten-ten ( lat lon -- tt )
    p tt tt>string ;

When I run 51.09559 1.12207 ten-ten (the included example for the location of the Eurotunnel in the UK), it gives me a value of MEQ N6G 7NY5. This is not the same result as the original author's, but it matches a version written in C# by another commenter, as well as a version I wrote in Python to validate my results.

You can find my implementation on my Github account.

Thursday, July 1, 2010

Rolling Dice

Update: I noticed that random is [0,b) (returning a number from 0 to n-1). The code has been fixed to make it [1,b] (i.e., return a number from 1 to n).

I was curious about the ability to define new syntax for Factor, and to learn a little about how the lexer works. Inspired by a recent project that I've been working on, I thought it would be interesting to define a simple DSL for calculating dice rolls.

We need a way to describe a dice roll. There are various approaches used in Python and Ruby libraries. A common short-hand description is 4d8 -- specifying how many times (4) to roll a die with a given number (8) of sides.

Let's setup a vocabulary and a list of dependencies that will be used to implement this functionality:

USING: fry kernel lexer math math.parser peg.ebnf random
sequences strings ;

IN: dice

We will be using EBNF parsers similar to what was used in building a calculator. First we will implement support for basic (e.g., 4d8) rolls.

EBNF: parse-roll

number = ([0-9])+    => [[ >string string>number ]]
dice   = "d" number  => [[ second '[ _ random ] ]]
roll   = number dice => [[ first2 '[ 0 _ [ @ + 1 + ] times ] ]]
error  = .*          => [[ "unknown dice" throw ]]
total  = roll | error

;EBNF

We can see how this works by trying it out:

( scratchpad ) "4d8" parse-roll .
[ 0 4 [ 8 random + 1 + ] times ]

( scratchpad ) "4d8" parse-roll call .
15

( scratchpad ) "foo" parse-roll
unknown dice

We can now define a piece of SYNTAX: for dice rolls:

SYNTAX: ROLL: scan parse-roll append ;

And using it:

( scratchpad ) ROLL: 4d8 .
13

This could be extended to support other things such as "base" numbers (e.g., 4d8+10), negative rolls, ranges, and better handling of parse errors. The code for this is available on my Github.