Friday, December 13, 2013

Monte Carlo

The Monte Carlo method is a method of estimation that uses random simulation to solve physical or mathematical problems. Named after the Monte Carlo Casino, you can think of the method as playing a game of chance many times and recording the outcomes (such as how frequently one wins or loses).

One classic example from the Wikipedia article is estimating the value of π (although there are many other ways to approximate the value of π):

For example, consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:

  1. Draw a square on the ground, then inscribe a circle within it.
  2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
  3. Count the number of objects inside the circle and the total number of objects.
  4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

You can visualize how this works by seeing the estimate get more correct as you increase the number of points:

We can generate a random point inside the unit circle by generating two random-unit values:

: random-point ( -- x y )
    random-unit random-unit ;

Using the Pythagorean theorem, we can calculate the distance from the zero point. If the distance is less than or equal to 1.0, then it is inside the circle:

: inside-circle? ( x y -- ? )
    [ sq ] bi@ + sqrt 1.0 <= ;

We can then estimate the value of π by computing a number of points, taking the percentage that are inside the circle and multiplying by 4:

: estimate-pi ( points -- pi-estimate )
    0 swap [
        [ random-point inside-circle? [ 1 + ] when ] times
    ] keep /f 4 * ;

We can run this for varying numbers of points and see how it gets more accurate:

IN: scratchpad { 100 1,000 10,000 100,000 1,000,000 10,000,000 }
               [ estimate-pi . ] each
3.2
3.168
3.162
3.15176
3.14288
3.1418212

Tuesday, December 10, 2013

UU Encoding

Just a quick note, as of a couple months ago, Factor has support for uuencoding (and uudecoding)!

You can perform a uuencode:

IN: scratchpad "Factor" string>uu print
begin
&1F%C=&]R
end

...and also a uudecode:

IN: scratchpad """
               begin 644 factor.txt
               &1F%C=&]R
               end
               """ uu>string .
"Factor"

Right now, it operates on text directly and doesn't preserve the file name and permissions in the begin header, but that would be an easy improvement.

The code for this is available in the development version of Factor.

Thursday, December 5, 2013

Humanhash

Zachary Voase published the humanhash project on GitHub, making "human-readable representations of digests". Below is a compatible implementation in Factor.

To show how it will work, we use the example from his humanhash README:

IN: scratchpad CONSTANT: digest "7528880a986c40e78c38115e640da2a1"

IN: scratchpad digest humanhash .
"three-georgia-xray-jig"

IN: scratchpad digest 6 humanhash-words .
"high-mango-white-oregon-purple-charlie"

IN: scratchpad human-uuid4 2array .
{
    "28129036-75a7-4c87-984b-4b32231e0a0d"
    "nineteen-bluebird-oxygen-edward"
}

Implementation

We need a list of 256 words, one to represent each possible byte:

CONSTANT: default-wordlist {
    "ack" "alabama" "alanine" "alaska" "alpha" "angel" "apart"
    "april" "arizona" "arkansas" "artist" "asparagus" "aspen"
    "august" "autumn" "avocado" "bacon" "bakerloo" "batman" "beer"
    "berlin" "beryllium" "black" "blossom" "blue" "bluebird" "bravo"
    "bulldog" "burger" "butter" "california" "carbon" "cardinal"
    "carolina" "carpet" "cat" "ceiling" "charlie" "chicken" "coffee"
    "cola" "cold" "colorado" "comet" "connecticut" "crazy" "cup"
    "dakota" "december" "delaware" "delta" "diet" "don" "double"
    "early" "earth" "east" "echo" "edward" "eight" "eighteen"
    "eleven" "emma" "enemy" "equal" "failed" "fanta" "fifteen"
    "fillet" "finch" "fish" "five" "fix" "floor" "florida"
    "football" "four" "fourteen" "foxtrot" "freddie" "friend"
    "fruit" "gee" "georgia" "glucose" "golf" "green" "grey" "hamper"
    "happy" "harry" "hawaii" "helium" "high" "hot" "hotel"
    "hydrogen" "idaho" "illinois" "india" "indigo" "ink" "iowa"
    "island" "item" "jersey" "jig" "johnny" "juliet" "july"
    "jupiter" "kansas" "kentucky" "kilo" "king" "kitten" "lactose"
    "lake" "lamp" "lemon" "leopard" "lima" "lion" "lithium" "london"
    "louisiana" "low" "magazine" "magnesium" "maine" "mango" "march"
    "mars" "maryland" "massachusetts" "may" "mexico" "michigan"
    "mike" "minnesota" "mirror" "mississippi" "missouri" "mobile"
    "mockingbird" "monkey" "montana" "moon" "mountain" "muppet"
    "music" "nebraska" "neptune" "network" "nevada" "nine"
    "nineteen" "nitrogen" "north" "november" "nuts" "october" "ohio"
    "oklahoma" "one" "orange" "oranges" "oregon" "oscar" "oven"
    "oxygen" "papa" "paris" "pasta" "pennsylvania" "pip" "pizza"
    "pluto" "potato" "princess" "purple" "quebec" "queen" "quiet"
    "red" "river" "robert" "robin" "romeo" "rugby" "sad" "salami"
    "saturn" "september" "seven" "seventeen" "shade" "sierra"
    "single" "sink" "six" "sixteen" "skylark" "snake" "social"
    "sodium" "solar" "south" "spaghetti" "speaker" "spring"
    "stairway" "steak" "stream" "summer" "sweet" "table" "tango"
    "ten" "tennessee" "tennis" "texas" "thirteen" "three" "timing"
    "triple" "twelve" "twenty" "two" "uncle" "undress" "uniform"
    "uranus" "utah" "vegan" "venus" "vermont" "victor" "video"
    "violet" "virginia" "washington" "west" "whiskey" "white"
    "william" "winner" "winter" "wisconsin" "wolfram" "wyoming"
    "xray" "yankee" "yellow" "zebra" "zulu"
}

One of the inputs is the number of words to produce. An error is produced if fewer bytes are provided than the number of words requested:

ERROR: too-few-bytes seq #words ;

: check-bytes ( seq #words -- seq #words )
    2dup [ length ] [ < ] bi* [ too-few-bytes ] when ; inline

Input is grouped into subsequences, where the number of subsequences is the number of words requested in the output. It's a little bit odd, but basically makes groups and then puts any remainder in the last group:

: group-words ( seq #words -- groups )
    [ dupd [ length ] [ /i ] bi* group ]
    [ 1 - cut concat suffix ] bi ; inline

Our input bytes are compressed, first by grouping them into words, then by XORing the bytes in each word:

: compress-bytes ( seq #words -- newseq )
    check-bytes group-words [ 0 [ bitxor ] reduce ] map ;

Our input will either be a byte-array, or a hex-string with every two characters representing the hexadecimal value of each byte:

: byte-string ( hexdigest -- seq )
    dup byte-array? [ 2 <groups> [ hex> ] map ] unless ;

Making a humanhash is simply converting the input, compressing into bytes representing each word, looking up the word from the word list, and joining with a requested separator:

: make-humanhash ( hexdigest #words wordlist sep -- hash )
    { [ byte-string ] [ compress-bytes ] [ nths ] [ join ] } spread ;

We provide a way to hash into a requested number of words, or four by default:

: humanhash-words ( hexdigest #words -- hash )
    default-wordlist "-" make-humanhash ;

: humanhash ( hexdigest -- hash )
    4 humanhash-words ;

And since the humanhash project includes a way to create humanhash'd uuids, we do also:

: human-uuid4 ( -- uuid hash )
    uuid4 dup [ CHAR: - = not ] filter humanhash ;

Thursday, November 28, 2013

tzfile

I have wanted to parse timezone information files (also known as "tzfile") for awhile. In particular, so that Factor can begin to support named timezones in a smarter way.

Parsing

The tzfile is a binary format file from the tz database (also known as the "zoneinfo database"). Each tzfile starts with the four magic bytes "TZif", which we can check:

ERROR: bad-magic ;

: check-magic ( -- )
    4 read "TZif" sequence= [ bad-magic ] unless ;

The tzfile then contains a header followed by a series of ttinfo structures and other information:

STRUCT: tzhead
    { tzh_reserved char[16] }
    { tzh_ttisgmtcnt be32 }
    { tzh_ttisstdcnt be32 }
    { tzh_leapcnt be32 }
    { tzh_timecnt be32 }
    { tzh_typecnt be32 }
    { tzh_charcnt be32 } ;

PACKED-STRUCT: ttinfo
    { tt_gmtoff be32 }
    { tt_isdst uchar }
    { tt_abbrind uchar } ;

We can store all the information parsed from the tzfile in a tuple:

TUPLE: tzfile header transition-times local-times types abbrevs
leaps is-std is-gmt ;

C: <tzfile> tzfile

With a helper word to read 32-bit big-endian numbers, we can parse the entire file:

: read-be32 ( -- n )
    4 read be32 deref ;

: read-tzfile ( -- tzfile )
    check-magic tzhead read-struct dup {
        [ tzh_timecnt>> [ read-be32 ] replicate ]
        [ tzh_timecnt>> [ read1 ] replicate ]
        [ tzh_typecnt>> [ ttinfo read-struct ] replicate ]
        [ tzh_charcnt>> read ]
        [ tzh_leapcnt>> [ read-be32 read-be32 2array ] replicate ]
        [ tzh_ttisstdcnt>> read ]
        [ tzh_ttisgmtcnt>> read ]
    } cleave <tzfile> ;

All of that data specifies a series of local time types and transition times:

TUPLE: local-time gmt-offset dst? abbrev std? gmt? ;

C: <local-time> local-time

TUPLE: transition seconds timestamp local-time ;

C: <transition> transition

The abbreviated local time names are stored in a flattened array. It would be helpful to parse them out into a hashtable where the key is the starting character index in the flattened array:

:: tznames ( abbrevs -- assoc )
    0 [
        0 over abbrevs index-from dup
    ] [
        [ dupd abbrevs subseq >string 2array ] keep 1 + swap
    ] produce 2nip >hashtable ;

We can now construct an array of all the transition times and the local time types they represent. This is a lot of logic for a typical Factor word, so we use local variables to make it easier to understand:

:: tzfile>transitions ( tzfile -- transitions )
    tzfile abbrevs>> tznames :> abbrevs
    tzfile is-std>> :> is-std
    tzfile is-gmt>> :> is-gmt
    tzfile types>> [
        [
            {
                [ tt_gmtoff>> seconds ]
                [ tt_isdst>> 1 = ]
                [ tt_abbrind>> abbrevs at ]
            } cleave
        ] dip
        [ is-std ?nth dup [ 1 = ] when ]
        [ is-gmt ?nth dup [ 1 = ] when ] bi <local-time>
    ] map-index :> local-times
    tzfile transition-times>>
    tzfile local-times>> [
        [ dup unix-time>timestamp ] [ local-times nth ] bi*
        <transition>
    ] 2map ;

We want to wrap the tzfile parsed structure and the transitions in a tzinfo object that can be used later with timestamps. These tzinfo objects are created by parsing from specific files by path or by their zoneinfo name:

TUPLE: tzinfo tzfile transitions ;

C: <tzinfo> tzinfo

: file>tzinfo ( path -- tzinfo )
    binary [
        read-tzfile dup tzfile>transitions <tzinfo>
    ] with-file-reader ;

: load-tzinfo ( name -- tzinfo )
    "/usr/share/zoneinfo/" prepend file>tzinfo ;

Timestamps

Now that we have the tzinfo, we can convert a UTC timestamp into the timezone specified by our tzfile. This is accomplished by finding the transition time that affects the requested timestamp and adjusting by the GMT offset that it represents:

: find-transition ( timestamp tzinfo -- transition )
    [ timestamp>unix-time ] [ transitions>> ] bi*
    [ [ seconds>> before? ] with find drop ]
    [ swap [ 1 [-] swap nth ] [ last ] if* ] bi ;

: from-utc ( timestamp tzinfo -- timestamp' )
    [ drop instant >>gmt-offset ]
    [ find-transition local-time>> gmt-offset>> ] 2bi
    convert-timezone ;

Or normalize a timestamp that might be in a different timezone into the timezone specified by our tzfile (converting into and then out of UTC):

: normalize ( timestamp tzinfo -- timestamp' )
    [ instant convert-timezone ] [ from-utc ] bi* ;

Example

An example of it working, taking a date in PST that is after a daylight savings transition, printing it out then subtracting 10 minutes and normalizing to the "US/Pacific" zoneinfo file, printing it out showing the time in PDT:

IN: scratchpad ! Take a time in PST
               2002 10 27 1 0 0 -8 hours <timestamp>

               ! Print it out
               dup "%c" strftime .
"Sun Oct 27 01:00:00 2002"

IN: scratchpad ! Subtract 10 minutes
               10 minutes time-

               ! Normalize to US-Pacific timezone
               "US/Pacific" load-tzinfo normalize

               ! Print it out
               "%c" strftime .
"Sun Oct 27 01:50:00 2002"

The code for this is available in the development version of Factor.

Monday, November 25, 2013

N-Numbers

In the United States, "N-Numbers" are the name given to aircraft registrations. Some of the services that the FAA provides include the ability to lookup aircraft by N-Number or reserve an N-Number.

Below we implement the rules to detect if a string is a valid N-Number in Factor.

  • may not begin with zero.
  • may not be the letters "I" or "O" to avoid confusion with the numbers one or zero.
: (n-number?) ( digits letters -- ? )
    [ dup first CHAR: 0 = [ drop f ] [ [ digit? ] all? ] if ]
    [ [ [ Letter? ] [ "IiOo" member? not ] bi and ] all? ]
    bi* and ;
  • may be one to five numbers (e.g., N12345).
  • may be one to four numbers and one suffix letter (e.g., N1A and N1234Z).
  • may be one to three numbers and two suffix letters (e.g., N24BY and N123AZ).
: n-number? ( str -- ? )
    "N" ?head drop {
        [ { [ length 1 5 between? ] [ f (n-number?) ] } 1&& ]
        [ { [ length 2 5 between? ] [ 1 cut* (n-number?) ] } 1&& ]
        [ { [ length 3 5 between? ] [ 2 cut* (n-number?) ] } 1&& ]
    } 1|| ;

Registration numbers N1 through N99 are reserved for Federal Aviation Administration (FAA) internal use and are not available.

: reserved? ( str -- ? )
    "N" ?head drop
    { [ length 1 2 between? ] [ [ digit? ] all? ] } 1&& ;

The code and some tests for this is on my GitHub.

Friday, November 22, 2013

tree

I've used Factor to build several common unix programs including copy, cat, fortune, wc, move, uniq, and others.

Today, I wanted to show how to build the tree program. If we look at the man page, we can see that it is used to:

Tree is a recursive directory listing program that produces a depth indented listing of files. With no arguments, tree lists the files in the current directory. When directory arguments are given, tree lists all the files and/or directories found in the given directories each in turn. Upon completion of listing all files/directories found, tree returns the total number of files and/or directories listed.

We need to keep track of files and directories that are traversed:

SYMBOL: #files
SYMBOL: #directories

Each file name is indented according to a specified depth:

: indent ( n -- )
    [ [ "|   " write ] times ] unless-zero "+-- " write ;

: write-name ( indent entry -- )
    [ indent ] [ name>> write ] bi* ;

File names are written and increment the #files counter:

: write-file ( indent entry -- )
    write-name #files [ 1 + ] change-global ;

Directory names are written, increase the indent, recurse writing their directory tree, and increment the #directories counter:

DEFER: write-tree

: write-dir ( indent entry -- )
    [ write-name ] [
        [ [ 1 + ] [ name>> ] bi* write-tree ]
        [ 3drop " [error opening dir]" write ] recover
    ] 2bi #directories [ 1 + ] change-global ;

Using write-file and write-dir, we can implement write-tree to sort the entries and then write each according to their type (e.g., file or directory):

: write-tree ( indent path -- )
    [
        [ name>> ] sort-with [
            nl [ dup ] bi@ type>> +directory+ =
            [ write-dir ] [ write-file ] if
        ] each drop
    ] with-directory-entries ;

Finally, we can implement the tree command, initializing the file and directory count, recursing on the path specified, and then printing out the number of files and directories observed:

: tree ( path -- )
    0 #directories set-global 0 #files set-global
    [ write ] [ 0 swap write-tree ] bi nl
    #directories get-global #files get-global
    "\n%d directories, %d files\n" printf ;

Our command-line word will either operate on the arguments specifying a list of directories to process, or the current directory if none are provided:

: run-tree ( -- )
    command-line get [
        current-directory get tree
    ] [
        [ tree ] each
    ] if-empty ;

MAIN: run-tree

This is available on my GitHub.

Monday, November 18, 2013

MessagePack

MessagePack is an "efficient binary serialization format" designed to be faster and smaller than JSON and has support in many programming languages.

Less than 150 lines of code!

What is that, you ask? Why thats the number of lines of code it took to implement a MessagePack encoder and decoder in Factor.

Reading

For decoding, our strategy will be to create a read-msgpack word that can operate on the current input-stream (allowing reuse for MessagePack objects read from files, strings, or the network).

DEFER: read-msgpack

Aside from support for basic data types such as integers, floating-point numbers, and strings, we also need to support arrays of objects, maps of key/value pairs, and so-called "extended" object types:

: read-array ( n -- obj )
    [ read-msgpack ] replicate ;

: read-map ( n -- obj )
    2 * read-array 2 group >hashtable ;

: read-ext ( n -- obj )
    read be> [ 1 read signed-be> ] dip read 2array ;

We need a way to specify a "nil" (or "null") object since we use t and f for booleans:

SINGLETON: +msgpack-nil+

And, of course, an error to indicate when a requested format is not supported:

ERROR: unknown-format n ;

With those definitions done, we can build a word to read a single MessagePack object from a stream:

: read-msgpack ( -- obj )
    read1 {
        { [ dup 0xc0 = ] [ drop +msgpack-nil+ ] }
        { [ dup 0xc2 = ] [ drop f ] }
        { [ dup 0xc3 = ] [ drop t ] }
        { [ dup 0x00 0x7f between? ] [ ] }
        { [ dup 0xe0 mask? ] [ 1array signed-be> ] }
        { [ dup 0xcc = ] [ drop read1 ] }
        { [ dup 0xcd = ] [ drop 2 read be> ] }
        { [ dup 0xce = ] [ drop 4 read be> ] }
        { [ dup 0xcf = ] [ drop 8 read be> ] }
        { [ dup 0xd0 = ] [ drop 1 read signed-be> ] }
        { [ dup 0xd1 = ] [ drop 2 read signed-be> ] }
        { [ dup 0xd2 = ] [ drop 4 read signed-be> ] }
        { [ dup 0xd3 = ] [ drop 8 read signed-be> ] }
        { [ dup 0xca = ] [ drop 4 read be> bits>float ] }
        { [ dup 0xcb = ] [ drop 8 read be> bits>double ] }
        { [ dup 0xe0 mask 0xa0 = ] [ 0x1f mask read ] }
        { [ dup 0xd9 = ] [ drop read1 read "" like ] }
        { [ dup 0xda = ] [ drop 2 read be> read "" like ] }
        { [ dup 0xdb = ] [ drop 4 read be> read "" like ] }
        { [ dup 0xc4 = ] [ drop read1 read B{ } like ] }
        { [ dup 0xc5 = ] [ drop 2 read be> read B{ } like ] }
        { [ dup 0xc6 = ] [ drop 4 read be> read B{ } like ] }
        { [ dup 0xf0 mask 0x90 = ] [ 0x0f mask read-array ] }
        { [ dup 0xdc = ] [ drop 2 read be> read-array ] }
        { [ dup 0xdd = ] [ drop 4 read be> read-array ] }
        { [ dup 0xf0 mask 0x80 = ] [ 0x0f mask read-map ] }
        { [ dup 0xde = ] [ drop 2 read be> read-map ] }
        { [ dup 0xdf = ] [ drop 4 read be> read-map ] }
        { [ dup 0xd4 = ] [ drop 1 read-ext ] }
        { [ dup 0xd5 = ] [ drop 2 read-ext ] }
        { [ dup 0xd6 = ] [ drop 4 read-ext ] }
        { [ dup 0xd7 = ] [ drop 8 read-ext ] }
        { [ dup 0xd8 = ] [ drop 16 read-ext ] }
        { [ dup 0xc7 = ] [ drop read1 read-ext ] }
        { [ dup 0xc8 = ] [ drop 2 read be> read-ext ] }
        { [ dup 0xc9 = ] [ drop 4 read be> read-ext ] }
        [ unknown-format ]
    } cond ;

Pretty simple!

Writing

For encoding, our strategy will be to define a generic write-msgpack word that will dispatch off the type of object being encoded and operate on the current output-stream (allowing reuse for MessagePack objects written to files, strings, or the network).

GENERIC: write-msgpack ( obj -- )

And, of course, an error to indicate when a requested object type isn't supported:

ERROR: cannot-convert obj ;

Writing the "nil" (or "null" object) and boolean values true and false:

M: +msgpack-nil+ write-msgpack drop 0xc0 write1 ;

M: f write-msgpack drop 0xc2 write1 ;

M: t write-msgpack drop 0xc3 write1 ;

Support for integers and floating point numbers:

M: integer write-msgpack
    dup 0 >= [
        {
            { [ dup 0x7f <= ] [ write1 ] }
            { [ dup 0xff <= ] [ 0xcc write1 1 >be write ] }
            { [ dup 0xffff <= ] [ 0xcd write1 2 >be write ] }
            { [ dup 0xffffffff <= ] [ 0xce write1 4 >be write ] }
            { [ dup 0xffffffffffffffff <= ] 
                [ 0xcf write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] [
        {
            { [ dup -0x1f >= ] [ 1 >be write ] }
            { [ dup -0x80 >= ] [ 0xd0 write1 1 >be write ] }
            { [ dup -0x8000 >= ] [ 0xd1 write1 2 >be write ] }
            { [ dup -0x80000000 >= ] [ 0xd2 write1 4 >be write ] }
            { [ dup -0x8000000000000000 >= ] 
                [ 0xd3 write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] if ;

M: float write-msgpack
    0xcb write1 double>bits 8 >be write ;

Support for strings and byte-arrays:

M: string write-msgpack
    dup length {
        { [ dup 0x1f <= ] [ 0xa0 bitor write1 ] }
        { [ dup 0xff <= ] [ 0xd9 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xda write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdb write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;

M: byte-array write-msgpack
    dup length {
        { [ dup 0xff <= ] [ 0xc4 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xc5 write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xc6 write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;

Support for arrays of MessagePack objects:

: write-array-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x90 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xdc write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdd write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: sequence write-msgpack
    dup length write-array-header [ write-msgpack ] each ;

Support for maps of key/value pairs:

: write-map-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x80 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xde write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdf write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: assoc write-msgpack
    dup assoc-size write-map-header
    [ [ write-msgpack ] bi@ ] assoc-each ;

Convenience

To conveniently convert into and out of the MessagePack format, we can make words to read from and write to strings:

: msgpack> ( string -- obj )
    [ read-msgpack ] with-string-reader ;

: >msgpack ( obj -- string )
    [ write-msgpack ] with-string-writer ;

Not too hard, was it!

The code for this (including some documentation and tests) is available in the development version of Factor.