Thursday, November 28, 2013

tzfile

I have wanted to parse timezone information files (also known as "tzfile") for awhile. In particular, so that Factor can begin to support named timezones in a smarter way.

Parsing

The tzfile is a binary format file from the tz database (also known as the "zoneinfo database"). Each tzfile starts with the four magic bytes "TZif", which we can check:

ERROR: bad-magic ;

: check-magic ( -- )
    4 read "TZif" sequence= [ bad-magic ] unless ;

The tzfile then contains a header followed by a series of ttinfo structures and other information:

STRUCT: tzhead
    { tzh_reserved char[16] }
    { tzh_ttisgmtcnt be32 }
    { tzh_ttisstdcnt be32 }
    { tzh_leapcnt be32 }
    { tzh_timecnt be32 }
    { tzh_typecnt be32 }
    { tzh_charcnt be32 } ;

PACKED-STRUCT: ttinfo
    { tt_gmtoff be32 }
    { tt_isdst uchar }
    { tt_abbrind uchar } ;

We can store all the information parsed from the tzfile in a tuple:

TUPLE: tzfile header transition-times local-times types abbrevs
leaps is-std is-gmt ;

C: <tzfile> tzfile

With a helper word to read 32-bit big-endian numbers, we can parse the entire file:

: read-be32 ( -- n )
    4 read be32 deref ;

: read-tzfile ( -- tzfile )
    check-magic tzhead read-struct dup {
        [ tzh_timecnt>> [ read-be32 ] replicate ]
        [ tzh_timecnt>> [ read1 ] replicate ]
        [ tzh_typecnt>> [ ttinfo read-struct ] replicate ]
        [ tzh_charcnt>> read ]
        [ tzh_leapcnt>> [ read-be32 read-be32 2array ] replicate ]
        [ tzh_ttisstdcnt>> read ]
        [ tzh_ttisgmtcnt>> read ]
    } cleave <tzfile> ;

All of that data specifies a series of local time types and transition times:

TUPLE: local-time gmt-offset dst? abbrev std? gmt? ;

C: <local-time> local-time

TUPLE: transition seconds timestamp local-time ;

C: <transition> transition

The abbreviated local time names are stored in a flattened array. It would be helpful to parse them out into a hashtable where the key is the starting character index in the flattened array:

:: tznames ( abbrevs -- assoc )
    0 [
        0 over abbrevs index-from dup
    ] [
        [ dupd abbrevs subseq >string 2array ] keep 1 + swap
    ] produce 2nip >hashtable ;

We can now construct an array of all the transition times and the local time types they represent. This is a lot of logic for a typical Factor word, so we use local variables to make it easier to understand:

:: tzfile>transitions ( tzfile -- transitions )
    tzfile abbrevs>> tznames :> abbrevs
    tzfile is-std>> :> is-std
    tzfile is-gmt>> :> is-gmt
    tzfile types>> [
        [
            {
                [ tt_gmtoff>> seconds ]
                [ tt_isdst>> 1 = ]
                [ tt_abbrind>> abbrevs at ]
            } cleave
        ] dip
        [ is-std ?nth dup [ 1 = ] when ]
        [ is-gmt ?nth dup [ 1 = ] when ] bi <local-time>
    ] map-index :> local-times
    tzfile transition-times>>
    tzfile local-times>> [
        [ dup unix-time>timestamp ] [ local-times nth ] bi*
        <transition>
    ] 2map ;

We want to wrap the tzfile parsed structure and the transitions in a tzinfo object that can be used later with timestamps. These tzinfo objects are created by parsing from specific files by path or by their zoneinfo name:

TUPLE: tzinfo tzfile transitions ;

C: <tzinfo> tzinfo

: file>tzinfo ( path -- tzinfo )
    binary [
        read-tzfile dup tzfile>transitions <tzinfo>
    ] with-file-reader ;

: load-tzinfo ( name -- tzinfo )
    "/usr/share/zoneinfo/" prepend file>tzinfo ;

Timestamps

Now that we have the tzinfo, we can convert a UTC timestamp into the timezone specified by our tzfile. This is accomplished by finding the transition time that affects the requested timestamp and adjusting by the GMT offset that it represents:

: find-transition ( timestamp tzinfo -- transition )
    [ timestamp>unix-time ] [ transitions>> ] bi*
    [ [ seconds>> before? ] with find drop ]
    [ swap [ 1 [-] swap nth ] [ last ] if* ] bi ;

: from-utc ( timestamp tzinfo -- timestamp' )
    [ drop instant >>gmt-offset ]
    [ find-transition local-time>> gmt-offset>> ] 2bi
    convert-timezone ;

Or normalize a timestamp that might be in a different timezone into the timezone specified by our tzfile (converting into and then out of UTC):

: normalize ( timestamp tzinfo -- timestamp' )
    [ instant convert-timezone ] [ from-utc ] bi* ;

Example

An example of it working, taking a date in PST that is after a daylight savings transition, printing it out then subtracting 10 minutes and normalizing to the "US/Pacific" zoneinfo file, printing it out showing the time in PDT:

IN: scratchpad ! Take a time in PST
               2002 10 27 1 0 0 -8 hours <timestamp>

               ! Print it out
               dup "%c" strftime .
"Sun Oct 27 01:00:00 2002"

IN: scratchpad ! Subtract 10 minutes
               10 minutes time-

               ! Normalize to US-Pacific timezone
               "US/Pacific" load-tzinfo normalize

               ! Print it out
               "%c" strftime .
"Sun Oct 27 01:50:00 2002"

The code for this is available in the development version of Factor.

Monday, November 25, 2013

N-Numbers

In the United States, "N-Numbers" are the name given to aircraft registrations. Some of the services that the FAA provides include the ability to lookup aircraft by N-Number or reserve an N-Number.

Below we implement the rules to detect if a string is a valid N-Number in Factor.

  • may not begin with zero.
  • may not be the letters "I" or "O" to avoid confusion with the numbers one or zero.
: (n-number?) ( digits letters -- ? )
    [ dup first CHAR: 0 = [ drop f ] [ [ digit? ] all? ] if ]
    [ [ [ Letter? ] [ "IiOo" member? not ] bi and ] all? ]
    bi* and ;
  • may be one to five numbers (e.g., N12345).
  • may be one to four numbers and one suffix letter (e.g., N1A and N1234Z).
  • may be one to three numbers and two suffix letters (e.g., N24BY and N123AZ).
: n-number? ( str -- ? )
    "N" ?head drop {
        [ { [ length 1 5 between? ] [ f (n-number?) ] } 1&& ]
        [ { [ length 2 5 between? ] [ 1 cut* (n-number?) ] } 1&& ]
        [ { [ length 3 5 between? ] [ 2 cut* (n-number?) ] } 1&& ]
    } 1|| ;

Registration numbers N1 through N99 are reserved for Federal Aviation Administration (FAA) internal use and are not available.

: reserved? ( str -- ? )
    "N" ?head drop
    { [ length 1 2 between? ] [ [ digit? ] all? ] } 1&& ;

The code and some tests for this is on my GitHub.

Friday, November 22, 2013

tree

I've used Factor to build several common unix programs including copy, cat, fortune, wc, move, uniq, and others.

Today, I wanted to show how to build the tree program. If we look at the man page, we can see that it is used to:

Tree is a recursive directory listing program that produces a depth indented listing of files. With no arguments, tree lists the files in the current directory. When directory arguments are given, tree lists all the files and/or directories found in the given directories each in turn. Upon completion of listing all files/directories found, tree returns the total number of files and/or directories listed.

We need to keep track of files and directories that are traversed:

SYMBOL: #files
SYMBOL: #directories

Each file name is indented according to a specified depth:

: indent ( n -- )
    [ [ "|   " write ] times ] unless-zero "+-- " write ;

: write-name ( indent entry -- )
    [ indent ] [ name>> write ] bi* ;

File names are written and increment the #files counter:

: write-file ( indent entry -- )
    write-name #files [ 1 + ] change-global ;

Directory names are written, increase the indent, recurse writing their directory tree, and increment the #directories counter:

DEFER: write-tree

: write-dir ( indent entry -- )
    [ write-name ] [
        [ [ 1 + ] [ name>> ] bi* write-tree ]
        [ 3drop " [error opening dir]" write ] recover
    ] 2bi #directories [ 1 + ] change-global ;

Using write-file and write-dir, we can implement write-tree to sort the entries and then write each according to their type (e.g., file or directory):

: write-tree ( indent path -- )
    [
        [ name>> ] sort-with [
            nl [ dup ] bi@ type>> +directory+ =
            [ write-dir ] [ write-file ] if
        ] each drop
    ] with-directory-entries ;

Finally, we can implement the tree command, initializing the file and directory count, recursing on the path specified, and then printing out the number of files and directories observed:

: tree ( path -- )
    0 #directories set-global 0 #files set-global
    [ write ] [ 0 swap write-tree ] bi nl
    #directories get-global #files get-global
    "\n%d directories, %d files\n" printf ;

Our command-line word will either operate on the arguments specifying a list of directories to process, or the current directory if none are provided:

: run-tree ( -- )
    command-line get [
        current-directory get tree
    ] [
        [ tree ] each
    ] if-empty ;

MAIN: run-tree

This is available on my GitHub.

Monday, November 18, 2013

MessagePack

MessagePack is an "efficient binary serialization format" designed to be faster and smaller than JSON and has support in many programming languages.

Less than 150 lines of code!

What is that, you ask? Why thats the number of lines of code it took to implement a MessagePack encoder and decoder in Factor.

Reading

For decoding, our strategy will be to create a read-msgpack word that can operate on the current input-stream (allowing reuse for MessagePack objects read from files, strings, or the network).

DEFER: read-msgpack

Aside from support for basic data types such as integers, floating-point numbers, and strings, we also need to support arrays of objects, maps of key/value pairs, and so-called "extended" object types:

: read-array ( n -- obj )
    [ read-msgpack ] replicate ;

: read-map ( n -- obj )
    2 * read-array 2 group >hashtable ;

: read-ext ( n -- obj )
    read be> [ 1 read signed-be> ] dip read 2array ;

We need a way to specify a "nil" (or "null") object since we use t and f for booleans:

SINGLETON: +msgpack-nil+

And, of course, an error to indicate when a requested format is not supported:

ERROR: unknown-format n ;

With those definitions done, we can build a word to read a single MessagePack object from a stream:

: read-msgpack ( -- obj )
    read1 {
        { [ dup 0xc0 = ] [ drop +msgpack-nil+ ] }
        { [ dup 0xc2 = ] [ drop f ] }
        { [ dup 0xc3 = ] [ drop t ] }
        { [ dup 0x00 0x7f between? ] [ ] }
        { [ dup 0xe0 mask? ] [ 1array signed-be> ] }
        { [ dup 0xcc = ] [ drop read1 ] }
        { [ dup 0xcd = ] [ drop 2 read be> ] }
        { [ dup 0xce = ] [ drop 4 read be> ] }
        { [ dup 0xcf = ] [ drop 8 read be> ] }
        { [ dup 0xd0 = ] [ drop 1 read signed-be> ] }
        { [ dup 0xd1 = ] [ drop 2 read signed-be> ] }
        { [ dup 0xd2 = ] [ drop 4 read signed-be> ] }
        { [ dup 0xd3 = ] [ drop 8 read signed-be> ] }
        { [ dup 0xca = ] [ drop 4 read be> bits>float ] }
        { [ dup 0xcb = ] [ drop 8 read be> bits>double ] }
        { [ dup 0xe0 mask 0xa0 = ] [ 0x1f mask read ] }
        { [ dup 0xd9 = ] [ drop read1 read "" like ] }
        { [ dup 0xda = ] [ drop 2 read be> read "" like ] }
        { [ dup 0xdb = ] [ drop 4 read be> read "" like ] }
        { [ dup 0xc4 = ] [ drop read1 read B{ } like ] }
        { [ dup 0xc5 = ] [ drop 2 read be> read B{ } like ] }
        { [ dup 0xc6 = ] [ drop 4 read be> read B{ } like ] }
        { [ dup 0xf0 mask 0x90 = ] [ 0x0f mask read-array ] }
        { [ dup 0xdc = ] [ drop 2 read be> read-array ] }
        { [ dup 0xdd = ] [ drop 4 read be> read-array ] }
        { [ dup 0xf0 mask 0x80 = ] [ 0x0f mask read-map ] }
        { [ dup 0xde = ] [ drop 2 read be> read-map ] }
        { [ dup 0xdf = ] [ drop 4 read be> read-map ] }
        { [ dup 0xd4 = ] [ drop 1 read-ext ] }
        { [ dup 0xd5 = ] [ drop 2 read-ext ] }
        { [ dup 0xd6 = ] [ drop 4 read-ext ] }
        { [ dup 0xd7 = ] [ drop 8 read-ext ] }
        { [ dup 0xd8 = ] [ drop 16 read-ext ] }
        { [ dup 0xc7 = ] [ drop read1 read-ext ] }
        { [ dup 0xc8 = ] [ drop 2 read be> read-ext ] }
        { [ dup 0xc9 = ] [ drop 4 read be> read-ext ] }
        [ unknown-format ]
    } cond ;

Pretty simple!

Writing

For encoding, our strategy will be to define a generic write-msgpack word that will dispatch off the type of object being encoded and operate on the current output-stream (allowing reuse for MessagePack objects written to files, strings, or the network).

GENERIC: write-msgpack ( obj -- )

And, of course, an error to indicate when a requested object type isn't supported:

ERROR: cannot-convert obj ;

Writing the "nil" (or "null" object) and boolean values true and false:

M: +msgpack-nil+ write-msgpack drop 0xc0 write1 ;

M: f write-msgpack drop 0xc2 write1 ;

M: t write-msgpack drop 0xc3 write1 ;

Support for integers and floating point numbers:

M: integer write-msgpack
    dup 0 >= [
        {
            { [ dup 0x7f <= ] [ write1 ] }
            { [ dup 0xff <= ] [ 0xcc write1 1 >be write ] }
            { [ dup 0xffff <= ] [ 0xcd write1 2 >be write ] }
            { [ dup 0xffffffff <= ] [ 0xce write1 4 >be write ] }
            { [ dup 0xffffffffffffffff <= ] 
                [ 0xcf write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] [
        {
            { [ dup -0x1f >= ] [ 1 >be write ] }
            { [ dup -0x80 >= ] [ 0xd0 write1 1 >be write ] }
            { [ dup -0x8000 >= ] [ 0xd1 write1 2 >be write ] }
            { [ dup -0x80000000 >= ] [ 0xd2 write1 4 >be write ] }
            { [ dup -0x8000000000000000 >= ] 
                [ 0xd3 write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] if ;

M: float write-msgpack
    0xcb write1 double>bits 8 >be write ;

Support for strings and byte-arrays:

M: string write-msgpack
    dup length {
        { [ dup 0x1f <= ] [ 0xa0 bitor write1 ] }
        { [ dup 0xff <= ] [ 0xd9 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xda write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdb write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;

M: byte-array write-msgpack
    dup length {
        { [ dup 0xff <= ] [ 0xc4 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xc5 write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xc6 write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;

Support for arrays of MessagePack objects:

: write-array-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x90 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xdc write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdd write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: sequence write-msgpack
    dup length write-array-header [ write-msgpack ] each ;

Support for maps of key/value pairs:

: write-map-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x80 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xde write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdf write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: assoc write-msgpack
    dup assoc-size write-map-header
    [ [ write-msgpack ] bi@ ] assoc-each ;

Convenience

To conveniently convert into and out of the MessagePack format, we can make words to read from and write to strings:

: msgpack> ( string -- obj )
    [ read-msgpack ] with-string-reader ;

: >msgpack ( obj -- string )
    [ write-msgpack ] with-string-writer ;

Not too hard, was it!

The code for this (including some documentation and tests) is available in the development version of Factor.

Saturday, November 9, 2013

ChuckNorris is a Color

I just came across a fun question asking why does HTML think 'chucknorris' is a color on StackOverflow. This is called "Flex Hex" and is described in detail in a little rant about Microsoft Internet Explorer's color parsing.

It turns out these are all valid colors:

chucknorrissickcrap

I thought it would be fun to support parsing colors in this (arguably broken) format in Factor. The algorithm basically breaks down into three parts:

  1. Convert non-hexadecimal digits to zero.

    : hex-only ( str -- str' )
        [ dup hex-digit? [ drop CHAR: 0 ] unless ] map ;
  2. Group into three equal groups, padding on the right with zero if necessary.

    : pad-length ( str -- n )
        length dup 3 mod [ 3 swap - + ] unless-zero ;
    
    : three-groups ( str -- array )
        dup pad-length [ CHAR: 0 pad-tail ] [ 3 / group ] bi ;
  3. Convert each segment into a two-digit hexadecimal value, shortening each segment first to eight chars from the right, padding on the left if only one character.

    : hex-rgb ( array -- array' )
        [
            8 short tail*
            2 short head
            2 CHAR: 0 pad-head
        ] map ;

Putting that together, we have this word to parse "flex hex" colors (removing hash-marks from the left if present):

: flex-hex ( str -- hex )
    "#" ?head drop hex-only three-groups hex-rgb "" join ;

And, of course, some tests to verify that we handle lots of different cases:

{ "00b000" } [ "#zqbttv" flex-hex ] unit-test

{ "0f0000" } [ "f" flex-hex ] unit-test
{ "000f00" } [ "0f" flex-hex ] unit-test
{ "000f00" } [ "0f0" flex-hex ] unit-test
{ "0f0f00" } [ "0f0f" flex-hex ] unit-test
{ "0ff000" } [ "0f0f0f0" flex-hex ] unit-test

{ "ad0e0e" } [ "adamlevine" flex-hex ] unit-test
{ "000000" } [ "MrT" flex-hex ] unit-test
{ "00c000" } [ "sick" flex-hex ] unit-test
{ "c0a000" } [ "crap" flex-hex ] unit-test
{ "c00000" } [ "chucknorris" flex-hex ] unit-test

{ "6ecde0" } [
    "6db6ec49efd278cd0bc92d1e5e072d68" flex-hex
] unit-test

The code for this is on my GitHub.

Monday, November 4, 2013

Fast Factorial

You might remember that I implemented a bunch of factorial words several months ago.

Recently, I was looking at Python's latest mathmodule.c, and noticed an implementation of a "divide-and-conquer factorial algorithm". The algorithm is based on one described by Peter Luschny as the binary-split formula of the factorial of n. It seemed like a fun thing to implement and benchmark in Factor.

It turns out this was one of the improvements made during the Python 3.2 development cycle. Calculating the 50,000th factorial takes 0.724 seconds with Python 2.7.5 and only 0.064 seconds with Python 3.3.2 - a nice improvement!

slow-factorial

The most basic factorial would just multiply the numbers from 1 to n:

: slow-factorial ( n -- n! )
    dup 1 > [ [1,b] 1 [ * ] reduce ] [ drop 1 ] if ;

Calculating the 50,000 factorial this way takes 0.915 seconds.

factorial

It turns out that doing that causes a lot of arbitrary-precision bignum multiplication. We can instead use binary-reduce (used by the standard library product and sum words), which is a generic algorithm that recursively multiplies groups of numbers which allows for more fixnum multiplications.

: factorial ( n -- n! )
    dup 1 > [ [1,b] 1 [ * ] binary-reduce ] [ drop 1 ] if ;

Calculating the 50,000 factorial takes 0.217 seconds.

fast-factorial

We can implement Peter Luschny's algorithm, keeping the structure very close to his sample code:

:: part-product ( n m -- x )
    {
        { [ m n 1 + <= ] [ n ] }
        { [ m n 2 + = ] [ n m * ] }
        [
            n m + 2 /i dup even? [ 1 - ] when :> k
            n k k 2 + m [ part-product ] 2bi@ *
        ]
    } cond ;

:: factorial-loop ( n p r -- p' r' )
    n 2 > [
        n 2 /i :> mid
        mid p r factorial-loop
        [
            mid 1 + mid 1 bitand +
            n 1 - n 1 bitand + part-product *
        ] [ dupd * ] bi*
    ] [ p r ] if ;

: fast-factorial ( x -- n )
    [ 1 1 factorial-loop nip ] [ dup bit-count - ] bi shift ;

Calculating the 50,000 factorial now only takes 0.171 seconds!

The code for this is on my GitHub.

Note: there are other algorithms that are likely faster, but for this purpose I wanted to implement a similar algorithm to Python. Their implementation has a few modifications including a tight loop for multiplications that fit into an unsigned long and is over 150 lines of code.