## Friday, December 13, 2013

### Monte Carlo

The Monte Carlo method is an estimation technique that uses random simulation to solve physical or mathematical problems. Named after the Monte Carlo Casino, you can think of the method as playing a game of chance many times and recording the outcomes (such as how frequently one wins or loses).

One classic example from the Wikipedia article is estimating the value of π (although there are many other ways to approximate the value of π):

For example, consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:

1. Draw a square on the ground, then inscribe a circle within it.
2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
3. Count the number of objects inside the circle and the total number of objects.
4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

You can visualize how this works by watching the estimate become more accurate as the number of points increases.
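For readers who don't speak Factor, the whole experiment fits in a few lines of Python (a sketch of the same approach, not part of the original post):

```python
import random

def estimate_pi(points):
    # scatter points in the unit square; count those inside the
    # quarter circle of radius 1 centered at the origin
    inside = sum(
        1
        for _ in range(points)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return inside / points * 4  # the area ratio is pi/4

random.seed(7)
print(estimate_pi(100_000))
```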

We can generate a random point inside the unit square by generating two `random-unit` values:

```factor
: random-point ( -- x y )
    random-unit random-unit ;
```

Using the Pythagorean theorem, we can calculate the distance of the point from the origin. If the distance is less than or equal to 1.0, then it is inside the circle:

```factor
: inside-circle? ( x y -- ? )
    [ sq ] bi@ + sqrt 1.0 <= ;
```

We can then estimate the value of π by computing a number of points, taking the fraction that are inside the circle, and multiplying by 4:

```factor
: estimate-pi ( points -- pi-estimate )
    0 swap [
        [ random-point inside-circle? [ 1 + ] when ] times
    ] keep /f 4 * ;
```

We can run this for varying numbers of points and see how it gets more accurate:

```factor
IN: scratchpad { 100 1,000 10,000 100,000 1,000,000 10,000,000 }
               [ estimate-pi . ] each
3.2
3.168
3.162
3.15176
3.14288
3.1418212
```

## Tuesday, December 10, 2013

### UU Encoding

Just a quick note, as of a couple months ago, Factor has support for uuencoding (and uudecoding)!

You can perform a uuencode:

```factor
IN: scratchpad "Factor" string>uu print
begin
&1F%C=&]R
end
```
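Python's `binascii` module implements the same encoding, which makes a handy cross-check (`b2a_uu` encodes one line of up to 45 bytes):

```python
import binascii

# "&" is the length character (32 + 6), followed by the encoded bytes
print(binascii.b2a_uu(b"Factor"))    # b'&1F%C=&]R\n'
print(binascii.a2b_uu("&1F%C=&]R"))  # b'Factor'
```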

...and also a uudecode:

```factor
IN: scratchpad """
begin 644 factor.txt
&1F%C=&]R
end
""" uu>string .
"Factor"
```

Right now, it operates on text directly and doesn't preserve the file name and permissions in the `begin` header, but that would be an easy improvement.

The code for this is available in the development version of Factor.

## Thursday, December 5, 2013

### Humanhash

Zachary Voase published the humanhash project on GitHub, making "human-readable representations of digests". Below is a compatible implementation in Factor.

To show how it will work, we use the example from his humanhash README:

```factor
IN: scratchpad CONSTANT: digest "7528880a986c40e78c38115e640da2a1"

IN: scratchpad digest humanhash .
"three-georgia-xray-jig"

IN: scratchpad digest 6 humanhash-words .
"high-mango-white-oregon-purple-charlie"

IN: scratchpad human-uuid4 2array .
{
    "28129036-75a7-4c87-984b-4b32231e0a0d"
    "nineteen-bluebird-oxygen-edward"
}
```

## Implementation

We need a list of 256 words, one to represent each possible byte:

```factor
CONSTANT: default-wordlist {
    "ack" "alabama" "alanine" "alaska" "alpha" "angel" "apart"
    "april" "arizona" "arkansas" "artist" "asparagus" "aspen"
    "august" "autumn" "avocado" "bacon" "bakerloo" "batman" "beer"
    "berlin" "beryllium" "black" "blossom" "blue" "bluebird" "bravo"
    "bulldog" "burger" "butter" "california" "carbon" "cardinal"
    "carolina" "carpet" "cat" "ceiling" "charlie" "chicken" "coffee"
    "cola" "cold" "colorado" "comet" "connecticut" "crazy" "cup"
    "dakota" "december" "delaware" "delta" "diet" "don" "double"
    "early" "earth" "east" "echo" "edward" "eight" "eighteen"
    "eleven" "emma" "enemy" "equal" "failed" "fanta" "fifteen"
    "fillet" "finch" "fish" "five" "fix" "floor" "florida"
    "football" "four" "fourteen" "foxtrot" "freddie" "friend"
    "fruit" "gee" "georgia" "glucose" "golf" "green" "grey" "hamper"
    "happy" "harry" "hawaii" "helium" "high" "hot" "hotel"
    "hydrogen" "idaho" "illinois" "india" "indigo" "ink" "iowa"
    "island" "item" "jersey" "jig" "johnny" "juliet" "july"
    "jupiter" "kansas" "kentucky" "kilo" "king" "kitten" "lactose"
    "lake" "lamp" "lemon" "leopard" "lima" "lion" "lithium" "london"
    "louisiana" "low" "magazine" "magnesium" "maine" "mango" "march"
    "mars" "maryland" "massachusetts" "may" "mexico" "michigan"
    "mike" "minnesota" "mirror" "mississippi" "missouri" "mobile"
    "mockingbird" "monkey" "montana" "moon" "mountain" "muppet"
    "nineteen" "nitrogen" "north" "november" "nuts" "october" "ohio"
    "oklahoma" "one" "orange" "oranges" "oregon" "oscar" "oven"
    "oxygen" "papa" "paris" "pasta" "pennsylvania" "pip" "pizza"
    "pluto" "potato" "princess" "purple" "quebec" "queen" "quiet"
    "red" "river" "robert" "robin" "romeo" "rugby" "sad" "salami"
    "saturn" "september" "seven" "seventeen" "shade" "sierra"
    "single" "sink" "six" "sixteen" "skylark" "snake" "social"
    "sodium" "solar" "south" "spaghetti" "speaker" "spring"
    "stairway" "steak" "stream" "summer" "sweet" "table" "tango"
    "ten" "tennessee" "tennis" "texas" "thirteen" "three" "timing"
    "triple" "twelve" "twenty" "two" "uncle" "undress" "uniform"
    "uranus" "utah" "vegan" "venus" "vermont" "victor" "video"
    "violet" "virginia" "washington" "west" "whiskey" "white"
    "william" "winner" "winter" "wisconsin" "wolfram" "wyoming"
    "xray" "yankee" "yellow" "zebra" "zulu"
}
```

One of the inputs is the number of words to produce. An error is produced if fewer bytes are provided than the number of words requested:

```factor
ERROR: too-few-bytes seq #words ;

: check-bytes ( seq #words -- seq #words )
    2dup [ length ] [ < ] bi* [ too-few-bytes ] when ; inline
```

Input is grouped into subsequences, where the number of subsequences is the number of words requested in the output. It looks a little odd, but it basically makes evenly sized groups and then folds any remainder into the last group:

```factor
: group-words ( seq #words -- groups )
    [ dupd [ length ] [ /i ] bi* group ]
    [ 1 - cut concat suffix ] bi ; inline
```

Our input bytes are compressed, first by grouping them into words, then by XORing the bytes in each word:

```factor
: compress-bytes ( seq #words -- newseq )
    check-bytes group-words [ 0 [ bitxor ] reduce ] map ;
```
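In Python, the same grouping-and-XOR step might look like this (a sketch; `compress` and its error message are my names, not humanhash's API):

```python
def compress(data, n):
    # fold a byte sequence down to n bytes by XORing each group
    if len(data) < n:
        raise ValueError("fewer bytes than words requested")
    size = len(data) // n
    # n - 1 evenly sized groups, with any remainder folded into the last
    groups = [data[i * size:(i + 1) * size] for i in range(n - 1)]
    groups.append(data[(n - 1) * size:])
    result = []
    for group in groups:
        acc = 0
        for byte in group:
            acc ^= byte
        result.append(acc)
    return result

print(compress([1, 2, 3, 4, 5, 6], 2))  # [0, 7]
```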

Our input will either be a byte-array, or a hex-string with every two characters representing the hexadecimal value of each byte:

```factor
: byte-string ( hexdigest -- seq )
    dup byte-array? [ 2 <groups> [ hex> ] map ] unless ;
```

Making a humanhash is simply converting the input, compressing into bytes representing each word, looking up the word from the word list, and joining with a requested separator:

```factor
: make-humanhash ( hexdigest #words wordlist sep -- hash )
    { [ byte-string ] [ compress-bytes ] [ nths ] [ join ] } spread ;
```

We provide a way to hash into a requested number of words, or four by default:

```factor
: humanhash-words ( hexdigest #words -- hash )
    default-wordlist "-" make-humanhash ;

: humanhash ( hexdigest -- hash )
    4 humanhash-words ;
```

And since the humanhash project includes a way to create humanhash'd uuids, we do also:

```factor
: human-uuid4 ( -- uuid hash )
    uuid4 dup [ CHAR: - = not ] filter humanhash ;
```

## Thursday, November 28, 2013

### tzfile

I have wanted to parse timezone information files (also known as "tzfile") for a while, in particular so that Factor can begin to support named timezones in a smarter way.

## Parsing

The tzfile is a binary format file from the tz database (also known as the "zoneinfo database"). Each tzfile starts with the four magic bytes "`TZif`", which we can check:

```factor
ERROR: bad-magic ;

: check-magic ( -- )
    4 read "TZif" sequence= [ bad-magic ] unless ;
```
The tzfile then contains a header followed by a series of `ttinfo` structures and other information:

```factor
STRUCT: tzhead
    { tzh_reserved char[16] }
    { tzh_ttisgmtcnt be32 }
    { tzh_ttisstdcnt be32 }
    { tzh_leapcnt be32 }
    { tzh_timecnt be32 }
    { tzh_typecnt be32 }
    { tzh_charcnt be32 } ;

PACKED-STRUCT: ttinfo
    { tt_gmtoff be32 }
    { tt_isdst uchar }
    { tt_abbrind uchar } ;
```
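The same header can be unpacked with Python's `struct` module (a sketch of the version-1 layout; the 1-byte version that the Factor struct folds into `tzh_reserved` is split out here):

```python
import struct

# 4-byte magic, 1-byte version, 15 reserved bytes, six big-endian counts
TZHEAD = struct.Struct(">4sc15x6I")

def parse_header(data):
    magic, version, gmtcnt, stdcnt, leapcnt, timecnt, typecnt, charcnt = \
        TZHEAD.unpack_from(data)
    if magic != b"TZif":
        raise ValueError("bad magic")
    return {"ttisgmtcnt": gmtcnt, "ttisstdcnt": stdcnt,
            "leapcnt": leapcnt, "timecnt": timecnt,
            "typecnt": typecnt, "charcnt": charcnt}
```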

We can store all the information parsed from the tzfile in a tuple:

```factor
TUPLE: tzfile header transition-times local-times types abbrevs
leaps is-std is-gmt ;

C: <tzfile> tzfile
```

With a helper word to read 32-bit big-endian numbers, we can parse the entire file:

```factor
: read-be32 ( -- n )
    4 read be> ;

: read-tzfile ( -- tzfile )
    check-magic tzhead read-struct {
        ! the reads below follow the tzfile(5) on-disk order
        [ ]
        [ tzh_timecnt>> [ read-be32 ] replicate ]
        [ tzh_timecnt>> [ read1 ] replicate ]
        [ tzh_typecnt>> [ ttinfo read-struct ] replicate ]
        [ tzh_charcnt>> read ]
        [ tzh_leapcnt>> [ read-be32 read-be32 2array ] replicate ]
        [ tzh_ttisstdcnt>> [ read1 ] replicate ]
        [ tzh_ttisgmtcnt>> [ read1 ] replicate ]
    } cleave <tzfile> ;
```

All of that data specifies a series of local time types and transition times:

```factor
TUPLE: local-time gmt-offset dst? abbrev std? gmt? ;

C: <local-time> local-time

TUPLE: transition seconds timestamp local-time ;

C: <transition> transition
```

The abbreviated local time names are stored in a flattened array. It would be helpful to parse them out into a hashtable where the key is the starting character index in the flattened array:

```factor
:: tznames ( abbrevs -- assoc )
    0 [
        0 over abbrevs index-from dup
    ] [
        [ dupd abbrevs subseq >string 2array ] keep 1 + swap
    ] produce 2nip >hashtable ;
```
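The idea is easier to see in Python: the names arrive as one flat NUL-separated string (e.g. `"PST\0PDT\0"`), and each `ttinfo` refers to a name by its starting offset:

```python
def tznames(abbrevs):
    # map each abbreviation's starting index to the NUL-terminated name
    names = {}
    start = 0
    while start < len(abbrevs):
        end = abbrevs.index("\0", start)
        names[start] = abbrevs[start:end]
        start = end + 1
    return names

print(tznames("PST\0PDT\0"))  # {0: 'PST', 4: 'PDT'}
```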

We can now construct an array of all the transition times and the local time types they represent. This is a lot of logic for a typical Factor word, so we use local variables to make it easier to understand:

```factor
:: tzfile>transitions ( tzfile -- transitions )
    tzfile abbrevs>> tznames :> abbrevs
    tzfile is-std>> :> is-std
    tzfile is-gmt>> :> is-gmt
    tzfile types>> [
        [
            {
                [ tt_gmtoff>> seconds ]
                [ tt_isdst>> 1 = ]
                [ tt_abbrind>> abbrevs at ]
            } cleave
        ] dip
        [ is-std ?nth dup [ 1 = ] when ]
        [ is-gmt ?nth dup [ 1 = ] when ] bi <local-time>
    ] map-index :> local-times
    tzfile transition-times>>
    tzfile local-times>> [
        [ dup unix-time>timestamp ] [ local-times nth ] bi*
        <transition>
    ] 2map ;
```

We want to wrap the `tzfile` parsed structure and the transitions in a `tzinfo` object that can be used later with timestamps. These `tzinfo` objects are created by parsing from specific files by path or by their zoneinfo name:

```factor
TUPLE: tzinfo tzfile transitions ;

C: <tzinfo> tzinfo

: file>tzinfo ( path -- tzinfo )
    binary [
        read-tzfile dup tzfile>transitions <tzinfo>
    ] with-file-reader ;

: load-tzinfo ( name -- tzinfo )
    "/usr/share/zoneinfo/" prepend file>tzinfo ;
```

## Timestamps

Now that we have the `tzinfo`, we can convert a UTC timestamp into the timezone specified by our tzfile. This is accomplished by finding the transition time that affects the requested timestamp and adjusting by the GMT offset that it represents:

```factor
: find-transition ( timestamp tzinfo -- transition )
    [ timestamp>unix-time ] [ transitions>> ] bi*
    [ [ seconds>> before? ] with find drop ]
    [ swap [ 1 [-] swap nth ] [ last ] if* ] bi ;

: from-utc ( timestamp tzinfo -- timestamp' )
    [ drop instant >>gmt-offset ]
    [ find-transition local-time>> gmt-offset>> ] 2bi
    convert-timezone ;
```
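Since the transitions are sorted by time, the lookup amounts to a binary search. A Python sketch, representing transitions as `(seconds, gmt_offset)` pairs (names are mine):

```python
import bisect

def find_transition(ts, transitions):
    # find the last transition at or before ts; times before the
    # first transition just fall back to the first entry
    times = [seconds for seconds, _ in transitions]
    index = bisect.bisect_right(times, ts) - 1
    return transitions[max(index, 0)]

transitions = [(0, -8 * 3600), (100, -7 * 3600)]
print(find_transition(50, transitions))   # (0, -28800)
print(find_transition(100, transitions))  # (100, -25200)
```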

Or normalize a timestamp that might be in a different timezone into the timezone specified by our tzfile (converting into and then out of UTC):

```factor
: normalize ( timestamp tzinfo -- timestamp' )
    [ instant convert-timezone ] [ from-utc ] bi* ;
```

## Example

An example of it working: we take a date in PST that is just after a daylight saving time transition, print it out, subtract 10 minutes, normalize to the "US/Pacific" zoneinfo file, and print it again, showing the time in PDT:

```factor
IN: scratchpad ! Take a time in PST
               2002 10 27 1 0 0 -8 hours <timestamp>

! Print it out
dup "%c" strftime .
"Sun Oct 27 01:00:00 2002"

IN: scratchpad ! Subtract 10 minutes
               10 minutes time-

! Normalize to US/Pacific timezone
"US/Pacific" load-tzinfo normalize

! Print it out
"%c" strftime .
"Sun Oct 27 01:50:00 2002"
```
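For comparison, Python's `zoneinfo` module (Python 3.9+, backed by the same tzfiles in `/usr/share/zoneinfo`) reproduces the example:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

pst = timezone(timedelta(hours=-8))              # fixed PST offset
t = datetime(2002, 10, 27, 1, 0, 0, tzinfo=pst)  # 1:00 PST
t = (t - timedelta(minutes=10)).astimezone(ZoneInfo("US/Pacific"))
print(t.strftime("%a %b %d %H:%M:%S %Y"))        # Sun Oct 27 01:50:00 2002
```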

The code for this is available in the development version of Factor.

## Monday, November 25, 2013

### N-Numbers

In the United States, "N-Numbers" are the name given to aircraft registrations. Some of the services that the FAA provides include the ability to look up an aircraft by N-Number or to reserve an N-Number.

Below we implement the rules to detect if a string is a valid N-Number in Factor.

An N-Number (following the "N" prefix):

- may not begin with zero.
- may not use the letters "I" or "O", to avoid confusion with the numbers one and zero.

```factor
: (n-number?) ( digits letters -- ? )
    [ dup first CHAR: 0 = [ drop f ] [ [ digit? ] all? ] if ]
    [ [ [ Letter? ] [ "IiOo" member? not ] bi and ] all? ]
    bi* and ;
```
In addition, it:

- may be one to five numbers (e.g., N12345).
- may be one to four numbers and one suffix letter (e.g., N1A and N1234Z).
- may be one to three numbers and two suffix letters (e.g., N24BY and N123AZ).

```factor
: n-number? ( str -- ? )
    {
        [ { [ length 1 5 between? ] [ f (n-number?) ] } 1&& ]
        [ { [ length 2 5 between? ] [ 1 cut* (n-number?) ] } 1&& ]
        [ { [ length 3 5 between? ] [ 2 cut* (n-number?) ] } 1&& ]
    } 1|| ;
```
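The same three shapes collapse nicely into a regular expression (a sketch assuming uppercase registrations; the character class is A–Z minus I and O):

```python
import re

LETTER = "[A-HJ-NP-Z]"  # A-Z without I and O
N_NUMBER = re.compile(
    rf"[1-9][0-9]{{0,4}}"                # one to five digits
    rf"|[1-9][0-9]{{0,3}}{LETTER}"       # up to four digits, one letter
    rf"|[1-9][0-9]{{0,2}}{LETTER}{{2}}"  # up to three digits, two letters
)

def n_number(s):
    return N_NUMBER.fullmatch(s) is not None

print(n_number("1234Z"))  # True
print(n_number("0123"))   # False
```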

Registration numbers N1 through N99 are reserved for Federal Aviation Administration (FAA) internal use and are not available.

```factor
: reserved? ( str -- ? )
    { [ length 1 2 between? ] [ [ digit? ] all? ] } 1&& ;
```

The code and some tests for this are on my GitHub.

## Friday, November 22, 2013

### tree

I've used Factor to build several common unix programs including copy, cat, fortune, wc, move, uniq, and others.

Today, I wanted to show how to build the `tree` program. If we look at the man page, we can see what it does:

> `Tree` is a recursive directory listing program that produces a depth indented listing of files. With no arguments, `tree` lists the files in the current directory. When directory arguments are given, `tree` lists all the files and/or directories found in the given directories each in turn. Upon completion of listing all files/directories found, `tree` returns the total number of files and/or directories listed.

We need to keep track of files and directories that are traversed:

```factor
SYMBOL: #files
SYMBOL: #directories
```

Each file name is indented according to a specified depth:

```factor
: indent ( n -- )
    [ [ "|   " write ] times ] unless-zero "+-- " write ;

: write-name ( indent entry -- )
    [ indent ] [ name>> write ] bi* ;
```

File names are written and increment the `#files` counter:

```factor
: write-file ( indent entry -- )
    write-name #files [ 1 + ] change-global ;
```

Directory names are written, increase the indent, recurse writing their directory tree, and increment the `#directories` counter:

```factor
DEFER: write-tree

: write-dir ( indent entry -- )
    [ write-name ] [
        [ [ 1 + ] [ name>> ] bi* write-tree ]
        [ 3drop " [error opening dir]" write ] recover
    ] 2bi #directories [ 1 + ] change-global ;
```

Using `write-file` and `write-dir`, we can implement `write-tree` to sort the entries and then write each according to their type (e.g., file or directory):

```factor
: write-tree ( indent path -- )
    [
        [ name>> ] sort-with [
            nl [ dup ] bi@ type>> +directory+ =
            [ write-dir ] [ write-file ] if
        ] each drop
    ] with-directory-entries ;
```

Finally, we can implement the `tree` command, initializing the file and directory count, recursing on the path specified, and then printing out the number of files and directories observed:

```factor
: tree ( path -- )
    0 #directories set-global 0 #files set-global
    [ write ] [ 0 swap write-tree ] bi nl
    #directories get-global #files get-global
    "\n%d directories, %d files\n" printf ;
```
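Roughly the same algorithm in Python, minus the error handling (`tree` here is my sketch, not a wrapper for the real program):

```python
import os

def tree(path, indent=0, counts=None):
    # print a depth-indented listing while counting what we visit
    if counts is None:
        counts = {"dirs": 0, "files": 0}
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        print("|   " * indent + "+-- " + entry.name)
        if entry.is_dir(follow_symlinks=False):
            counts["dirs"] += 1
            tree(entry.path, indent + 1, counts)
        else:
            counts["files"] += 1
    return counts
```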

Our command-line word will either operate on the arguments specifying a list of directories to process, or the current directory if none are provided:

```factor
: run-tree ( -- )
    command-line get [
        current-directory get tree
    ] [
        [ tree ] each
    ] if-empty ;

MAIN: run-tree
```

This is available on my GitHub.

## Monday, November 18, 2013

### MessagePack

MessagePack is an "efficient binary serialization format" designed to be faster and smaller than JSON, with support in many programming languages.

Less than 150 lines of code!

What is that, you ask? Why, that's the number of lines of code it took to implement a MessagePack encoder and decoder in Factor.

## Reading

For decoding, our strategy will be to create a `read-msgpack` word that can operate on the current input-stream (allowing reuse for MessagePack objects read from files, strings, or the network).

`DEFER: read-msgpack`

Aside from support for basic data types such as integers, floating-point numbers, and strings, we also need to support arrays of objects, maps of key/value pairs, and so-called "extended" object types:

```factor
: read-array ( n -- obj )
    [ read-msgpack ] replicate ;

: read-map ( n -- obj )
    2 * read-array 2 group >hashtable ;

: read-ext ( n -- obj )
    ! assumed layout: the type byte, then n data bytes
    read1 swap read 2array ;
```
We need a way to specify a "nil" (or "null") object since we use `t` and `f` for booleans:

`SINGLETON: +msgpack-nil+`

And, of course, an error to indicate when a requested format is not supported:

`ERROR: unknown-format n ;`

With those definitions done, we can build a word to read a single MessagePack object from a stream:

```factor
: read-msgpack ( -- obj )
    read1 {
        { [ dup 0xc0 = ] [ drop +msgpack-nil+ ] }
        { [ dup 0xc2 = ] [ drop f ] }
        { [ dup 0xc3 = ] [ drop t ] }
        { [ dup 0x00 0x7f between? ] [ ] }
        { [ dup 0xe0 mask? ] [ 1array signed-be> ] }
        { [ dup 0xcc = ] [ drop read1 ] }
        { [ dup 0xcd = ] [ drop 2 read be> ] }
        { [ dup 0xce = ] [ drop 4 read be> ] }
        { [ dup 0xcf = ] [ drop 8 read be> ] }
        { [ dup 0xd0 = ] [ drop 1 read signed-be> ] }
        { [ dup 0xd1 = ] [ drop 2 read signed-be> ] }
        { [ dup 0xd2 = ] [ drop 4 read signed-be> ] }
        { [ dup 0xd3 = ] [ drop 8 read signed-be> ] }
        { [ dup 0xca = ] [ drop 4 read be> bits>float ] }
        { [ dup 0xcb = ] [ drop 8 read be> bits>double ] }
        { [ dup 0xd9 = ] [ drop read1 read "" like ] }
        { [ dup 0xda = ] [ drop 2 read be> read "" like ] }
        { [ dup 0xdb = ] [ drop 4 read be> read "" like ] }
        { [ dup 0xc4 = ] [ drop read1 read B{ } like ] }
        { [ dup 0xc5 = ] [ drop 2 read be> read B{ } like ] }
        { [ dup 0xc6 = ] [ drop 4 read be> read B{ } like ] }
        { [ dup 0xdc = ] [ drop 2 read be> read-array ] }
        { [ dup 0xdd = ] [ drop 4 read be> read-array ] }
        { [ dup 0xde = ] [ drop 2 read be> read-map ] }
        { [ dup 0xdf = ] [ drop 4 read be> read-map ] }
        { [ dup 0xd4 = ] [ drop 1 read-ext ] }
        { [ dup 0xd5 = ] [ drop 2 read-ext ] }
        { [ dup 0xd6 = ] [ drop 4 read-ext ] }
        { [ dup 0xd7 = ] [ drop 8 read-ext ] }
        { [ dup 0xd8 = ] [ drop 16 read-ext ] }
        { [ dup 0xc7 = ] [ drop read1 read-ext ] }
        { [ dup 0xc8 = ] [ drop 2 read be> read-ext ] }
        { [ dup 0xc9 = ] [ drop 4 read be> read-ext ] }
        ! fix formats, matching the fixstr, fixarray, and fixmap
        ! headers the writer emits below
        { [ dup 0xa0 mask? ] [ 0x1f bitand read "" like ] }
        { [ dup 0x90 mask? ] [ 0xf bitand read-array ] }
        { [ dup 0x80 mask? ] [ 0xf bitand read-map ] }
        [ unknown-format ]
    } cond ;
```

Pretty simple!

## Writing

For encoding, our strategy will be to define a generic `write-msgpack` word that will dispatch off the type of object being encoded and operate on the current output-stream (allowing reuse for MessagePack objects written to files, strings, or the network).

`GENERIC: write-msgpack ( obj -- )`

And, of course, an error to indicate when a requested object type isn't supported:

`ERROR: cannot-convert obj ;`

Writing the "nil" (or "null") object and the boolean values true and false:

```factor
M: +msgpack-nil+ write-msgpack drop 0xc0 write1 ;

M: f write-msgpack drop 0xc2 write1 ;

M: t write-msgpack drop 0xc3 write1 ;
```

Support for integers and floating point numbers:

```factor
M: integer write-msgpack
    dup 0 >= [
        {
            { [ dup 0x7f <= ] [ write1 ] }
            { [ dup 0xff <= ] [ 0xcc write1 1 >be write ] }
            { [ dup 0xffff <= ] [ 0xcd write1 2 >be write ] }
            { [ dup 0xffffffff <= ] [ 0xce write1 4 >be write ] }
            { [ dup 0xffffffffffffffff <= ]
              [ 0xcf write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] [
        {
            { [ dup -0x20 >= ] [ 1 >be write ] }
            { [ dup -0x80 >= ] [ 0xd0 write1 1 >be write ] }
            { [ dup -0x8000 >= ] [ 0xd1 write1 2 >be write ] }
            { [ dup -0x80000000 >= ] [ 0xd2 write1 4 >be write ] }
            { [ dup -0x8000000000000000 >= ]
              [ 0xd3 write1 8 >be write ] }
            [ cannot-convert ]
        } cond
    ] if ;

M: float write-msgpack
    0xcb write1 double>bits 8 >be write ;
```
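The format-selection logic reads more plainly in Python (a sketch of the same dispatch, using `struct` for the big-endian packing; `encode_int` is my name):

```python
import struct

def encode_int(n):
    # minimal sketch of MessagePack's integer format selection
    if 0 <= n <= 0x7f:
        return bytes([n])         # positive fixint
    if -0x20 <= n < 0:
        return bytes([n & 0xff])  # negative fixint
    if n > 0:
        widths = [(0xcc, ">B", 0xff), (0xcd, ">H", 0xffff),
                  (0xce, ">I", 0xffffffff), (0xcf, ">Q", 2 ** 64 - 1)]
        for code, fmt, limit in widths:
            if n <= limit:
                return bytes([code]) + struct.pack(fmt, n)
    else:
        widths = [(0xd0, ">b", -0x80), (0xd1, ">h", -0x8000),
                  (0xd2, ">i", -0x80000000), (0xd3, ">q", -2 ** 63)]
        for code, fmt, limit in widths:
            if n >= limit:
                return bytes([code]) + struct.pack(fmt, n)
    raise ValueError("integer out of range for MessagePack")

print(encode_int(200).hex())    # ccc8
print(encode_int(70000).hex())  # ce00011170
```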

Support for strings and byte-arrays:

```factor
M: string write-msgpack
    dup length {
        { [ dup 0x1f <= ] [ 0xa0 bitor write1 ] }
        { [ dup 0xff <= ] [ 0xd9 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xda write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdb write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;

M: byte-array write-msgpack
    dup length {
        { [ dup 0xff <= ] [ 0xc4 write1 write1 ] }
        { [ dup 0xffff <= ] [ 0xc5 write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xc6 write1 4 >be write ] }
        [ cannot-convert ]
    } cond write ;
```

Support for arrays of MessagePack objects:

```factor
: write-array-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x90 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xdc write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdd write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: sequence write-msgpack
    dup length write-array-header [ write-msgpack ] each ;
```

Support for maps of key/value pairs:

```factor
: write-map-header ( n -- )
    {
        { [ dup 0xf <= ] [ 0x80 bitor write1 ] }
        { [ dup 0xffff <= ] [ 0xde write1 2 >be write ] }
        { [ dup 0xffffffff <= ] [ 0xdf write1 4 >be write ] }
        [ cannot-convert ]
    } cond ;

M: assoc write-msgpack
    dup assoc-size write-map-header
    [ [ write-msgpack ] bi@ ] assoc-each ;
```

## Convenience

To conveniently convert into and out of the MessagePack format, we can make words to read from and write to strings:

```: msgpack> ( string -- obj )