Friday, December 13, 2013

Monte Carlo

The Monte Carlo method is a method of estimation that uses random simulation to solve physical or mathematical problems. Named after the Monte Carlo Casino, you can think of the method as playing a game of chance many times and recording the outcomes (such as how frequently one wins or loses).

One classic example from the Wikipedia article is estimating the value of π (although there are many other ways to approximate the value of π):

For example, consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:

  1. Draw a square on the ground, then inscribe a circle within it.
  2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
  3. Count the number of objects inside the circle and the total number of objects.
  4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

You can visualize how this works by seeing the estimate get more correct as you increase the number of points:

We can generate a random point inside the unit circle by generating two random-unit values:

: random-point ( -- x y )
    random-unit random-unit ;

Using the Pythagorean theorem, we can calculate the distance from the zero point. If the distance is less than or equal to 1.0, then it is inside the circle:

: inside-circle? ( x y -- ? )
    [ sq ] bi@ + sqrt 1.0 <= ;

We can then estimate the value of π by computing a number of points, taking the percentage that are inside the circle and multiplying by 4:

: estimate-pi ( points -- pi-estimate )
    0 swap [
        [ random-point inside-circle? [ 1 + ] when ] times
    ] keep /f 4 * ;

We can run this for varying numbers of points and see how it gets more accurate:

IN: scratchpad { 100 1,000 10,000 100,000 1,000,000 10,000,000 }
               [ estimate-pi . ] each
3.2
3.168
3.162
3.15176
3.14288
3.1418212

Tuesday, December 10, 2013

UU Encoding

Just a quick note, as of a couple months ago, Factor has support for uuencoding (and uudecoding)!

You can perform a uuencode:

IN: scratchpad "Factor" string>uu print
begin
&1F%C=&]R
end

...and also a uudecode:

IN: scratchpad """
               begin 644 factor.txt
               &1F%C=&]R
               end
               """ uu>string .
"Factor"

Right now, it operates on text directly and doesn't preserve the file name and permissions in the begin header, but that would be an easy improvement.

The code for this is available in the development version of Factor.

Thursday, December 5, 2013

Humanhash

Zachary Voase published the humanhash project on GitHub, making "human-readable representations of digests". Below is a compatible implementation in Factor.

To show how it will work, we use the example from his humanhash README:

IN: scratchpad CONSTANT: digest "7528880a986c40e78c38115e640da2a1"

IN: scratchpad digest humanhash .
"three-georgia-xray-jig"

IN: scratchpad digest 6 humanhash-words .
"high-mango-white-oregon-purple-charlie"

IN: scratchpad human-uuid4 2array .
{
    "28129036-75a7-4c87-984b-4b32231e0a0d"
    "nineteen-bluebird-oxygen-edward"
}

Implementation

We need a list of 256 words, one to represent each possible byte:

CONSTANT: default-wordlist {
    "ack" "alabama" "alanine" "alaska" "alpha" "angel" "apart"
    "april" "arizona" "arkansas" "artist" "asparagus" "aspen"
    "august" "autumn" "avocado" "bacon" "bakerloo" "batman" "beer"
    "berlin" "beryllium" "black" "blossom" "blue" "bluebird" "bravo"
    "bulldog" "burger" "butter" "california" "carbon" "cardinal"
    "carolina" "carpet" "cat" "ceiling" "charlie" "chicken" "coffee"
    "cola" "cold" "colorado" "comet" "connecticut" "crazy" "cup"
    "dakota" "december" "delaware" "delta" "diet" "don" "double"
    "early" "earth" "east" "echo" "edward" "eight" "eighteen"
    "eleven" "emma" "enemy" "equal" "failed" "fanta" "fifteen"
    "fillet" "finch" "fish" "five" "fix" "floor" "florida"
    "football" "four" "fourteen" "foxtrot" "freddie" "friend"
    "fruit" "gee" "georgia" "glucose" "golf" "green" "grey" "hamper"
    "happy" "harry" "hawaii" "helium" "high" "hot" "hotel"
    "hydrogen" "idaho" "illinois" "india" "indigo" "ink" "iowa"
    "island" "item" "jersey" "jig" "johnny" "juliet" "july"
    "jupiter" "kansas" "kentucky" "kilo" "king" "kitten" "lactose"
    "lake" "lamp" "lemon" "leopard" "lima" "lion" "lithium" "london"
    "louisiana" "low" "magazine" "magnesium" "maine" "mango" "march"
    "mars" "maryland" "massachusetts" "may" "mexico" "michigan"
    "mike" "minnesota" "mirror" "mississippi" "missouri" "mobile"
    "mockingbird" "monkey" "montana" "moon" "mountain" "muppet"
    "music" "nebraska" "neptune" "network" "nevada" "nine"
    "nineteen" "nitrogen" "north" "november" "nuts" "october" "ohio"
    "oklahoma" "one" "orange" "oranges" "oregon" "oscar" "oven"
    "oxygen" "papa" "paris" "pasta" "pennsylvania" "pip" "pizza"
    "pluto" "potato" "princess" "purple" "quebec" "queen" "quiet"
    "red" "river" "robert" "robin" "romeo" "rugby" "sad" "salami"
    "saturn" "september" "seven" "seventeen" "shade" "sierra"
    "single" "sink" "six" "sixteen" "skylark" "snake" "social"
    "sodium" "solar" "south" "spaghetti" "speaker" "spring"
    "stairway" "steak" "stream" "summer" "sweet" "table" "tango"
    "ten" "tennessee" "tennis" "texas" "thirteen" "three" "timing"
    "triple" "twelve" "twenty" "two" "uncle" "undress" "uniform"
    "uranus" "utah" "vegan" "venus" "vermont" "victor" "video"
    "violet" "virginia" "washington" "west" "whiskey" "white"
    "william" "winner" "winter" "wisconsin" "wolfram" "wyoming"
    "xray" "yankee" "yellow" "zebra" "zulu"
}

One of the inputs is the number of words to produce. An error is produced if fewer bytes are provided than the number of words requested:

ERROR: too-few-bytes seq #words ;

: check-bytes ( seq #words -- seq #words )
    2dup [ length ] [ < ] bi* [ too-few-bytes ] when ; inline

Input is grouped into subsequences, where the number of subsequences is the number of words requested in the output. It's a little bit odd, but basically makes groups and then puts any remainder in the last group:

: group-words ( seq #words -- groups )
    [ dupd [ length ] [ /i ] bi* group ]
    [ 1 - cut concat suffix ] bi ; inline

Our input bytes are compressed, first by grouping them into words, then by XORing the bytes in each word:

: compress-bytes ( seq #words -- newseq )
    check-bytes group-words [ 0 [ bitxor ] reduce ] map ;

Our input will either be a byte-array, or a hex-string with every two characters representing the hexadecimal value of each byte:

: byte-string ( hexdigest -- seq )
    dup byte-array? [ 2 <groups> [ hex> ] map ] unless ;

Making a humanhash is simply converting the input, compressing into bytes representing each word, looking up the word from the word list, and joining with a requested separator:

: make-humanhash ( hexdigest #words wordlist sep -- hash )
    { [ byte-string ] [ compress-bytes ] [ nths ] [ join ] } spread ;

We provide a way to hash into a requested number of words, or four by default:

: humanhash-words ( hexdigest #words -- hash )
    default-wordlist "-" make-humanhash ;

: humanhash ( hexdigest -- hash )
    4 humanhash-words ;

And since the humanhash project includes a way to create humanhash'd uuids, we do also:

: human-uuid4 ( -- uuid hash )
    uuid4 dup [ CHAR: - = not ] filter humanhash ;