Re: Factor: October 2012

Thursday, October 25, 2012

The 3n+1 Problem

Yesterday, a lengthy blog post from the Racket community discussed the "3n+1" problem (also known as the Collatz conjecture):

Consider a positive number n. The cycle length of n is the number of times we repeat the following, until we reach n=1:

If n is odd, then n = 3n+1

If n is even, then n = n/2

For example, given n=22, we see the following sequence: 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1. The cycle length of 22 is, therefore, 16, since it took 16 repetitions to get to 1.

Given a definition of cycle length of a number, here’s the rest of the problem: given any two numbers i and j, compute the maximum cycle length over all numbers between i and j, inclusive.

Solution

Solving this in Factor could start by defining a word to take a number and compute its next value in the cycle. Since this is an internal word, we won't add the check for the end condition of n=1:

: next-cycle ( x -- y )
    dup odd? [ 3 * 1 + ] [ 2 / ] if ;

Creating the "cycle sequence" for a given number can be done by:

: cycles ( n -- seq )
    [ dup 1 > ] [ [ next-cycle ] keep ] produce swap suffix ;

We could always find the "cycle length" by calling cycles length, but a slightly faster way would not create an intermediate sequence:

: cycle-length ( n -- m )
    1 [ over 1 > ] [ [ next-cycle ] [ 1 + ] bi* ] while nip ;

We can find the maximum cycle (number and it's cycle length) for any given sequence of numbers:

: max-cycle ( seq -- elt )
    [ dup cycle-length ] { } map>assoc [ second ] supremum-by ;

For convenience, we can make some helper words to find either the number and/or the maximum cycle length:

: max-cycle-value ( seq -- n ) max-cycle first ;

: max-cycle-length ( seq -- m ) max-cycle second ;

Testing

Factor strives to be a concise language, some of which is natural for a concatenative style. You can see this in the simple approach to unit testing:.

{
    { 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1 }
} [ 22 cycles ] unit-test

{ 1  } [ 1 cycle-length ] unit-test
{ 2  } [ 2 cycle-length ] unit-test
{ 6  } [ 5 cycle-length ] unit-test
{ 16 } [ 22 cycle-length ] unit-test

{ 20  } [ 1 10 [a,b] max-cycle-length ] unit-test
{ 125 } [ 100 200 [a,b] max-cycle-length ] unit-test
{ 89  } [ 201 210 [a,b] max-cycle-length ] unit-test
{ 174 } [ 900 1000 [a,b] max-cycle-length ] unit-test

Performance

The original post discussed performance. Since many cycles have numbers in common, we could change our approach to a recursive one, with memoization:

MEMO: fast-cycle-length ( n -- m )
    dup 1 > [ next-cycle fast-cycle-length 1 + ] [ drop 1 ] if ;

Indeed, this takes roughly 15-20% of the time that our previous approach took to compute the number of cycles for the numbers 1 to 100,000 with a cold cache.

Main

Like the original solution, we can add a "main" to allow running from the command line:

: run-cycles ( -- )
    [
        [ blank? ] split-when harvest first2
        [ string>number ] bi@ 2dup [a,b] max-cycle-length
        "%s %s %s\n" printf
    ] each-line ;

MAIN: run-cycles

And using the original sample file, produce the same output:

$ cat sample-data.txt 
1 10
100 200
201 210
900 1000

$ factor cycles.factor < sample-data.txt
1 10 20
100 200 125
201 210 89
900 1000 174

The code for this is available on my Github.

Tuesday, October 23, 2012

Ned Batchelder made an interesting blog post on selecting randomly from an unknown sequence. His Python version uses a "random replacement" strategy to choose each line with a 1/n probability, where n is the number of elements seen so far:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        if random.randint(0, n) == 0:
            it = elem
    return it

random-line

I wanted to show how a similar approach in Factor could be used to select a random line from a file, or any stream that supports a character-based stream protocol.

We can make a variant of each-line that calls a quotation with each line and its "line number" (indexed from 1 to n, the number of lines in the stream):

: each-numbered-line ( ... quot: ( ... line number -- ... ) -- ... )
    [ 1 ] dip '[ swap [ @ ] [ 1 + ] bi ] each-line drop ; inline

Then, it is pretty easy to select a random line (replacing each line with chance of 1/K probability where K is the current line number.

: random-line ( -- line/f )
    f [ random zero? [ nip ] [ drop ] if ] each-numbered-line ;

random-lines

To efficiently select more than one random line from a stream, we will be using a technique called reservoir sampling. The idea is to select the first n elements, then randomly replace them with weighted probability n/K, where K is the current line number:

:: random-lines ( n -- lines )
    V{ } clone :> accum
    [| line line# |
        line# n <= [
            line accum push
        ] [
            line# random :> r
            r n < [ line r accum set-nth ] when
        ] if
    ] each-numbered-line accum ;

Note: we used local variables to implement this.

This is available in the io.random vocabulary in the Factor development branch.

Thursday, October 18, 2012

Parsing IPv4 Addresses

I came across an interesting question on StackOverflow asking about parsing IPv4 addresses. Typically, IPv4 addresses are specified with four components (e.g., something like 192.164.1.55). As the top answer points out, you might be suprised to see that ping interprets addresses a bit oddly:

C:\>ping 1
Pinging 0.0.0.1 with 32 bytes of data:

C:\>ping 1.2
Pinging 1.0.0.2 with 32 bytes of data:

C:\>ping 1.2.3
Pinging 1.2.0.3 with 32 bytes of data:

C:\>ping 1.2.3.4
Pinging 1.2.3.4 with 32 bytes of data:

C:\>ping 1.2.3.4.5
Ping request could not find host 1.2.3.4.5. Please check the name and try again.

C:\>ping 255
Pinging 0.0.0.255 with 32 bytes of data:

C:\>ping 256
Pinging 0.0.1.0 with 32 bytes of data:

In fact, you can reach google.com, using the IP address specified as dotted decimal (74.125.226.4), flat decimal (1249763844), dotted octal (0112.0175.0342.0004), flat octal (011237361004), dotted hex (0x4A.0x7D.0xE2.0x04), flat hex (0x4A7DE204), or even something of each (74.0175.0xe2.4).

Implementation

Of course, my first thought was that Factor should have a parser that works similarly (especially since I implemented support for the ping protocol awhile ago). We want a parse-ipv4 word taking a string representing the address and returning an IPv4 address string that has the typical four components.

First, we need to have words to split a string into numbered parts and a word to join them back together:

: split-components ( str -- array )
    "." split [ string>number ] map ;

: join-components ( array -- str )
    [ number>string ] map "." join ;

Then, we can parse the address simply:

: parse-ipv4 ( str -- ip )
    split-components dup length {
        { 1 [ { 0 0 0 } prepend ] }
        { 2 [ first2 [| A D | { A 0 0 D } ] call ] }
        { 3 [ first3 [| A B D | { A B 0 D } ] call ] }
        { 4 [ ] }
    } case join-components ;

Extras

If we want to support octal addresses, we can convert an octal number like 0112 to something Factor can easily parse (0o112) in our splitting code:

: cleanup-octal ( str -- str )
    dup { [ "0" head? ] [ "0x" head? not ] } 1&&
    [ 1 tail "0o" prepend ] when ;

: split-components ( str -- array )
    "." split [ cleanup-octal string>number ] map ;

And if we want to support the "carry propagation" which allows 256 to mean 0.0.1.0, we need to "bubble" the array before joining:

: bubble ( array -- array' )
    reverse 0 swap [ + 256 /mod ] map reverse nip ;

: join-components ( array -- str )
    bubble [ number>string ] map "." join ;

This (along with some error handling) has been committed to the Factor repository in the ip-parser vocabulary. If it proves useful, it might be nice to change the io.sockets to use this when resolving IPv4 addresses...