## Thursday, October 25, 2012

### The 3n+1 Problem

Yesterday, a lengthy blog post from the Racket community discussed the "3n+1" problem (also known as the Collatz conjecture):

Consider a positive number n. The cycle length of n is the number of times we repeat the following, until we reach n=1:

• If n is odd, then n = 3n+1
• If n is even, then n = n/2

For example, given n=22, we see the following sequence: 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1. The cycle length of 22 is, therefore, 16, since it took 16 repetitions to get to 1.

Given a definition of cycle length of a number, here’s the rest of the problem: given any two numbers `i` and `j`, compute the maximum cycle length over all numbers between `i` and `j`, inclusive.

## Solution

Solving this in Factor could start by defining a word to take a number and compute its next value in the cycle. Since this is an internal word, we won't add the check for the end condition of n=1:

```: next-cycle ( x -- y )
dup odd? [ 3 * 1 + ] [ 2 / ] if ;```

Creating the "cycle sequence" for a given number can be done by:

```: cycles ( n -- seq )
[ dup 1 > ] [ [ next-cycle ] keep ] produce swap suffix ;```

We could always find the "cycle length" by calling `cycles length`, but a slightly faster way would not create an intermediate sequence:

```: cycle-length ( n -- m )
1 [ over 1 > ] [ [ next-cycle ] [ 1 + ] bi* ] while nip ;```

We can find the maximum cycle (number and it's cycle length) for any given sequence of numbers:

```: max-cycle ( seq -- elt )
[ dup cycle-length ] { } map>assoc [ second ] supremum-by ;```

For convenience, we can make some helper words to find either the number and/or the maximum cycle length:

```: max-cycle-value ( seq -- n ) max-cycle first ;

: max-cycle-length ( seq -- m ) max-cycle second ;```

## Testing

Factor strives to be a concise language, some of which is natural for a concatenative style. You can see this in the simple approach to unit testing:.

```{
{ 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1 }
} [ 22 cycles ] unit-test

{ 1  } [ 1 cycle-length ] unit-test
{ 2  } [ 2 cycle-length ] unit-test
{ 6  } [ 5 cycle-length ] unit-test
{ 16 } [ 22 cycle-length ] unit-test

{ 20  } [ 1 10 [a,b] max-cycle-length ] unit-test
{ 125 } [ 100 200 [a,b] max-cycle-length ] unit-test
{ 89  } [ 201 210 [a,b] max-cycle-length ] unit-test
{ 174 } [ 900 1000 [a,b] max-cycle-length ] unit-test```

## Performance

The original post discussed performance. Since many cycles have numbers in common, we could change our approach to a recursive one, with memoization:

```MEMO: fast-cycle-length ( n -- m )
dup 1 > [ next-cycle fast-cycle-length 1 + ] [ drop 1 ] if ;```

Indeed, this takes roughly 15-20% of the time that our previous approach took to compute the number of cycles for the numbers 1 to 100,000 with a cold cache.

## Main

Like the original solution, we can add a "main" to allow running from the command line:

```: run-cycles ( -- )
[
[ blank? ] split-when harvest first2
[ string>number ] bi@ 2dup [a,b] max-cycle-length
"%s %s %s\n" printf
] each-line ;

MAIN: run-cycles
```

And using the original sample file, produce the same output:

```\$ cat sample-data.txt
1 10
100 200
201 210
900 1000

\$ factor cycles.factor < sample-data.txt
1 10 20
100 200 125
201 210 89
900 1000 174```

The code for this is available on my Github.

## Tuesday, October 23, 2012

### Select Random Lines

Ned Batchelder made an interesting blog post on selecting randomly from an unknown sequence. His Python version uses a "random replacement" strategy to choose each line with a `1/n` probability, where `n` is the number of elements seen so far:

```import random

def random_element(seq):
"""Return an element chosen at random from `seq`."""
it = None
for n, elem in enumerate(seq):
if random.randint(0, n) == 0:
it = elem
return it```

## random-line

I wanted to show how a similar approach in Factor could be used to select a random line from a file, or any stream that supports a character-based stream protocol.

We can make a variant of each-line that calls a quotation with each line and its "line number" (indexed from 1 to `n`, the number of lines in the stream):

```: each-numbered-line ( ... quot: ( ... line number -- ... ) -- ... )
[ 1 ] dip '[ swap [ @ ] [ 1 + ] bi ] each-line drop ; inline```

Then, it is pretty easy to select a random line (replacing each line with chance of `1/K` probability where `K` is the current line number.

```: random-line ( -- line/f )
f [ random zero? [ nip ] [ drop ] if ] each-numbered-line ;```

## random-lines

To efficiently select more than one random line from a stream, we will be using a technique called reservoir sampling. The idea is to select the first `n` elements, then randomly replace them with weighted probability `n/K`, where `K` is the current line number:

```:: random-lines ( n -- lines )
V{ } clone :> accum
[| line line# |
line# n <= [
line accum push
] [
line# random :> r
r n < [ line r accum set-nth ] when
] if
] each-numbered-line accum ;```
Note: we used local variables to implement this.

This is available in the io.random vocabulary in the Factor development branch.

## Thursday, October 18, 2012

I came across an interesting question on StackOverflow asking about parsing IPv4 addresses. Typically, IPv4 addresses are specified with four components (e.g., something like `192.164.1.55`). As the top answer points out, you might be suprised to see that `ping` interprets addresses a bit oddly:

```C:\>ping 1
Pinging 0.0.0.1 with 32 bytes of data:

C:\>ping 1.2
Pinging 1.0.0.2 with 32 bytes of data:

C:\>ping 1.2.3
Pinging 1.2.0.3 with 32 bytes of data:

C:\>ping 1.2.3.4
Pinging 1.2.3.4 with 32 bytes of data:

C:\>ping 1.2.3.4.5
Ping request could not find host 1.2.3.4.5. Please check the name and try again.

C:\>ping 255
Pinging 0.0.0.255 with 32 bytes of data:

C:\>ping 256
Pinging 0.0.1.0 with 32 bytes of data:```

In fact, you can reach `google.com`, using the IP address specified as dotted decimal (`74.125.226.4`), flat decimal (`1249763844`), dotted octal (`0112.0175.0342.0004`), flat octal (`011237361004`), dotted hex (`0x4A.0x7D.0xE2.0x04`), flat hex (`0x4A7DE204`), or even something of each (`74.0175.0xe2.4`).

## Implementation

Of course, my first thought was that Factor should have a parser that works similarly (especially since I implemented support for the ping protocol awhile ago). We want a `parse-ipv4` word taking a string representing the address and returning an IPv4 address string that has the typical four components.

First, we need to have words to split a string into numbered parts and a word to join them back together:

```: split-components ( str -- array )
"." split [ string>number ] map ;

: join-components ( array -- str )
[ number>string ] map "." join ;```

Then, we can parse the address simply:

```: parse-ipv4 ( str -- ip )
split-components dup length {
{ 1 [ { 0 0 0 } prepend ] }
{ 2 [ first2 [| A D | { A 0 0 D } ] call ] }
{ 3 [ first3 [| A B D | { A B 0 D } ] call ] }
{ 4 [ ] }
} case join-components ;
```

## Extras

If we want to support octal addresses, we can convert an octal number like `0112` to something Factor can easily parse (`0o112`) in our splitting code:

```: cleanup-octal ( str -- str )
dup { [ "0" head? ] [ "0x" head? not ] } 1&&
[ 1 tail "0o" prepend ] when ;

: split-components ( str -- array )
"." split [ cleanup-octal string>number ] map ;```

And if we want to support the "carry propagation" which allows `256` to mean `0.0.1.0`, we need to "bubble" the array before joining:

```: bubble ( array -- array' )
reverse 0 swap [ + 256 /mod ] map reverse nip ;

: join-components ( array -- str )
bubble [ number>string ] map "." join ;```

This (along with some error handling) has been committed to the Factor repository in the `ip-parser` vocabulary. If it proves useful, it might be nice to change the `io.sockets` to use this when resolving IPv4 addresses...