Tuesday, October 23, 2012

Select Random Lines

Ned Batchelder made an interesting blog post on selecting randomly from an unknown sequence. His Python version uses a "random replacement" strategy to choose each line with a 1/n probability, where n is the number of elements seen so far:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        if random.randint(0, n) == 0:
            it = elem
    return it

random-line

I wanted to show how a similar approach in Factor could be used to select a random line from a file, or any stream that supports a character-based stream protocol.

We can make a variant of each-line that calls a quotation with each line and its "line number" (indexed from 1 to n, the number of lines in the stream):

: each-numbered-line ( ... quot: ( ... line number -- ... ) -- ... )
    [ 1 ] dip '[ swap [ @ ] [ 1 + ] bi ] each-line drop ; inline

Then, it is pretty easy to select a random line (replacing each line with chance of 1/K probability where K is the current line number.

: random-line ( -- line/f )
    f [ random zero? [ nip ] [ drop ] if ] each-numbered-line ;

random-lines

To efficiently select more than one random line from a stream, we will be using a technique called reservoir sampling. The idea is to select the first n elements, then randomly replace them with weighted probability n/K, where K is the current line number:

:: random-lines ( n -- lines )
    V{ } clone :> accum
    [| line line# |
        line# n <= [
            line accum push
        ] [
            line# random :> r
            r n < [ line r accum set-nth ] when
        ] if
    ] each-numbered-line accum ;
Note: we used local variables to implement this.

This is available in the io.random vocabulary in the Factor development branch.

No comments: