Friday, June 19, 2015

Bit Test

Factor has a bit? generic that is used to test an integer to see if a particular bit is set. When operating on fixnum integers, this is implemented by the fixnum-bit? word:

: fixnum-bit? ( x n -- ? )
    { fixnum fixnum } declare
    dup 0 >= [ neg shift even? not ] [ 2drop f ] if 

Many CPU architectures have instructions for performing a Bit Test. On x86, the BT instruction is available. Below, we are going to implement a compiler intrinsic in the hopes of speeding up this operation in Factor.

Intrinsic

In cpu.architecture, add a generic %bit-test based on our CPU:

HOOK: %bit-test cpu ( dst src1 src2 temp -- )

In cpu.x86, implement %bit-test on x86 (returning a boolean using a temporary register and the CMOVB instruction to check the carry flag which holds the result of the bit test):

M:: x86 %bit-test ( dst src1 src2 temp -- )
    src1 src2 BT
    dst temp \ CMOVB (%boolean) ;

In compiler.cfg.instructions, add a ##bit-test instruction:

FOLDABLE-INSN: ##bit-test
def: dst/tagged-rep
use: src1/int-rep src2/int-rep
temp: temp/int-rep ;

In compiler.codegen, we link the ##bit-test instruction with the %bit-test word.

CODEGEN: ##bit-test %bit-test

In compiler.cfg.intrinsics, enable replacing fixnum-bit? with the %bit-test intrinsic:

: enable-bit-test ( -- )
    {
        { fixnum-bit? [ drop [ ^^bit-test ] binary-op ] }
    } enable-intrinsics ;

Disassemble

The old assembly using fixnum-bit? looked like this:

000000010ea00250: mov [rip-0x1ff256], eax
000000010ea00256: sub rsp, 0x8
000000010ea0025a: call 0x10eedf3b0 (integer>fixnum-strict)
000000010ea0025f: mov rax, [r14]
000000010ea00262: cmp rax, 0x0
000000010ea00266: jl 0x10ea002a7 (M\ fixnum bit? + 0x57)
000000010ea0026c: neg rax
000000010ea0026f: mov [r14], rax
000000010ea00272: call 0x10ebec060 (fixnum-shift)
000000010ea00277: mov rax, [r14]
000000010ea0027a: test rax, 0x10
000000010ea00281: mov rax, 0x1
000000010ea0028b: mov rbx, 0x1010eff4c
000000010ea00295: cmovnz rax, rbx
000000010ea00299: mov [r14], rax
000000010ea0029c: mov [rip-0x1ff2a2], eax
000000010ea002a2: add rsp, 0x8
000000010ea002a6: ret
000000010ea002a7: sub r14, 0x8
000000010ea002ab: mov qword [r14], 0x1
000000010ea002b2: mov [rip-0x1ff2b8], eax
000000010ea002b8: add rsp, 0x8
000000010ea002bc: ret

The new assembly looks like this with BT:

000000010e656ec0: mov [rip-0x293ec6], eax
000000010e656ec6: sub rsp, 0x8
000000010e656eca: call 0x10e467ea0 (integer>fixnum-strict)
000000010e656ecf: mov rax, [r14]
000000010e656ed2: mov rbx, [r14-0x8]
000000010e656ed6: sar rbx, 0x4
000000010e656eda: sar rax, 0x4
000000010e656ede: bt rbx, rax
000000010e656ee2: mov rbx, 0x1
000000010e656eec: mov rcx, 0x1018f169c
000000010e656ef6: cmovb rbx, rcx
000000010e656efa: sub r14, 0x8
000000010e656efe: mov [r14], rbx
000000010e656f01: mov [rip-0x293f07], eax
000000010e656f07: add rsp, 0x8
000000010e656f0b: ret

Performance

Turns out that this speeds up fixnum-bit? by a lot. Using a simple test case:

: bench-fixnum-bit ( x n -- ? )
    { fixnum fixnum } declare
    [ 0 100,000,000 ] 2dip
    '[ _ _ bit? [ 1 + ] when ] times ;

Our old version was decently fast:

IN: scratchpad gc [ 0b101010101 0 bench-fixnum-bit ] time
Running time: 0.821433838 seconds

But our new version is much faster!

IN: scratchpad gc [ 0b101010101 0 bench-fixnum-bit ] time
Running time: 0.239439108 seconds

This has been committed to the development version and is available now.

Wednesday, June 17, 2015

Compressed Sieve of Eratosthenes

In my previous post, we used "2-3-5-7" wheel factorization to produce a faster Sieve of Eratosthenes with a compression factor of 2 (by storing only the odd numbers). Samuel Tardieu reminded me that the sieve variation used by Factor uses less memory with a higher compression factor of 3.75.

I thought would be a fun implementation to show below.

Version 6

The "2-3-5" wheel with the first 30 numbers that are not divisible by 2, 3, or 5.

CONSTANT: wheel-2-3-5 $[
    30 [1,b] [
        { 2 3 5 } [ divisor? ] with any? not
    ] B{ } filter-as
]
Note: We use 30 here because the "{ 2 3 5 } product" cycle is 30.

If you look at the wheel, you see that there are 8 candidates in every 30 numbers:

IN: scratchpad wheel-2-3-5 .
{ 1 7 11 13 17 19 23 29 }

The number 8 is interesting because it suggests that we can use a single byte to store the 8 candidates in each block of 30 numbers, using a bitmask for possible primes or f:

CONSTANT: masks $[
    30 [0,b) [
        wheel-2-3-5 index [ 7 swap - 2^ ] [ f ] if*
    ] map
]

We can index into this array to find the byte/mask for every number:

: byte-mask ( n -- byte mask/f )
    30 /mod masks nth ;

Using the byte/mask we can check if a number is marked in the sieve byte-array.

:: marked? ( n sieve -- ? )
    n byte-mask :> ( byte mask )
    mask [ byte sieve nth mask bitand zero? not ] [ f ] if ;

And we can unmark a number in the sieve in a similar fashion:

:: unmark ( n sieve -- )
    n byte-mask :> ( byte mask )
    mask [ byte sieve [ mask bitnot bitand ] change-nth ] when ;

Using that, we can unmark multiples of prime numbers:

:: unmark-multiples ( i upper sieve -- )
    i sq upper i 2 * <range> [ sieve unmark ] each ;

That's all we need to build our sieve, starting from a byte-array initialized with all numbers marked and then progressively unmarking multiples of primes:

:: sieve6 ( n -- sieve )
    n 30 /i 1 + [ 0xff ] B{ } replicate-as :> sieve
    sieve length 30 * 1 - :> upper
    3 upper sqrt 2 <range> [| i |
        i sieve marked? [
            i upper sieve unmark-multiples
        ] when
    ] each sieve ;

As you can see, storing prime numbers up to 10 million takes only 333 KB!

IN: scratchpad 10,000,000 sieve6 length .
333334

This approach is used by the math.primes vocabulary to quickly check primality of any number less than 9 million using only 300 KB of memory.

Neat!

Monday, June 8, 2015

Genuine Sieve of Eratosthenes

Iain Gray posted on the mailing list about a paper called the Genuine Sieve of Eratosthenes by Melissa O'Neill, a Professor of Computer Science at Harvey Mudd College.

It begins with a discussion about some clever looking Haskell code that is just trial division and not the Sieve of Eratosthenes. At the end of the paper is an interesting discussion of performance improvements that I thought might be fun to implement in Factor as a followup to the three versions that I posted about recently.

Version 4

We are going to use wheel factorization as a way of reducing the search space by ignoring some multiples of small prime numbers. It is a bit similar to ignoring all even numbers in our previous "Version 3". In this case, we want to ignore all multiples of the first four prime numbers.

To calculate the "2-3-5-7" wheel, we start with 11 and filter the next 210 numbers that are not divisible by 2, 3, 5, or 7.

CONSTANT: wheel-2-3-5-7 $[
    11 dup 210 + [a,b] [
        { 2 3 5 7 } [ divisor? ] with any? not
    ] B{ } filter-as differences
]
Note: We use 210 here because the "{ 2 3 5 7 } product" cycle is 210.

Next, we want a way to iterate across all the numbers that might be prime, calling a quotation on each number that is prime.

:: each-prime ( upto sieve quot -- )
    11 upto '[ dup _ <= ] [
        wheel-2-3-5-7 [
            over dup 2/ sieve nth [ drop ] quot if
            +
        ] each
    ] while drop ; inline

For each prime found, we will want to mark all odd multiples as not prime.

:: mark-multiples ( i upto sieve -- )
    i sq upto i 2 * <range> [| j |
        t j 2/ sieve set-nth
    ] each ; inline

We will use a bit-array that is sized in 210-bit blocks, so the number of bits needed is:

: sieve-bits ( n -- bits )
    210 /i 1 + 210 * 2/ 6 + ; inline

Finally, we calculate our sieve, first for 11 and then for all the primes above 11.

:: sieve4 ( n -- primes )
    n sieve-bits <bit-array> :> sieve
    t 0 sieve set-nth
    t 4 sieve set-nth
    n sqrt sieve [ n sieve mark-multiples ] each-prime
    V{ 2 3 5 7 } clone :> primes
    n sieve [
        dup n <= [ primes push ] [ drop ] if
    ] each-prime primes ;

Calculating prime numbers up to 10 million is getting even faster.

IN: scratchpad gc [ 10,000,000 sieve4 ] time
Running time: 0.08523477 seconds

Version 5

If you use the compiler.tree.debugger to inspect the optimized output, you can see some dynamic dispatch with branches for fixed-width (fixnum) and arbitrary-precision (bignum) integers.

It would be nice to perform fixnum specialization or improve type propagation in the compiler, but for now we are going to write some ugly code to force fixnum math for performance.

We make a version of each-prime that inlines the iteration:

:: each-prime2 ( upto sieve quot -- )
    11 upto >fixnum '[ dup _ <= ] [
        wheel-2-3-5-7 [
            over dup 2/ sieve nth-unsafe [ drop ] quot if
            fixnum+fast
        ] each
    ] while drop ; inline

And similarly for mark-multiples:

:: mark-multiples2 ( i upto sieve -- )
    i 2 fixnum*fast :> step
    i i fixnum*fast upto >fixnum '[ dup _ <= ] [
        t over 2/ sieve set-nth-unsafe
        step fixnum+fast
    ] while drop ; inline

And use them in our sieve computation:

:: sieve5 ( n -- primes )
    n sieve-bits <bit-array> :> sieve
    t 0 sieve set-nth
    t 4 sieve set-nth
    n sqrt sieve [ n sieve mark-multiples2 ] each-prime2
    V{ 2 3 5 7 } clone :> primes
    n sieve [
        dup n <= [ primes push ] [ drop ] if
    ] each-prime2 primes ;

Calculating prime numbers up to 10 million is super fast!

IN: scratchpad gc [ 10,000,000 sieve5 ] time
Running time: 0.033914378 seconds

We can compute all the 11,078,937 prime numbers up to 200 million in about a second!

IN: scratchpad gc [ 200,000,000 sieve5 ] time
Running time: 0.970876855 seconds

And all the 50,847,534 prime numbers up to 1 billion in about six seconds!

IN: scratchpad gc [ 1,000,000,000 sieve5 ] time
Running time: 6.141224805 seconds

Not quite primegen or primesieve, but not bad for a laptop!

This is available in the math.primes.erato.fast vocabulary on our nightly builds!

Sunday, June 7, 2015

Sieve of Eratosthenes

I noticed that the Crystal Programming Language homepage has an example of the Sieve of Eratosthenes algorithm for finding prime numbers. I thought it would be fun to show a few simple variations of that algorithm in Factor.

The algorithm works by iteratively marking as not prime the multiples of each prime, starting with the multiples of 2. In pseudocode, it looks like this:

Input: an integer n > 1
 
Let A be an array of Boolean values, indexed by integers 2 to n,
initially all set to true.
 
 for i = 2, 3, 4, ..., not exceeding n:
  if A[i] is true:
    for j = i2, i2+i, i2+2i, i2+3i, ..., not exceeding n :
      A[j] := false
 
Output: all i such that A[i] is true.

We will be looking at the performance and memory usage to calculate the 664,579 prime numbers between 1 and 10,000,000.

Version 1

Our first version, is a straightforward implementation from the pseudocode:

:: sieve1 ( n -- primes )
    n 1 + t <array> :> sieve
    f 0 sieve set-nth
    f 1 sieve set-nth

    2 n sqrt [a,b] [| i |
        i sieve nth [
            i sq n i <range> [| j |
                f j sieve set-nth
            ] each
        ] when
    ] each

    sieve [ drop ] selector [ each-index ] dip ;

We can show that it works for the first few prime numbers:

IN: scratchpad 25 sieve1 .
V{ 2 3 5 7 11 13 17 19 23 }

Calculating prime numbers up to 10 million takes less than half a second:

IN: scratchpad [ 10,000,000 sieve1 ] time
Running time: 0.408065579 seconds

The memory used is basically an array with 10 million elements in it, each element of the array is a boolean (64 bits on my laptop). Total memory used is 80 MB.

Version 2

Storing booleans can be more memory-efficient using bit-arrays, where each boolean is stored as a single bit. The logic is the same, however primes are marked by false rather than true.

:: sieve2 ( n -- primes )
    n 1 + <bit-array> :> sieve
    t 0 sieve set-nth
    t 1 sieve set-nth

    2 n sqrt [a,b] [| i |
        i sieve nth [
            i sq n i <range> [| j |
                t j sieve set-nth
            ] each
        ] unless
    ] each

    sieve [ drop not ] selector [ each-index ] dip ;

Calculating prime numbers up to 10 million is faster:

IN: scratchpad [ 10,000,000 sieve2 ] time
Running time: 0.241894524 seconds

The memory used is now one bit per number, so only 10 million bits or 1.25 MB.

Version 3

Since we know that even numbers (except 2) are not prime, we can only work with odd numbers and start our search from 3. For each number found, we can check only the odd multiples by marking off i2 + k*i for only even k.

:: sieve3 ( n -- sieve )
    n dup odd? [ 1 + ] when 2/ <bit-array> :> sieve
    t 0 sieve set-nth

    3 n sqrt 2 <range> [| i |
        i 2/ sieve nth [
            i sq n i 2 * <range> [| j |
                t j 2/ sieve set-nth
            ] each
        ] unless
    ] each

    V{ 2 } clone :> primes

    sieve [
        swap [ drop ] [ 2 * 1 + primes push ] if
    ] each-index

    primes ;
Note: the convenience of selector is not available because we want to initialize the list of primes with 2. Instead, we just do it manually.

Calculating prime numbers up to 10 million is much faster!

IN: scratchpad [ 10,000,000 sieve3 ] time
Running time: 0.110387127 seconds

As for memory usage, it is storing one bit for each odd number, so only 5 million bits or 625 KB.

Monday, June 1, 2015

SEND + MORE = MONEY?

There's a classic verbal arithmetic puzzle that was published in 1924 by Henry Dudeney, where you solve the equation:

    S E N D
+   M O R E
-----------
= M O N E Y
Note: Each letter is a unique digit and S and M can't be zero.

A few days ago, I noticed a blog post with solutions in Perl 6 that references another blog post with a solution in Haskell. I thought I'd show a solution in Factor using the backtrack vocabulary that provides something like John McCarthy's amb ambiguous operator.

First, we need a list of available digits, just the numbers 0 through 9:

CONSTANT: digits { 0 1 2 3 4 5 6 7 8 9 }

Next, we turn a sequence of digits into a number (e.g., { 1 2 3 4 } becomes 1234):

: >number ( seq -- n ) 0 [ [ 10 * ] dip + ] reduce ;

We can then implement our solver, by choosing digits while progressively restricting the possible values for future digits using the ones we've chosen so far (using local variables to store the digits):

:: send-more-money ( -- )
    [
        digits { 0 } diff amb-lazy :> s
        digits { s } diff amb-lazy :> e
        digits { s e } diff amb-lazy :> n
        digits { s e n } diff amb-lazy :> d
        { s e n d } >number :> send

        digits { 0 s e n d } diff amb-lazy :> m
        digits { s e n d m } diff amb-lazy :> o
        digits { s e n d m o } diff amb-lazy :> r
        { m o r e } >number :> more

        digits { s e n d m o r } diff amb-lazy :> y
        { m o n e y } >number :> money

        send more + money = [
            send more money "   %s\n+  %s\n= %s\n" printf
        ] when

    ] amb-all ;
Note: We search all solutions using amb-all (even though there is only one). In this case, it is effectively an iterative search, which we could implement without backtracking. If we wanted the first solution, we could use if-amb.

So, what's the answer? Let's see!

IN: scratchpad send-more-money
   9567
+  1085
= 10652

Neat! And it's fast too -- solving this puzzle in about 2.5 seconds on my laptop.

The code for this is on my GitHub.