Wednesday, October 26, 2011

Disassemble

A neat trick that Factor provides is the ability to disassemble functions into the machine code that is generated by the compiler. In 2008, Slava Pestov created a disassembler, and has improved it a bit since then (switching to udis86 for its implementation).

Constant Folding

The compiler performs constant folding. Using the compiler.tree.debugger vocabulary, you can output the optimized form of a quotation:

( scratchpad ) [ 2 2 + ] optimized.
[ 4 ]

Using the disassembler, you can see the machine code this generates. The folded result is stored directly; 0x40 is the tagged representation of the fixnum 4, since fixnums are stored shifted left four bits:

( scratchpad ) [ 2 2 + ] disassemble
011c1a5530: 4983c608        add r14, 0x8
011c1a5534: 49c70640000000  mov qword [r14], 0x40
011c1a553b: c3              ret 
011c1a553c: 0000            add [rax], al
011c1a553e: 0000            add [rax], al

Local Variables

One of the questions that comes up sometimes is whether local variables affect performance. We can examine two words that add numbers together, one using locals and one just using the stack:

( scratchpad ) : foo ( x y -- z ) + ;

( scratchpad ) :: bar ( x y -- z ) x y + ;

The "optimized output" looks a little different:

( scratchpad ) \ foo optimized.
[ + ]

( scratchpad ) \ bar optimized.
[ "COMPLEX SHUFFLE" "COMPLEX SHUFFLE" R> + ]

But the machine code that is generated is identical; the locals compile away entirely:

( scratchpad ) \ foo disassemble
01115de7b0: 488d1d05000000  lea rbx, [rip+0x5]
01115de7b7: e9e49439ff      jmp 0x110977ca0 (+)
01115de7bc: 0000            add [rax], al
01115de7be: 0000            add [rax], al

( scratchpad ) \ bar disassemble
01115ef620: 488d1d05000000  lea rbx, [rip+0x5]
01115ef627: e9748638ff      jmp 0x110977ca0 (+)
01115ef62c: 0000            add [rax], al
01115ef62e: 0000            add [rax], al

Dynamic Variables

Another frequently used feature is dynamic variables, implemented by the namespaces vocabulary. For example, the definition of the print word looks for the current value of the output-stream variable and then calls stream-print on it:

( scratchpad ) \ print see
USING: namespaces ;
IN: io
: print ( str -- ) output-stream get stream-print ; inline
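
As a quick sketch of how dynamic variables behave (using a throwaway greeting symbol, not part of the print example), values are stored in the current namespace, and with-variable scopes a binding to a quotation:

( scratchpad ) USE: namespaces
( scratchpad ) SYMBOL: greeting
( scratchpad ) "Hello" greeting set
( scratchpad ) greeting get .
"Hello"
( scratchpad ) "Howdy" greeting [ greeting get . ] with-variable
"Howdy"
( scratchpad ) greeting get .
"Hello"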

The optimized output inlines the implementation of get, which searches the current context's stack of namespaces for the variable:

( scratchpad ) [ "Hello, world" print ] optimized.
[
    "Hello, world" \ output-stream 0 context-object assoc-stack
    stream-print
]

You can inspect the machine code generated, seeing references to the Factor words that are being called:

( scratchpad ) [ "Hello, world" print ] disassemble
011c0c6c40: 4c8d1df9ffffff        lea r11, [rip-0x7]
011c0c6c47: 6820000000            push dword 0x20
011c0c6c4c: 4153                  push r11
011c0c6c4e: 4883ec08              sub rsp, 0x8
011c0c6c52: 4983c618              add r14, 0x18
011c0c6c56: 48b8dbc5a31a01000000  mov rax, 0x11aa3c5db
011c0c6c60: 498946f0              mov [r14-0x10], rax
011c0c6c64: 498b4500              mov rax, [r13+0x0]
011c0c6c68: 488b4040              mov rax, [rax+0x40]
011c0c6c6c: 498906                mov [r14], rax
011c0c6c6f: 48b86c91810e01000000  mov rax, 0x10e81916c
011c0c6c79: 498946f8              mov [r14-0x8], rax
011c0c6c7d: e8de4e36ff            call 0x11b42bb60 (assoc-stack)
011c0c6c82: 4883c418              add rsp, 0x18
011c0c6c86: 488d1d05000000        lea rbx, [rip+0x5]
011c0c6c8d: e94e5264ff            jmp 0x11b70bee0 (stream-print)
011c0c6c92: 0000                  add [rax], al
011c0c6c94: 0000                  add [rax], al
011c0c6c96: 0000                  add [rax], al
011c0c6c98: 0000                  add [rax], al
011c0c6c9a: 0000                  add [rax], al
011c0c6c9c: 0000                  add [rax], al
011c0c6c9e: 0000                  add [rax], al

Wednesday, October 12, 2011

Optimizing 2^x

A great little article was posted last year about optimizing 2^x for doubles (by approximation). The author reports a performance improvement of more than 40% in C#. I wondered if we could get similar improvements in Factor.

The basic idea is to use the fact that a number a^(b+c) can be factored into a^b * a^c. If we separate a floating point number into two components, an integer and a fractional part, we can show that:

2^n = 2^integer(n) * 2^fractional(n)

We are going to approximate this value by using a lookup table to compute the fractional part (within a specified precision). For example, to compute within a 0.001 precision, we need 1000 lookup values, essentially performing this calculation:

2^n = ( 1 << Int(n) ) * Table[ (int) ( Frac(n) * 1000 ) ];
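
Before building the real thing, here is a minimal one-table sketch of that calculation in Factor (TABLE and pow2-rough are hypothetical names, and it only handles n >= 0):

CONSTANT: TABLE $[ 1000 iota [ 1000 /f 2 swap ^ ] map ]

! rough sketch, only valid for n >= 0
: pow2-rough ( n -- 2^n )
    dup >integer [ - 1000 * >integer TABLE nth ] keep
    1 swap shift * ;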

Implementation

So, we need a word that can split a floating point number into those two values:

: float>parts ( x -- float int )
    dup >integer [ - ] keep ; inline
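
For example (outputs shown in stack-effect order, fractional part then integer part):

( scratchpad ) 3.25 float>parts [ . ] bi@
0.25
3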

Instead of one table with 1000 values, we will copy the original author's decision to use three lookup tables for additional precision. The following code calculates these lookup tables:

CONSTANT: BITS1 10
CONSTANT: BITS2 $[ BITS1 2 * ]
CONSTANT: BITS3 $[ BITS1 3 * ]

CONSTANT: PRECISION1 $[ 1 BITS1 shift ]
CONSTANT: PRECISION2 $[ 1 BITS2 shift ]
CONSTANT: PRECISION3 $[ 1 BITS3 shift ]

CONSTANT: MASK $[ PRECISION1 1 - ]

CONSTANT: FRAC1 $[ 2 PRECISION1 iota [ PRECISION1 / ^ ] with map ]
CONSTANT: FRAC2 $[ 2 PRECISION1 iota [ PRECISION2 / ^ ] with map ]
CONSTANT: FRAC3 $[ 2 PRECISION1 iota [ PRECISION3 / ^ ] with map ]
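
Note that all three tables have PRECISION1 (1024) entries; only the denominator of the exponent differs, so together they cover 30 bits of fraction. A quick sanity check in the listener:

( scratchpad ) FRAC1 length FRAC2 length FRAC3 length 3array .
{ 1024 1024 1024 }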

The pow2 word looks pretty similar to our original mathematical definition:

: pow2 ( n -- 2^n )
    >float 2^int 2^frac * >float ;

The guts of the implementation are in the 2^int and 2^frac words. 2^int splits the input into 2 raised to the integer part plus a leftover fraction in [0,1), and 2^frac scales that fraction to a 30-bit fixed-point integer, using each group of 10 bits to index one of the three tables and multiplying the results together:

: 2^int ( n -- 2^int frac )
    [ float>parts ] keep 0 >= [ 1 swap shift ] [
        ! negative input: shift the fraction into [0,1)
        over 0 < [ [ 1 + ] [ 1 - ] bi* ] when
        ! 2^int is 1.0 / 2^(-int) for a negative integer part
        1 swap neg shift 1.0 swap /
    ] if swap ; inline
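
For example (outputs in stack-effect order; note how a negative input has its fraction shifted into [0,1)):

( scratchpad ) 3.75 2^int [ . ] bi@
8
0.75

( scratchpad ) -1.5 2^int [ . ] bi@
0.25
0.5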

: 2^frac ( frac -- 2^frac )
    PRECISION3 * >fixnum                             ! 30-bit fixed-point
    [ BITS2 neg shift FRAC1 nth-unsafe ]             ! top 10 bits
    [ BITS1 neg shift MASK bitand FRAC2 nth-unsafe ] ! middle 10 bits
    [ MASK bitand FRAC3 nth-unsafe ] tri * * ; inline ! bottom 10 bits
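
For example, a fraction of 0.5 becomes the fixed-point value 2^29, which splits into indices 512, 0, and 0, so only FRAC1 contributes and the result is 2^(512/1024), exactly the square root of two:

( scratchpad ) 0.5 PRECISION3 * >fixnum
               [ BITS2 neg shift ]
               [ BITS1 neg shift MASK bitand ]
               [ MASK bitand ] tri 3array .
{ 512 0 0 }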

Testing

Let's try it and see how well it works for small values:

( scratchpad ) 2 1.5 ^ .
2.82842712474619

( scratchpad ) 1.5 pow2 .
2.82842712474619

It seems to work. How about larger values:

( scratchpad ) 2 16.3 ^ .
80684.28027297248

( scratchpad ) 16.3 pow2 .
80684.28026255539

The error is clearly detectable, but to test that it really works the way we expect it to, we will need to calculate relative error:

: relative-error ( approx value -- relative-error )
    [ - abs ] keep / ;

Our test case will generate random values, compute 2^x using our approximation, and verify that the relative error (about 1.3e-10 for the large-value example above) is less than 0.000000001 when compared with the correct result:

[ t ] [
    10000 [ -20 20 uniform-random-float ] replicate
    [ [ pow2 ] [ 2 swap ^ ] bi relative-error ] map
    supremum 1e-9 <
] unit-test

And to verify performance, we will benchmark it against the built-in (and more accurate) ^ word:

: pow2-test ( seq -- new old )
    [ [ pow2 drop ] [ each ] benchmark ]
    [ 2 swap [ ^ drop ] with [ each ] benchmark ] bi ;

We got almost a 40% performance improvement at the cost of a small loss of precision - not bad! (The result below is the new time divided by the old time.)

( scratchpad ) 10000 [ -20 20 uniform-random-float ] replicate
               pow2-test / >float .
0.6293153754782392

The code for this is on my GitHub.