Tuesday, February 26, 2013

Fast "shuffle"

Randomly shuffling the elements in an array is a pretty common task. The standard algorithm for this is called the Fisher-Yates shuffle. The modern version of it looks like this in pseudo-code:

To shuffle an array a of n elements (indices 0..n-1):
  for i from n − 1 downto 1 do
       j ← random integer with 0 ≤ ji
       exchange a[j] and a[i]

We're going to implement this in Factor, and then look at improving the performance when shuffling an array of 10,000,000 numbers (generated with "10,000,000 iota >array"). Afterwards, we are going to look at Ruby and Python versions.


Version 1:

Our first implementation is a straight-forward translation of the algorithm (a simplification of the randomize word included in the Factor standard library), and looks something like this:

: shuffle1 ( seq -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange ] keep
    ] while drop ;

On our test case, this takes 2.972 seconds.

Version 2:

Our second implementation uses exchange-unsafe, which does not perform bounds checks (unnecessary given our loop constraints).

: shuffle2 ( seq -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.830 seconds (a 5% improvement on version 1!).

Version 3:

Our third implementation uses the typed vocabulary to provide type information to the compiler to produce more optimal code:

TYPED: shuffle3 ( seq: array -- seq )
    dup length [ dup 1 > ] [
       [ random ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.554 seconds (a 15% improvement on version 1!).

Version 4:

Our fourth implementation instead removes the generic dispatch of calling random as well as the dynamic variable lookup for the current random number generator:

: shuffle4 ( seq -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
       [ _ (random-integer) ] [ 1 - ] bi
       [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.408 seconds (a 19% improvement on version 1!).

Version 5:

Our fifth implementation combines the optimizations in version 3 and 4:

TYPED: shuffle5 ( seq: array -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
        [ _ (random-integer) ] [ 1 - ] bi
        [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.254 seconds (a 24% improvement on version 1!).

Version 6:

Our sixth implementation declares that the indices to exchange-unsafe are fixnums (removing a minor check that isn't optimized out by the compiler for some reason):

TYPED: shuffle6 ( seq: array -- seq )
    dup length [ dup 1 > ]
    random-generator get '[
        [ _ (random-integer) ] [ 1 - ] bi
        { fixnum fixnum } declare
        [ pick exchange-unsafe ] keep
    ] while drop ;

On our test case, this takes 2.187 seconds (a 27% improvement on version 1!).

Version 7:

Our seventh implementation is just version 6 with a quick-and-dirty random number generator using rand():

FUNCTION: int rand ( ) ;

SINGLETON: c-random
M: c-random random-32* drop rand ;

: with-c-random ( quot -- )
    [ c-random ] dip with-random ; inline

On our test case, this takes 2.015 seconds (a 32% improvement on version 1!).

Version 8:

Our eighth (and final!) version drops down to C, to implement a primitive version of shuffle:

void factor_vm::primitive_shuffle_array()
    array* a = untag_check<array>(ctx->peek());
    cell capacity = array_capacity(a);
    for (cell i = capacity - 1; i > 0; i--) {
        cell j = i + rand() / (RAND_MAX / (capacity - i) + 1);
        cell tmp = array_nth(a, i);
        set_array_nth(a, i, array_nth(a, j));
        set_array_nth(a, j, tmp);

On our test case, this takes 0.494 seconds (a 83% improvement on version 1!).


I wanted to see how various Python versions performed, so I wrote a simple version that takes 19.025 seconds on Python 2.7.3, 12.436 seconds on Jython 2.5.3, and 1.392 seconds with PyPy 1.9.0:

from random import randrange
import time

n = 10000000
l = list(xrange(n))
t0 = time.time()
i = n
while i > 1:
    j = randrange(i)
    i -= 1
    l[i], l[j] = l[j], l[i]
print 'took %.3f seconds' % (time.time() - t0)

A version using Numpy takes 3.273 seconds:

import numpy as np
import time

a = np.arange(10000000)
t0 = time.time()
print 'took %.3f seconds' % (time.time() - t0)


Curious what Ruby would look like, I tested two versions. The first implements shuffle in "pure" Ruby, taking 149.975 seconds on Ruby 1.8.7, 25.086 seconds in Ruby 1.9.3, 8.054 seconds on MacRuby 0.12, 7.258 seconds on JRuby 1.7.3, and 4.682 seconds on Rubinius 1.2.4:

def shuffle!(a)
    size = a.length
    size.times do |i|
        r = i + Kernel.rand(size - i)
        a[i], a[r] = a[r], a[i]

a = (1..10000000).to_a
t0 = Time.new
t1 = Time.new
printf "took %.3f seconds", t1-t0

But since Ruby 1.8.7, Array.shuffle! is a builtin (implemented in C), so lets look at how performance differs. It takes 0.443 seconds in Ruby 1.8.7, 0.554 seconds in Ruby 1.9.3, 0.884 seconds in MacRuby 0.12, 2.170 seconds in JRuby 1.7.3, and 3.479 seconds in Rubinius 1.2.4:

a = (1..10000000).to_a
t0 = Time.new
a = a.shuffle!
t1 = Time.new
printf "took %.3f seconds", t1-t0


C is awesome. PyPy is really fast. Factor can be pretty fast. Numpy should be faster. Jython could be faster. Python isn't very fast at all. Ruby is both blazing fast and slower than snails!


Redline6561 said...

Just for giggles, I got the following results in my Debian64 VM with SBCL 1.14*:
[*] PRE and CODE tags were rejected. Sorry if this looks like garbage.

(defun shuffle (arr)
(loop for i from (1- (length arr)) downto 1
do (rotatef (aref arr (random i)) (aref arr i))))

CL-USER> (time (shuffle *toy*))
Evaluation took:
1.065 seconds of real time
1.524095 seconds of total run time (1.524095 user, 0.000000 system)
143.10% CPU
3,161,941,417 processor cycles
0 bytes consed

Taking away the specialization on arrays with AREF, we get a slightly slower version:

(defun shuffle2 (seq)
(loop for i from (1- (length seq)) downto 1
do (rotatef (elt seq (random i)) (elt seq i))))

CL-USER> (time (shuffle2 *toy*))
Evaluation took:
1.405 seconds of real time
2.032127 seconds of total run time (2.032127 user, 0.000000 system)
144.63% CPU
4,165,084,903 processor cycles
28,128 bytes consed

Redline6561 said...

Adding (declaim (ftype (function ((simple-vector))) shuffle)) gets my lisp version down to ~0.6 seconds.

A luajit 2.0 version with no type declarations performs in about 0.63 seconds.

math = require("math")
os = require("os")

local toy = {}
for i=1, 10000000 do
toy[i] = math.random(10000000)

function shuffle(seq)
for i=(#seq-1), 1, -1 do
j = math.random(i)
seq[i], seq[j] = seq[j], seq[i]

halex said...

You really should mention the python builtin function shuffle when writing about the ruby builtin too. In my case the runtime went down a factor of 3 when using this bultin shuffle function compared to your naive one (with plain Python 2.7.3).

Andrew Dalke said...

CPython uses a dictionary lookup for module variables. Local variables have an optimized O(1) lookup because it's impossible to add new variables to local scope after a function has been declared.

If I wrap the code in a "def main():" and call it with "main()", then I get a drop from 20 seconds to 17.

Two other minor optimizations are to use range(n) instead of list(xrange(n)), and to make randrange be a local variable ("import random" then before the main loop "randrange = random.randrange"). This turns a large number of module lookup+attribute lookup calls into a single fast local lookup call.

Pypy does this last optimization automatically.

Using the suggestion of halex, random.shuffle on the array takes about 8 seconds in CPython.

mrjbq7 said...

@Redline6561: Nice examples!

@halex: I forgot about builtin shuffle, that would make a nice comparison with the builtin Ruby version.

@AndrewDalke: Great points! I wasn't trying to make Python look bad at all, just to show how performance is all over the place. I know there's many versions of optimizations you could do in Python (similar to how I try and speed up the Factor version) including writing it in C to be "fastest".

It's fun that you can usually write C in any language, but it's refreshing to see PyPy and LuaJit allowing C-like performance with more simple looking code.


Anonymous said...

@Redline6561: Your Lisp code is broken, you need to use (random (1+ i)) instead of (random i) to get an index in the right interval.

Also, please do not use ELT instead of AREF, the idea of using this algorithm on lists is disgusting.

xss said...

For python, you not need to implement it. It's already implemented under random.shuffle (http://docs.python.org/2/library/random.html#random.shuffle)

And you can use range instead of xrange and then go through and build an array (it's faster)

On my laptoon (2 years old MacBook air) and python 2.7.1 the result is:
took 13.629 seconds