Randomly shuffling the elements in an array is a pretty common task. The standard algorithm for this is called the Fisher-Yates shuffle. The modern version of it looks like this in pseudo-code:

To shuffle an arrayaofnelements (indices 0..n-1):forifromn− 1downto1doj← random integer with 0 ≤j≤iexchangea[j] anda[i]

We're going to implement this in Factor, and then look at improving the performance when shuffling an array of 10,000,000 numbers (generated with "`10,000,000 iota >array`

"). Afterwards, we are going to look at Ruby and Python versions.

## Factor

**Version 1:**

Our first implementation is a straight-forward translation of the algorithm (a simplification of the randomize word included in the Factor standard library), and looks something like this:

: shuffle1 ( seq -- seq ) dup length [ dup 1 > ] [ [ random ] [ 1 - ] bi [ pick exchange ] keep ] while drop ;

On our test case, this takes **2.972** seconds.

**Version 2:**

Our second implementation uses exchange-unsafe, which does not perform bounds checks (unnecessary given our loop constraints).

: shuffle2 ( seq -- seq ) dup length [ dup 1 > ] [ [ random ] [ 1 - ] bi [ pick exchange-unsafe ] keep ] while drop ;

On our test case, this takes **2.830** seconds (a 5% improvement on version 1!).

**Version 3:**

Our third implementation uses the typed vocabulary to provide type information to the compiler to produce more optimal code:

TYPED: shuffle3 ( seq: array -- seq ) dup length [ dup 1 > ] [ [ random ] [ 1 - ] bi [ pick exchange-unsafe ] keep ] while drop ;

On our test case, this takes **2.554** seconds (a 15% improvement on version 1!).

**Version 4:**

Our fourth implementation instead removes the generic dispatch of calling random as well as the dynamic variable lookup for the current random number generator:

: shuffle4 ( seq -- seq ) dup length [ dup 1 > ] random-generator get '[ [ _ (random-integer) ] [ 1 - ] bi [ pick exchange-unsafe ] keep ] while drop ;

On our test case, this takes **2.408** seconds (a 19% improvement on version 1!).

**Version 5:**

Our fifth implementation combines the optimizations in version 3 and 4:

TYPED: shuffle5 ( seq: array -- seq ) dup length [ dup 1 > ] random-generator get '[ [ _ (random-integer) ] [ 1 - ] bi [ pick exchange-unsafe ] keep ] while drop ;

On our test case, this takes **2.254** seconds (a 24% improvement on version 1!).

**Version 6:**

Our sixth implementation declares that the indices to exchange-unsafe are fixnums (removing a minor check that isn't optimized out by the compiler for some reason):

TYPED: shuffle6 ( seq: array -- seq ) dup length [ dup 1 > ] random-generator get '[ [ _ (random-integer) ] [ 1 - ] bi { fixnum fixnum } declare [ pick exchange-unsafe ] keep ] while drop ;

On our test case, this takes **2.187** seconds (a 27% improvement on version 1!).

**Version 7:**

Our seventh implementation is just version 6 with a quick-and-dirty random number generator using rand():

FUNCTION: int rand ( ) ; SINGLETON: c-random M: c-random random-32* drop rand ; : with-c-random ( quot -- ) [ c-random ] dip with-random ; inline

On our test case, this takes **2.015** seconds (a 32% improvement on version 1!).

**Version 8:**

Our eighth (and final!) version drops down to C, to implement a primitive version of shuffle:

void factor_vm::primitive_shuffle_array() { array* a = untag_check<array>(ctx->peek()); cell capacity = array_capacity(a); for (cell i = capacity - 1; i > 0; i--) { cell j = i + rand() / (RAND_MAX / (capacity - i) + 1); cell tmp = array_nth(a, i); set_array_nth(a, i, array_nth(a, j)); set_array_nth(a, j, tmp); } }

On our test case, this takes **0.494** seconds (a 83% improvement on version 1!).

## Python

I wanted to see how various Python versions performed, so I wrote a simple version that takes **19.025** seconds on Python 2.7.3, **12.436** seconds on Jython 2.5.3, and **1.392** seconds with PyPy 1.9.0:

from random import randrange import time n = 10000000 l = list(xrange(n)) t0 = time.time() i = n while i > 1: j = randrange(i) i -= 1 l[i], l[j] = l[j], l[i] print 'took %.3f seconds' % (time.time() - t0)

A version using Numpy takes **3.273** seconds:

import numpy as np import time a = np.arange(10000000) t0 = time.time() np.random.shuffle(a) print 'took %.3f seconds' % (time.time() - t0)

## Ruby

Curious what Ruby would look like, I tested two versions. The first implements shuffle in "pure" Ruby, taking **149.975** seconds on Ruby 1.8.7, **25.086** seconds in Ruby 1.9.3, **8.054** seconds on MacRuby 0.12, **7.258** seconds on JRuby 1.7.3, and **4.682** seconds on Rubinius 1.2.4:

def shuffle!(a) size = a.length size.times do |i| r = i + Kernel.rand(size - i) a[i], a[r] = a[r], a[i] end end a = (1..10000000).to_a t0 = Time.new shuffle!(a) t1 = Time.new printf "took %.3f seconds", t1-t0

But since Ruby 1.8.7, Array.shuffle! is a builtin (implemented in C), so lets look at how performance differs. It takes **0.443** seconds in Ruby 1.8.7, **0.554** seconds in Ruby 1.9.3, **0.884** seconds in MacRuby 0.12, **2.170** seconds in JRuby 1.7.3, and **3.479** seconds in Rubinius 1.2.4:

a = (1..10000000).to_a t0 = Time.new a = a.shuffle! t1 = Time.new printf "took %.3f seconds", t1-t0

## Conclusion

C is awesome. PyPy is really fast. Factor can be pretty fast. Numpy should be faster. Jython could be faster. Python isn't very fast at all. Ruby is both blazing fast and slower than snails!

## 7 comments:

Just for giggles, I got the following results in my Debian64 VM with SBCL 1.14*:

[*] PRE and CODE tags were rejected. Sorry if this looks like garbage.

(defun shuffle (arr)

(loop for i from (1- (length arr)) downto 1

do (rotatef (aref arr (random i)) (aref arr i))))

CL-USER> (time (shuffle *toy*))

Evaluation took:

1.065 seconds of real time

1.524095 seconds of total run time (1.524095 user, 0.000000 system)

143.10% CPU

3,161,941,417 processor cycles

0 bytes consed

Taking away the specialization on arrays with AREF, we get a slightly slower version:

(defun shuffle2 (seq)

(loop for i from (1- (length seq)) downto 1

do (rotatef (elt seq (random i)) (elt seq i))))

CL-USER> (time (shuffle2 *toy*))

Evaluation took:

1.405 seconds of real time

2.032127 seconds of total run time (2.032127 user, 0.000000 system)

144.63% CPU

4,165,084,903 processor cycles

28,128 bytes consed

Adding (declaim (ftype (function ((simple-vector))) shuffle)) gets my lisp version down to ~0.6 seconds.

A luajit 2.0 version with no type declarations performs in about 0.63 seconds.

math = require("math")

os = require("os")

local toy = {}

for i=1, 10000000 do

toy[i] = math.random(10000000)

end

function shuffle(seq)

for i=(#seq-1), 1, -1 do

j = math.random(i)

seq[i], seq[j] = seq[j], seq[i]

end

end

You really should mention the python builtin function shuffle when writing about the ruby builtin too. In my case the runtime went down a factor of 3 when using this bultin shuffle function compared to your naive one (with plain Python 2.7.3).

CPython uses a dictionary lookup for module variables. Local variables have an optimized O(1) lookup because it's impossible to add new variables to local scope after a function has been declared.

If I wrap the code in a "def main():" and call it with "main()", then I get a drop from 20 seconds to 17.

Two other minor optimizations are to use range(n) instead of list(xrange(n)), and to make randrange be a local variable ("import random" then before the main loop "randrange = random.randrange"). This turns a large number of module lookup+attribute lookup calls into a single fast local lookup call.

Pypy does this last optimization automatically.

Using the suggestion of halex, random.shuffle on the array takes about 8 seconds in CPython.

@Redline6561: Nice examples!

@halex: I forgot about builtin shuffle, that would make a nice comparison with the builtin Ruby version.

@AndrewDalke: Great points! I wasn't trying to make Python look bad at all, just to show how performance is all over the place. I know there's many versions of optimizations you could do in Python (similar to how I try and speed up the Factor version) including writing it in C to be "fastest".

It's fun that you can usually write C in any language, but it's refreshing to see PyPy and LuaJit allowing C-like performance with more simple looking code.

Thanks!

@Redline6561: Your Lisp code is broken, you need to use (random (1+ i)) instead of (random i) to get an index in the right interval.

Also, please do not use ELT instead of AREF, the idea of using this algorithm on lists is disgusting.

For python, you not need to implement it. It's already implemented under random.shuffle (http://docs.python.org/2/library/random.html#random.shuffle)

And you can use range instead of xrange and then go through and build an array (it's faster)

https://gist.github.com/ehmo/5065450

On my laptoon (2 years old MacBook air) and python 2.7.1 the result is:

took 13.629 secondsPost a Comment