Monday, May 14, 2012

Finding Duplicate Numbers

One of my favorite interview questions goes something like this:

Given an array of 1001 elements which contains integers from 1 to 1000 inclusive. The numbers are randomly stored in the array. Only one number repeats itself. The candidate has to come up with an efficient solution for finding that duplicate given that you can access the elements of an array only once i.e., you can read the elements of the array only once.

I thought I would show some approaches to solving this, using both Python and Factor.

Numbers

First, we need to make a randomized array of our numbers (with one duplicate):

Python:

```def make_numbers(n):
seq = [n] + range(1, 1001)
random.shuffle(seq)
return seq```

Factor:

```: make-numbers ( n -- seq )
[ 1000 [1,b] ] dip suffix randomize ;```

It's worth pointing out that Factor has many "builtin" words that can be helpful for solving this, making the shortest solution:

```IN: scratchpad 12 make-numbers duplicates first .
12```

Enumeration

While this solution is not linear, it is often the "obvious" first way to solve this problem. It goes something like this: for each element in the list, check if duplicated by any other element:

Python:

```def find_duplicate(seq):
for i, m in enumerate(seq):
for n in seq[i+1:]:
if m == n:
return m
return False```

Factor:

```: find-duplicate ( seq -- n )
dup '[ 1 + _ index-from ] find-index nip ;```

Sets

Our first "linear" solution uses sets to track which elements we've seen, stopping when we encounter our first duplicate. We will gloss over the extra memory and time required to maintain the set.

Python:

```def find_duplicate(seq):
d = set()
for m in seq:
if m in d:
return m
return False```

Factor:

```: find-duplicate ( seq -- n )
f fast-set '[ _ ?adjoin not ] find nip ;
```

Both versions use a generic hash-set. Knowing that our data is numeric values, we can use a bit-set and gain some performance:

```: find-duplicate ( seq -- n )
dup length <bit-set> '[ _ ?adjoin not ] find nip ;```

Number Theory

Taking advantage of the fact that our numbers are from 1 to 1000, we can compute the `sum` of the numbers, and subtract the expected total (using the formula `(n * n+1) / 2`) to find the duplicate:

Python:

```def find_duplicate(seq):
n = len(seq) - 1
return sum(seq) - (n*(n+1))/2```

Factor:

```: find-duplicate ( seq -- n )
[ sum ] [ length dup 1 - * 2 / ] bi - ;```

What do you think? Any other good, fun, or unusual ways to solve this problem?

Update: Zeev pointed out in the comments that another way would be to XOR all the values in our sequence and the numbers 1 to 1000:

Python:

```def find_duplicate(seq):
n1 = 0
for x in seq:
n1 ^= x
n2 = 0
for x in range(1, 1001):
n2 ^= x
return n1 ^ n2```

Factor:

```: find-duplicate ( seq -- n )
1000 [1,b] [ 0 [ bitxor ] reduce ] bi@ bitxor ;```

Zeev said...

xor is commutative. Let n1 be the result of xor'ing the sequence without a duplicate (xor a range). Let n2 be the result of xor'ing the sequence of numbers of the same range in arbitrary order with a single duplicate. Then duplicate is n1 ^ n2.

Dave said...

It turns out the xor of 1 to x is x for any x which is a multiple of 4.
1000 is a multiple of 4, so the xor of 1-1000 is 1000.
Which saves 999 xor operations.

mrjbq7 said...

@Dave: I hadn't noticed that - brilliant simplification!