Monday, March 28, 2011

Unique Hash

Recently, I stumbled onto a blog post from 2009 which discussed a way to generate random-looking strings from a series of unique numeric id's, using PHP. The author uses a base-62 encoding ([0-9][A-Z][a-z]) to convert a number to a unique string using a series of prime numbers to reduce the potential of collisions. Below, I show how it might look in Factor.

First, we implement a simple base-62 encoder:

CONSTANT: CHARS
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

: base62 ( n -- string )
    [ dup 0 > ] [ 62 /mod CHARS nth ] "" produce-as reverse nip ;

Next, we define a series of prime numbers (which should be kept "secret" to make the algorithm hard to predict) chosen to be close to the next highest prime from the golden mean of the number of possible permutations for each string length:

CONSTANT: PRIMES
{ 1 41 2377 147299 9132313 566201239 35104476161 2176477521929 }

Finally, we can implement the "unique hash" function:

:: udihash ( n chars -- string )
    chars PRIMES nth n * 62 chars ^ mod base62
    chars CHAR: 0 pad-head ;

Try it out and see that it produces the same random-looking results as the original author. For example, a 5-character hash of the numbers 1 through 10:

( scratchpad ) 10 [1,b] [ 5 udihash print ] each
cJio3
EdRc6
qxAQ9
TGtEC
5ac2F
huKqI
KE3eL
wXmSO
YrVGR
BBE4U

You can find a discussion on StackOverflow, a similar approach used in the SilverStripe CMS to shorten URLs, and lots of search results probably inspiring (or inspired by) the original blog post.

If you're curious how to select your prime numbers, you can use math.primes to create your own list:

( scratchpad ) CONSTANT: golden-ratio 1.618033988749894848

( scratchpad ) 5 [1,b] [
                   62 swap ^ golden-ratio /i next-prime
               ] map .
{ 41 2377 147299 9132313 566201239 }

The code for this is on my Github.

3 comments:

  1. Wait, why are we using permutations rather than 62^k?

    ReplyDelete
  2. @William You are exactly right, permutations would mean the characters aren't allowed to be re-used, but in this case they are. So instead of "62 swap nPk", we should do "62 swap ^".

    ReplyDelete
  3. I've update the post to reflect that. Thanks for the feedback!

    ReplyDelete