Monday, August 9, 2010

Anagrams

Dave Thomas, one of the Pragmatic Programmers, has developed a series of exercises on his site codekata.com. The idea behind it is based on the concept that to get good at something you should "practice, practice, practice". Malcolm Gladwell calls this the 10,000 Hour Rule in his book Outliers.

While browsing the site, I came across Kata Six: Anagrams. Below is my attempt in Factor.

If you want to avoid spoilers and attempt this yourself, you might not want to read the rest of this post.

First, some preliminary imports and a namespace:

USING: arrays ascii assocs fry io.encodings.ascii io.files
kernel math memoize sequences sorting strings ;

IN: anagrams

One way to check if two words are anagrams is to sort their letters and compare. For example, "listen" and "silent" are anagrams of each other (i.e., when sorted, their letters are both "eilnst").

We will use this approach to take a list of words and create a mapping of their sorted letters to a list of words that are anagrams of each other. After we do that, we'll filter the map to only have words that have anagrams (where two or more words share the same mapping).

: (all-anagrams) ( seq assoc -- )
    '[ dup natural-sort >string _ push-at ] each ;

: all-anagrams ( seq -- assoc )
    H{ } clone [ (all-anagrams) ] keep
    [ nip length 1 > ] assoc-filter ;

You can see it in action:

( scratchpad ) { "listen" "silent" "orange" } all-anagrams .
H{ { "eilnst" V{ "listen" "silent" } } }

Now that we have that, we need a word list. The link on the original blog post no longer works, but most systems come with a dictionary, so we will use that (making all the words lowercase so that we can compare in a case-insensitive way).

MEMO: dict-words ( -- seq )
    "/usr/share/dict/words" ascii file-lines [ >lower ] map ;

Given a list of dictionary words, we can calculate all anagrams:

MEMO: dict-anagrams ( -- assoc )
    dict-words all-anagrams ;

On my MacBook Pro, I see 234,936 words and 15,048 groups of anagrams. Using these, we can write a word to look for anagrams by checking the dictionary.

: anagrams ( str -- seq/f )
    >lower natural-sort >string dict-anagrams at ;

I chose to return f if no anagrams are found. You can try it out:

( scratchpad ) "listen" anagrams .
V{ "enlist" "listen" "silent" "tinsel" }

( scratchpad ) "banana" anagrams .
f

The blog goes further and asks a couple questions:

  • What sets of anagrams contain the most words?
  • What are the longest words that are anagrams?

Both of these share a common process, which is to take a sequence and filter it for the elements that have the longest length:

: longest ( seq -- subseq )
    dup 0 [ length max ] reduce '[ length _ = ] filter ;

This works pretty simply:

( scratchpad ) { "a" "ab" "abc" "abcd" "hjkl" } longest .
{ "abcd" "hjkl" }

Now we can write the words to answer those two questions:

: most-anagrams ( -- seq )
    dict-anagrams values longest ;

: longest-anagrams ( -- seq )
    dict-anagrams [ keys longest ] keep '[ _ at ] map ;

The answer is? The set of anagrams containing "groan" is the most (10 words). And two anagrams are tied for longest: "pneumohydropericardium/hydropneumopericardium" and "cholecystoduodenostomy/duodenocholecystostomy". Wouldn't you know, they'd be medical words...

The code for this is available on my Github.

1 comment:

kobi said...

elegant simple code.
good solution! well done