Sunday, July 10, 2011

Substrings

One year ago, I wrote about some new combinatoric functions that I wrote. I had a recent need for a couple of "substring" functions. Since I couldn't find them in Factor's standard library, I thought I would contribute these:

all-subseqs

Our first word takes a sequence, and then returns all (consecutive) subsequences that can be found. In Factor, the grouping vocabulary provides a clumping feature that splits a sequence into overlapping, fixed-length, subsequences. We can use this to find clumps of every possible length to solve this problem:
USING: grouping kernel math.ranges sequences ;

: all-subseqs ( seq -- seqs )
    dup length [1,b] [ <clumps> ] with map concat ;
You can see how this works:
( scratchpad ) "abcd" all-subseqs .
{ "a" "b" "c" "d" "ab" "bc" "cd" "abc" "bcd" "abcd" }
Note: we specifically don't include the "empty" string in the results.

longest-subseq

Several algorithms and implementations can be found for the longest common substring problem. Basically, we want a word that returns the longest (consecutive) substring that is common between two strings.

Using locals, I translated the Python solution:

USING: arrays kernel locals math math.ranges sequences ;

:: longest-subseq ( seq1 seq2 -- subseq )
    seq1 length :> len1
    seq2 length :> len2
    0 :> n!
    0 :> end!
    len1 1 + [ len2 1 + 0 <array> ] replicate :> table
    len1 [1,b] [| x |
        len2 [1,b] [| y |
            x 1 - seq1 nth
            y 1 - seq2 nth = [
                y 1 - x 1 - table nth nth 1 + :> len
                len y x table nth set-nth
                len n > [ len n! x end! ] when
            ] [ 0 y x table nth set-nth ] if
        ] each
    ] each end n - end seq1 subseq ;
Below, you can see how it works:
( scratchpad ) "abc" "def" longest-subseq .
""

( scratchpad ) "abcd" "abcde" longest-subseq .
"abcd"
Note: don't confuse this with the longest common subsequence problem (see substring vs. subsequence for more details), which is implemented in Factor by the lcs vocabulary.

4 comments:

yac said...

To answer the question how terrible code may come up when not knowing about clumping, here's my attempt to find greatest subpalindrome, which is quite slow in comparsion with its counterparts in other languages.

mrjbq7 said...
This comment has been removed by the author.
mrjbq7 said...

I also solved the Greplin longest palindrome challenge. My solution leveraged an "each-subseq" word that applied a quotation to each substring -- a bit like the "all-subseqs" word above.

yac said...

As of naive solution checking all substrings, I used tuple to hold state and checked with `while` from longest substrings, stopping on first palindrome found.

I have to profile these three solutions somethimes.