Friday, July 29, 2011

Robohash

A few days ago, I read about Robohash, a website for creating unique images (of robots) from any text. The author was thinking of using it for icons on a forum, but there are probably other use-cases for this. For example, as a variation of the color password hash to help users avoid typos in their passwords or, perhaps, as a visual representation of a user-submitted "secret code".
USING: images.http kernel sequences urls urls.encoding ;
First, we need to create URLs of the form http://robohash.org/YOUR_TEXT, as instructed:
: robohash-url ( str -- url )
    url-encode "http://robohash.org/" prepend >url ;
Next, we would like to support the different "image sets" that Robohash supports.
: (robohash) ( str type -- image )
    [ robohash-url ] [ "set" set-query-param ] bi*
    load-http-image ;
Using this, we can create image loaders for each set (currently three sets: "set1", "set2", and "set3"):
: robohash1 ( str -- image ) "set1" (robohash) ;

: robohash2 ( str -- image ) "set2" (robohash) ;

: robohash3 ( str -- image ) "set3" (robohash) ;
You can try it out and see that it works:
Robohash also supports custom backgrounds, changing image sizes, and varying image formats (e.g., JPG or BMP). Adding support for that is an exercise to the reader.

The code for this is on my Github.

Saturday, July 23, 2011

Majority Vote

A Linear Time Majority Vote Algorithm was invented in 1980 by Bob Boyer and J Strother Moore (inventors of the popular Boyer-Moore string search algorithm). Not seeing this available in Factor, I thought to contribute one.
Note: this is also called the "Moore’s Voting Algorithm".
The algorithm simply looks at each element of the sequence:

Keep a candidate element and a counter (initially unknown and zero, respectively).

As we move across the sequence, look at each element:
  • If the counter is 0: the element is the candidate and the counter is 1.
  • If the counter is not 0: increment if the element is the candidate, decrement if not.
When we are done, the candidate is the majority element, if there is a majority.

Using this specification, we can implement the algorithm:
: majority ( seq -- elt/f )
    [ f 0 ] dip [
        over zero? [ 2nip 1 ] [ 
            pick = [ 1 + ] [ 1 - ] if
        ] if
    ] each zero? [ drop f ] when ;
A few simple tests show that this is working:
[ f ] [ { } majority ] unit-test
[ f ] [ { 1 2 } majority ] unit-test
[ 1 ] [ { 1 1 2 } majority ] unit-test
[ f ] [ { 1 1 2 2 } majority ] unit-test
[ 2 ] [ { 1 1 2 2 2 } majority ] unit-test
This is perhaps not quite idiomatic Factor, can you improve it?

Tuesday, July 19, 2011

One-Liners

Inspired by a blog post about "one-liners" in Clojure, I thought I'd demonstrate a few small pieces of Factor code doing some similar things:

Using the map word, we can apply a "doubling" quotation to each element.
( scratchpad ) { 4 8 15 16 23 42 } [ 2 * ] map .
{ 8 16 30 32 46 84 }
We can easily calculate the sum of a sequence (here the numbers 1 through 1000):
( scratchpad ) 1000 [1,b] sum .
500500
We check if any of a list of words are within a string:
( scratchpad ) { "factor" "concatenative" "stack-based" }
               [ "factor is awesome" subseq? ] any?
You can easily read the file-contents, or file-lines, with a specified encoding (e.g., UTF-8):
( scratchpad ) "/path/to/file.txt" utf8 file-contents
Sing the four verses to the "Happy Birthday" song:
( scratchpad ) 4 [1,b] [
                   "Happy Birthday " write
                   3 = "dear NAME" "to You" ? print
               ] each
Happy Birthday to You
Happy Birthday to You
Happy Birthday dear NAME
Happy Birthday to You
Use filter (or even partition) with some selection criteria:
( scratchpad ) { 49 58 76 82 88 90 } [ 60 > ] filter .
{ 76 82 88 90 }
Use the http.client vocabulary to access a web service, and then the xml vocabulary to parse the result from a string. You could use the json.reader vocabulary to parse JSON responses.
( scratchpad ) "http://search.twitter.com/search.atom?q=factor"
               http-get nip string>xml
Check the HTTP headers to find the version of a server used by a web server:
( scratchpad ) "http://apple.com" http-get drop
               header>> "server" swap at .
"Apache/2.2.3 (Oracle)"
Use infimum and supremum to find the minimum and maximum, respectively, of a list (alternatively, you could use my maximum/minimum functions):
( scratchpad ) { 14 36 -7 46 98 } infimum .
-7

( scratchpad ) { 14 36 -7 46 98 } supremum .
98
Parse a string into groups of two characters, then interpret those as hex values of characters, mapping the output as a string:
( scratchpad ) "474e552773204e6f7420556e6978"
               2 <groups> [ 16 base> ] "" map-as .
"GNU's Not Unix"
Use concurrency.combinators to perform certain tasks in parallel with futures:
( scratchpad ) { 1 0 -1 } [ 2 + ] parallel-map .
{ 3 2 1 }

Friday, July 15, 2011

Detecting Plagiarism

About a month ago, Tom Moertel wrote a simple plagiarism detector in Haskell. I wanted to replicate his functionality using Factor, to contrast the two solutions.

The strategy in this plagiarism detector is fairly simple: calculate "long enough" n-grams that should be fairly unique and see if they are present in a particular piece of "suspect text", converting the common pieces into UPPERCASE so that we can visually see what might be plagiarized.
USING: command-line grouping io io.encodings.utf8 io.files
kernel math math.parser math.ranges namespaces regexp sequences
sets splitting unicode.case unicode.categories unicode.data ;

IN: plagiarism
We can split text (using the grouping vocabulary) into consecutive groups of "n" words:
: n-grams ( str n -- seq )
    [ [ blank? ] split-when harvest ] [ <clumps> ] bi* ;
Given a piece of text suspected to be plagiarized and some sources to compare against, we can compute the common n-grams between the two pieces of text:
: common-n-grams ( suspect sources n -- n-grams )
    [ n-grams ] curry dup [ map concat ] curry bi* intersect ;
For each common n-gram found, we use a regular expression to find the matching part of the suspect text. The regular expression we will use, looks something like a space or start of line followed by our n-gram and ending with a space or end of line, allowing varying whitespace between words, being case insensitive, and ignoring non-letters within a word:
: n-gram>regexp ( seq -- regexp )
    [ [ Letter? not ] split-when "[\\W\\S]" join ] map
    "\\s+" join "(\\s|^)" "(\\s|$)" surround
    "i" <optioned-regexp> ;
The sequences vocabulary contains a change-nth word to modify a particular element in a sequence. We can create a word to modify several elements easily:
: change-nths ( indices seq quot: ( elt -- elt' ) -- )
    [ change-nth ] 2curry each ; inline
Using change-nths and the "n-gram regexp", we can ch>upper each matching portion of text:
: upper-matches ( str regexp -- )
    [ [ [a,b) ] dip [ ch>upper ] change-nths ] each-match ;
Using these building blocks, we can build a simple plagiarism detector:
  1. Compute the common n-grams between suspect and source texts
  2. Create a regular expression for each common n-gram
  3. For each match, change the matching characters to uppercase
: detect-plagiarism ( suspect sources n -- suspect' )
    [ dupd ] dip common-n-grams [
        dupd n-gram>regexp upper-matches
    ] each ;
A "main method" enables this program to run from the command line:
: run-plagiarism ( -- )
    command-line get dup length 3 < [
        drop "USAGE: plagiarism N suspect.txt source.txt..." print
    ] [
        [ rest [ utf8 file-contents ] map unclip swap ]
        [ first string>number ] bi detect-plagiarism print
    ] if ;

MAIN: run-plagiarism
You can see it work by trying to find common 4-grams on some simple text (in this case, I added the word "really"):
( scratchpad ) "this is a really long piece of text"
               { "this is a long piece of text" }
               4 detect-plagiarism print
this is a really LONG PIECE OF TEXT
It's fast enough for small examples, but not that fast for complete novels, particularly when run on texts with many common n-grams. One idea for improving the speed might be to examine the common-n-grams algorithm to return sequences of "n or more". This way, if the text contains a common 7-gram, and you are looking at common 4-grams, then it would have one entry instead of three.

The code for this is on my Github.

Tuesday, July 12, 2011

Concatenative Thinking

I've written about the conciseness of Factor before. Yesterday, I noticed a link to a functional programming tutorial called "Functional Thinking" that was posted two months ago.

The tutorial develops a program for classifying numbers based on the sum of its factors into perfect, abundant, or deficient numbers. The program goes through three improvements, ending up at this "best" solution in Java (using the Functional Java library):

public class FNumberClassifier {

    public boolean isFactor(int number, int potential_factor) {
        return number % potential_factor == 0;
    }

    public List<Integer> factors(final int number) {
        return range(1, number+1).filter(new F<Integer, Boolean>() {
            public Boolean f(final Integer i) {
                return number % i == 0;
            }
        });
    }

    public int sum(List<Integer> factors) {
        return factors.foldLeft(fj.function.Integers.add, 0);
    }

    public boolean isPerfect(int number) {
        return sum(factors(number)) - number == number;
    }

    public boolean isAbundant(int number) {
        return sum(factors(number)) - number > number;
    }

    public boolean isDeficient(int number) {
        return sum(factors(number)) - number < number;
    }
}
While the proffered solution is much (much) better than the original "imperative" solution, I thought I would show what it could look like in Factor:
: factor? ( m n -- ? )
    mod zero? ;

: factors ( n -- seq )
    dup [1,b] [ factor? ] with filter ;

: perfect? ( n -- ? )
    [ factors sum ] [ - ] [ = ] tri ;

: abundant? ( n -- ? )
    [ factors sum ] [ - ] [ > ] tri ;

: deficient? ( n -- ? )
    [ factors sum ] [ - ] [ < ] tri ;
What do you think?

Sunday, July 10, 2011

Substrings

One year ago, I wrote about some new combinatoric functions that I wrote. I had a recent need for a couple of "substring" functions. Since I couldn't find them in Factor's standard library, I thought I would contribute these:

all-subseqs

Our first word takes a sequence, and then returns all (consecutive) subsequences that can be found. In Factor, the grouping vocabulary provides a clumping feature that splits a sequence into overlapping, fixed-length, subsequences. We can use this to find clumps of every possible length to solve this problem:
USING: grouping kernel math.ranges sequences ;

: all-subseqs ( seq -- seqs )
    dup length [1,b] [ <clumps> ] with map concat ;
You can see how this works:
( scratchpad ) "abcd" all-subseqs .
{ "a" "b" "c" "d" "ab" "bc" "cd" "abc" "bcd" "abcd" }
Note: we specifically don't include the "empty" string in the results.

longest-subseq

Several algorithms and implementations can be found for the longest common substring problem. Basically, we want a word that returns the longest (consecutive) substring that is common between two strings.

Using locals, I translated the Python solution:

USING: arrays kernel locals math math.ranges sequences ;

:: longest-subseq ( seq1 seq2 -- subseq )
    seq1 length :> len1
    seq2 length :> len2
    0 :> n!
    0 :> end!
    len1 1 + [ len2 1 + 0 <array> ] replicate :> table
    len1 [1,b] [| x |
        len2 [1,b] [| y |
            x 1 - seq1 nth
            y 1 - seq2 nth = [
                y 1 - x 1 - table nth nth 1 + :> len
                len y x table nth set-nth
                len n > [ len n! x end! ] when
            ] [ 0 y x table nth set-nth ] if
        ] each
    ] each end n - end seq1 subseq ;
Below, you can see how it works:
( scratchpad ) "abc" "def" longest-subseq .
""

( scratchpad ) "abcd" "abcde" longest-subseq .
"abcd"
Note: don't confuse this with the longest common subsequence problem (see substring vs. subsequence for more details), which is implemented in Factor by the lcs vocabulary.

Thursday, July 7, 2011

99 Bottles

The 99 Bottles Of Beer project has an entry for Factor. Unfortunately, it's for version 0.83 (latest released version is 0.94) and some minor changes have come into the language since then. Below, I contribute an updated version:

USING: formatting io kernel math math.ranges sequences ;

: verse ( n -- )
    dup "%d bottles of beer on the wall, " printf
    dup "%d bottles of beer.\n" printf
    "Take one down and pass it around, " write
    1 - "%d bottles of beer on the wall.\n" printf ;

: verse-1 ( -- )
    "1 bottle of beer on the wall, " write
    "1 bottle of beer." print
    "Take one down and pass it around, " write
    "no more bottles of beer on the wall." print ;

: verse-0 ( -- )
    "No more bottles of beer on the wall, " write
    "no more bottles of beer." print
    "Go to the store and buy some more, " write
    "99 bottles of beer on the wall." print ;

: 99bottles ( -- )
    99 2 [a,b] [ verse ] each verse-1 verse-0 ;

Shorter?

We can shorten this a bit by taking a few poetic liberties with the song. While doing that, we can make it flexible to allow any (positive) number of bottles:

USING: formatting io kernel math math.ranges sequences ;

: verse ( n -- )
    dup "%d bottles of beer on the wall, " printf
    dup "%d bottles of beer.\n" printf
    "Take one down and pass it around, " write
    1 - "%d bottles of beer on the wall.\n" printf ;

: last-verse ( -- )
    "Go to the store and buy some more, " write
    "no more bottles of beer on the wall!" print ;

: bottles ( n -- )
    1 [a,b] [ verse ] each last-verse ;

This solution is similar to the one posted on the Rosetta Code project.

Longer!

You might notice that a lot of the text is duplicative, so perhaps we can improve this solution by factoring out parts of the text into small and reusable functions.

USING: combinators formatting io kernel math sequences ;

: #bottles ( n -- str )
    {
        { 1 [ "1 bottle" ] }
        { 0 [ "no more bottles" ] }
        [ "%d bottles" sprintf ]
    } case " of beer" append ;

: on-the-wall ( n -- )
    #bottles dup "%s on the wall, %s.\n" printf ;

: take-one-down ( n -- )
    "Take one down and pass it around, " write
    #bottles "%s on the wall.\n" printf ;

: take-bottles ( n -- )
    [ dup zero? ] [
        [ on-the-wall ] [ 1 - dup take-one-down ] bi
    ] until on-the-wall ;

: go-to-store ( n -- )
    "Go to the store and buy some more, " write
    #bottles "%s on the wall.\n" printf ;

: bottles ( n -- )
    [ take-bottles ] [ go-to-store ] bi ;

It's a bit longer than the original and, in several ways, not as easy to understand. If we wanted to reuse this functionality elsewhere, it's a clear win. But, if we want to simply generate the "bottles song", perhaps the first or second way is better.

Extra Credit

For fun, I thought we could use our computer to sing to us (if you're using Mac OS X):

( scratchpad ) [ 99bottles ] with-string-writer
               "say \"%s\"" sprintf try-process

The code for this is on my Github.

Monday, July 4, 2011

Fourth of July

Today is Independence Day in the United States. Several lists of fantastic facts are being shared, including that today is the "biggest beer-selling holiday of the year". However, while not as fantastic, here are a few fun facts about today that can be derived using Factor.

Using the calendar vocabulary, we can create a timestamp object for today's date:

( scratchpad ) 2011 7 4 <date> .
T{ timestamp
    { year 2011 }
    { month 7 }
    { day 4 }
    { gmt-offset T{ duration { hour -7 } } }
}

We can see that today is the 185th day of the year:

( scratchpad ) 2011 7 4 <date> day-of-year .
185

Today is also the 27th week of the year (using weeks starting on Sunday - there might be a different answer using week-of-year-monday:

( scratchpad ) 2011 7 4 <date> week-of-year-sunday .
27

This year, July 4th falls on a Monday, but its a different day of the week each year. We can see which days of the week Independence Day falls on over the last few years:

( scratchpad ) 2011 2000 [a,b] [
                   dup 7 4 <date> day-name "%s: %s\n" printf
               ] each
2011: Monday
2010: Sunday
2009: Saturday
2008: Friday
2007: Wednesday
2006: Tuesday
2005: Monday
2004: Sunday
2003: Friday
2002: Thursday
2001: Wednesday
2000: Tuesday

And, finally, the most recent version of Factor (0.94) was released on September 18, 2010, which was 289 days ago.

( scratchpad ) 2011 7 4 <date> 2010 9 18 <date>
               time- duration>days .
289