Re: Factor: June 2014

Wednesday, June 25, 2014

Quicksort

Sorting algorithms are a frequent element to computer science education, conversation amongst programmers, and job interviews. There are many different versions with varying tradeoffs of performance and technique.

I noticed that Rosetta Code has a page on Quicksort implementations. I thought it might make a nice example of translating pseudocode to Factor.

simple quicksort

The "simple quicksort algorithm" has the following pseudocode:

function quicksort(array)
    less, equal, greater := three empty arrays
    if length(array) > 1  
        pivot := select any element of array
        for each x in array
            if x < pivot then add x to less
            if x = pivot then add x to equal
            if x > pivot then add x to greater
        quicksort(less)
        quicksort(greater)
        array := concatenate(less, equal, greater)

We can copy it verbatim using the ability to have named local variables:

:: quicksort ( seq -- sorted-seq )
    seq length 1 > [
        V{ } clone :> less
        V{ } clone :> equal
        V{ } clone :> greater
        seq first :> pivot
        seq [| x |
            x pivot <=> {
                { +lt+ [ x less push ] }
                { +eq+ [ x equal push ] }
                { +gt+ [ x greater push ] }
            } case
        ] each
        less quicksort equal greater quicksort 3append
    ] [ seq ] if ;

Even though local variables can be convenient, we discourage using them if library words or simpler concepts can express the same logic. Noticing that this partitions the sequence, and then joins the parts, we can make it a bit shorter using some available library words:

: quicksort ( seq -- sorted-seq )
    dup empty? [
      unclip [
          '[ _ before? ] partition [ quicksort ] bi@
      ] keep prefix append
    ] unless ;

Neither of these is particularly fast, since they involve the creation of a lot of temporary sequences. There is a better (meaning faster and not really more complex) version available.

better quicksort

The "better quicksort algorithm" is an in-place version that uses swaps to move items into a sorted order. It has the following pseudocode:

function quicksort(array)
    if length(array) > 1
        pivot := select any element of array
        left := first index of array
        right := last index of array
        while left ≤ right
            while array[left] < pivot
                left := left + 1
            while array[right] > pivot
                right := right - 1
            if left ≤ right
                swap array[left] with array[right]
                left := left + 1
                right := right - 1
        quicksort(array from first index to right)
        quicksort(array from left to last index)

We can take a similar translation approach to the first example (using some unsafe words to avoid bounds-checking and mutable local variables) to create this version:

:: (quicksort) ( seq from to -- )
    from to < [
        from to + 2/ seq nth-unsafe :> pivot
        from :> left!
        to :> right!

        [ left right <= ] [
            [ left seq nth-unsafe pivot before? ]
            [ left 1 + left! ] while
            [ right seq nth-unsafe pivot after? ]
            [ right 1 - right! ] while
            left right <= [
                left right seq exchange-unsafe
                left 1 + left!
                right 1 - right!
            ] when
        ] while

        seq from right (quicksort)
        seq left to (quicksort)
    ] when ; inline recursive

: quicksort ( seq -- )
    0 over length 1 - (quicksort) ;

This is faster, although about 3x slower than our current merge sort algorithm. There are probably ways we could make it faster (one I noticed and filed an issue to track that also makes merge sort faster).

I have committed a version of this in the sorting.quick vocabulary that I hope to use for faster in-place sorting in the standard library.

Sunday, June 22, 2014

World Cup

Many people are watching the FIFA World Cup 2014 that is going on right now in Brazil. A few days ago, someone posted a gist for following the World Cup in six lines of Python 3. Several people tried to improve it, down to four lines, then down to one or two lines of code.

Without worrying too much about lines of code, here is something similar in Factor.

: worldcup. ( -- )
    "http://worldcup.sfg.io/matches" http-get nip json>
    [ "status" of "completed" = ] filter
    [
        [ "home_team" of ] [ "away_team" of ] bi
        [ [ "country" of ] [ "goals" of ] bi ] bi@
        "%s %s x %s %s\n" printf
    ] each ;

And if you run it, you'll get something like this:

IN: scratchpad worldcup.
Brazil 3 x Croatia 1
Mexico 1 x Cameroon 0
Spain 1 x Netherlands 5
Chile 3 x Australia 1
Colombia 3 x Greece 0
Ivory Coast 2 x Japan 1
Uruguay 1 x Costa Rica 3
England 1 x Italy 2
Switzerland 2 x Ecuador 1
France 3 x Honduras 0
Argentina 2 x Bosnia and Herzegovina 1
Iran 0 x Nigeria 0
Germany 4 x Portugal 0
Ghana 1 x USA 2
Belgium 2 x Algeria 1
Russia 1 x Korea Republic 1
Brazil 0 x Mexico 0
Cameroon 0 x Croatia 4
Spain 0 x Chile 2
Australia 2 x Netherlands 3
Colombia 2 x Ivory Coast 1
Japan 0 x Greece 0
Uruguay 2 x England 1
Italy 0 x Costa Rica 1
Switzerland 2 x France 5
Honduras 1 x Ecuador 2
Argentina 1 x Iran 0
Nigeria 1 x Bosnia and Herzegovina 0
Germany 2 x Ghana 2

Extra Credit

If we wanted to engineer this a bit more, we could start adding to the example.

First, we could define a tuple class to hold the result of each game. This isn't really necessary, but it can be nice to see all the fields that are available, and to represent it as an object rather than just a hashtable:

TUPLE: game home_team home_team_events home_team_tbd
away_team away_team_events away_team_tbd winner match_number
datetime location status ;

Then we could get all the game results as tuples, using from-slots to convert from an array of hashtable of attributes:

: worldcup ( -- games )
    "http://worldcup.sfg.io/matches" http-get nip json>
    [ game from-slots ] map ;

Next, having fun with colors, we use character styles to print the winner in bold green text.

CONSTANT: winner-style H{
    { foreground COLOR: MediumSeaGreen }
    { font-style bold }
}

And then, using more code than is probably necessary, we print out each team, making sure to format the winner using the style we just defined (using locals for convenience):

: game. ( game -- )
    [let
        [ home_team>> ] [ away_team>> ] [ winner>> ] tri
            :> ( home away winner )

        home "country" of dup winner =
        [ winner-style format ] [ write ] if bl
        home "goals" of number>string write

        " x " write

        away "country" of dup winner =
        [ winner-style format ] [ write ] if bl
        away "goals" of number>string write nl
    ] ;

We want to see the completed games, so we can make a word to filter the list of games.

: completed-games ( games -- games' )
    [ status>> "completed" = ] filter ;

Finally, putting all this together, we make one word to print out all the completed games:

: worldcup. ( -- )
    worldcup completed-games [ game. ] each ;

The code for this is on my GitHub.

Friday, June 13, 2014

Filename Sanitization

I came across the Zaru project that provides filename sanitization for Ruby. You can learn a bit about filenames reading the article on Wikipedia. I thought it might be a nice feature to implement something like this for Factor.

The rules for sanitization are relatively simple, so I will list and then implement each one:

1. Certain characters generally don't mix well with certain file systems, so we filter them:

: filter-special ( str -- str' )
    [ "/\\?*:|\"<>" member? not ] filter ;

2. ASCII control characters (0x00 to 0x1f) are not usually a good idea, either:

: filter-control ( str -- str' )
    [ control? not ] filter ;

3. Unicode whitespace is trimmed from the beginning and end of the filename and collapsed to a single space within the filename:

: filter-blanks ( str -- str' )
    [ blank? ] split-when harvest " " join ;

4. Certain filenames are reserved on Windows and are filtered (substituting a "file" placeholder name):

: filter-windows-reserved ( str -- str' )
    dup >upper {
        "CON" "PRN" "AUX" "NUL" "COM1" "COM2" "COM3" "COM4"
        "COM5" "COM6" "COM7" "COM8" "COM9" "LPT1" "LPT2" "LPT3"
        "LPT4" "LPT5" "LPT6" "LPT7" "LPT8" "LPT9"
    } member? [ drop "file" ] when ;

5. Empty filenames are not allowed, replaced instead with file:

: filter-empty ( str -- str' )
    [ "file" ] when-empty ;

6. Filenames that begin with only a "dot" character are replaced with file:

: filter-dots ( str -- str' )
    dup first CHAR: . = [ "file" prepend ] when ;

Putting it all together, and requiring the filename to be no more than 255 characters:

: sanitize-path ( path -- path' )
    filter-special
    filter-control
    filter-blanks
    filter-windows-reserved
    filter-empty
    filter-dots
    255 short head ;

The code for this (and some tests) is on my GitHub.

Tuesday, June 10, 2014

Swift Ranges

Looking at the documentation for the Swift programming language recently released by Apple, I noticed they have support for integer ranges, similar to how the range objects work in Factor.

In Swift, you can get a range of the integers 2 through 6 by doing 2...6 and the integers 2 through 5 by doing 2..6. Notice the use of three or two dots to indicate whether the range includes the last number, or not, respectively.

I thought it would be fun to implement a similar syntax for Factor.

First, you can show that:

IN: scratchpad 2 6 [a,b) >array .
{ 2 3 4 5 }

IN: scratchpad 2 6 [a,b] >array .
{ 2 3 4 5 6 }

Similar to how we implemented fat arrows (also known as "pair rockets" or "hash rockets"), we can define the following syntax words:

SYNTAX: .. dup pop scan-object [a,b) suffix! ;

SYNTAX: ... dup pop scan-object [a,b] suffix! ;

And then use them:

IN: scratchpad 2 .. 6 >array .
{ 2 3 4 5 }

IN: scratchpad 2 ... 6 >array .
{ 2 3 4 5 6 }

Comparing k-NN in Factor

Recently a pair of blog posts compared implementations of a k-nearest neighbour (k-NN) classifier in F# and OCaml. Subsequently an implementation showing performance in Rust got my attention and I thought it might be nice to demonstrate a version in Factor.

The first OCaml version is 30 lines of code and takes 21 seconds on my laptop:

$ sloccount classifyDigits.ml
ml:              30 (100.00%)

$ time ./classifyDigits
Percentage correct:94.400000

real 0m21.292s
user 0m21.152s
sys 0m0.120s

The second OCaml version is 47 lines of code and takes 12 seconds:

$ sloccount classifyDigitsArray.ml
ml:              47 (100.00%)

$ time ./classifyDigitsArray 
Percentage correct:94.400000

real 0m12.563s
user 0m12.434s
sys 0m0.120s

Note: I couldn't get the parallel version to run, but would assume it to have the same 2x speedup that the author saw.

Simple

It is often useful to start with the simplest possible code before trying to optimize for performance. I decided to parse the training and validation files (containing comma-separated values, the first of which is the label and the subsequent values are observations) into an array of arrays.

: slurp-file ( path -- {pixels,label} )
    ascii file-lines rest [
        "," split [ string>number ] map unclip 2array
    ] map ;

: classify ( training pixels -- label )
    '[ first _ distance ] infimum-by second ;

: k-nn ( -- )
    "~/trainingsample.csv" slurp-file
    "~/validationsample.csv" slurp-file
    [ [ first2 [ classify ] [ = ] bi* ] with count ]
    [ length ] bi / 100.0 * "Percentage correct: %.1f\n" printf ;

You can see that it produces the desired output of 94.4% correct, and takes about 40 seconds on my laptop.

IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.4
Running time: 40.283777984 seconds

Not too bad for 11 lines of simple code, but slower than it could be. Much of the performance penalty in this version is due to the large amount of generic dispatch, which is something we hope to reduce in future versions of Factor.

Faster

I noticed all the observed values were in the range [0-255], so thought a simple speedup might be to store them in a byte-array, and instead of using the builtin distance word, make my own that specializes on byte-arrays.

: slurp-file ( path -- {pixels,label} )
    ascii file-lines rest [
        "," split [ string>number ] B{ } map-as unclip 2array
    ] map ;

: distance ( x y -- z )
    { byte-array byte-array } declare 0 [ - sq + ] 2reduce ;

: classify ( training pixels -- label )
    '[ first _ distance ] infimum-by second ;

: k-nn ( -- )
    "~/trainingsample.csv" slurp-file
    "~/validationsample.csv" slurp-file
    [ [ first2 [ classify ] [ = ] bi* ] with count ]
    [ length ] bi / 100.0 * "Percentage correct: %f\n" printf ;

With that simple change, we get 7x faster than our previous version and roughly as fast as the fastest parallel OCaml version!

IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.400000
Running time: 5.708627884 seconds

The Rust version requires a nightly build and I haven't had a chance to test it, but I assume it is a bit faster, and discussions on r/rust, r/programming, and Hacker News show some fast versions in C++ and D as well.

The code for this is in my GitHub.

Thursday, June 5, 2014

2 + 2 = 5

There is an old programmer joke that wonders if 2 + 2 = 5 for very large values of 2 (someone even made it into a fun T-shirt).

Well, a challenge on StackExchange to write a program that makes 2 + 2 = 5 caught my eye. I wondered what a solution might look like in Factor.

If you run this code:

IN: scratchpad << "\x32" create-in 5/2 define-constant >>

Then... whoa!

IN: scratchpad 2 2 + .
5

As it turns out, a little bit ago I noticed that you can redefine numbers this way and filed a bug to start a conversation about this "feature".

This exploits the parser, particularly the parse-datum word which searches a token for an already defined word, then if not found, tries to parse it as a number. Usually, we disallow words from being defined by a number using scan-word-name, but that doesn't prevent you from doing it yourself as in the example above.

P.S., in the spirit of the Haskell solution (and anyone else that craves infix notation):

CONSTANT: 2+2 5

IN: scratchpad 2+2 .
5

Monday, June 2, 2014

Pagination

Most of you have used the pagination on various websites, usually in the context of search results or forum posts. I thought it would be fun to build a simple "paginator", using Factor.

For example, if you are on page 23 of 28 total pages, you might see something like this, where you show the selected page and other pages that you can quickly link to:

<< 1 2 ... 21 22 [23] 24 25 ... 27 28 >>

Creating a specification from this, our goal will be to show:

the first two pages
the selected page (with two pages before and after)
the last two pages

Using the output>array smart combinator (and lexical variables), we can generate a sequence of page numbers, filtered to make sure we only allow valid page numbers between 1 and #pages:

:: pages-to-show ( page #pages -- seq )
    [
        1 2 page {
            [ 2 - ]
            [ 1 - ]
            [ ]
            [ 1 + ]
            [ 2 + ]
        } cleave #pages [ 1 - ] keep
    ] output>array members
    [ 1 #pages between? ] filter ;

Some unit tests demonstrate that this works for our "spec" pretty well:

{ { 1 2 3 99 100 } } [ 1 100 pages-to-show ] unit-test
{ { 1 2 21 22 23 24 25 27 28 } } [ 23 28 pages-to-show ] unit-test
{ { 1 2 3 } } [ 1 3 pages-to-show ] unit-test

Lastly, we can split the page numbers to display ellipsis on gaps, and print something like our original goal above:

:: pages-to-show. ( page #pages -- )
    page #pages pages-to-show
    [ swap - 1 = ] monotonic-split { f } join
    [
        [
            [ number>string ]
            [ page = [ "[" "]" surround ] when ] bi
        ] [ "..." ] if*
    ] map " " join "<< " " >>" surround print ;

See, it works!

IN: scratchpad 1 100 pages-to-show.
<< [1] 2 3 ... 99 100 >>

IN: scratchpad 23 28 pages-to-show.
<< 1 2 ... 21 22 [23] 24 25 ... 27 28 >>

IN: scratchpad 1 3 pages-to-show.
<< [1] 2 3 >>

Using this in a web application is left as an exercise for the reader, although it might be nice to create a furnace.pagination vocabulary that automatically handles this in our web framework.

You can find this code on my GitHub.