Monday, October 28, 2013

Text Summary

Several months ago, Shlomi Babluki wrote about how to build your own [text] summary tool in response to Yahoo's purchase of Summly. It turns out to be a nice introduction to basic automatic summarization techniques. You can download and read Shlomi's Python implementation on GitHub.

Below, I show the simple text summary implemented in Factor.

tokenization

We need a way to split text into sentences. I chose a simple regular expression, but there are many other approaches which might be useful for handling more complicated sentence patterns, including the Punkt sentence tokenizer used by NLTK.

: split-sentences ( content -- sentences )
    [ blank? ] split-when harvest " " join
    R/ (?<=[.!?]|[.!?][\'"])\s+/ re-split ;

We also need to split text into paragraphs. We simply look for an empty line between blocks of text, and then split each block of text into sentences.

: split-paragraphs ( content -- paragraphs )
    "\n\n" split-subseq [ split-sentences ] map ;

ranking

We score two sentences by a simple formula that counts the number of words they share divided by the total number of words in both sentences. This part could be improved in a variety of ways such as removing stop words or using stemming.

: sentence-score ( sentence1 sentence2 -- n )
    [ [ blank? ] split-when ] bi@
    2dup [ length ] bi@ + [ 2drop 0 ] [
        [ intersect length ] [ 2 / ] bi* /
    ] if-zero ;

We are going to build a map of each sentence to its total score against the other sentences. We iterate all-combinations, calculating the sentence scores and adding it to the score for both sentences.

: sentence-ranks ( paragraphs -- ranks )
    concat 2 all-combinations H{ } clone [
        dup '[
            [ sentence-score ] 2keep
            [ nip _ at+ ]
            [ drop _ at+ ] 3bi
        ] assoc-each
    ] keep ;

We use the sentence rankings to choose a best sentence for each paragraph (ignoring any paragraph with only one sentence):

: best-sentence ( paragraph ranks -- sentence )
    over length 2 < [ 2drop "" ] [
        '[ _ at 0 or ] supremum-by
    ] if ;

summarization

Calculating the summary is as simple as splitting the text into paragraphs, ranking sentences by their total scores, and then using that ranking to choose the best sentence for each paragraph:

: summary ( content -- summary )
    split-paragraphs dup sentence-ranks
    '[ _ best-sentence ] map harvest ;

And for convenience, we can make a word that builds a summary, wraps each paragraph to 72 characters, and prints it out:

: summary. ( content -- )
    summary [ "" like 72 wrap-string print nl ] each ;

For fun, we can try it out, using the wikipedia to summarize the already "simple" article for Programming (removing the links to footnotes that mess up our naive split-sentence algorithm):

IN: scratchpad "simple" [
                   [ "Programming" article. ] with-string-writer
                   R/ \[\d+\]/ "" re-replace summary.
               ] with-language
These instructions come in different languages; they are called
programming languages.

A program is a set of instructions for the computer to follow.

Once a program has been compiled, the instructions in "machine form"
are written into a file that contains a series of numbers that the
computer can understand.

This code for this is available on my GitHub.

Saturday, October 19, 2013

Text-to-Speech

To those of us from a certain decade of computing, the phrase "text-to-speech" reminds us favorably of Dr. Sbaitso. A fun take on that reminiscence is an article titled "Dr. Sbaitso was my only friend".

It would be neat if we could have access to text-to-speech functionality from Factor. And it would be especially neat if it was cross-platform!

We'll start by defining a speak-text word that is a generic word that dispatches on the value of the os object, so we can provide platform-specific implementations:

HOOK: speak-text os ( str -- )

Mac OS X

On Mac OS X, we cheat a bit and just call out to the say command-line tool built into Mac OS X:

M: macosx speak-text
    "say \"%s\"" sprintf try-process ;

We just use the default voice set in System Preferences, but changing the voice is just one of the many options available including adjusting the number of words spoken per minute. For more information on Mac OS X support for speech, read the Speech Synthesis Programming Guide.

Linux

On Linux, text-to-speech is not builtin. Instead, I decided to use the Festival Speech Synthesis System, which includes a command-line tool that can be configured to speak text:

M: linux speak-text
    "festival --tts" utf8 [ print ] with-process-writer ;

In addition to this, you can find a whole host of other features in the Festival manual.

Windows

On Windows, it would probably be cool to bind to the Microsoft Speech API, but that seemed a little bit harder than the quick-and-dirty approach I took.

Support required two commits to the main Factor repository by Doug Coleman and myself:

Those two commits allow us to implement speak-text on Windows:

M: windows speak-text
    translate-tts open-command play-command close-command ;

This code for this is available on my GitHub.

Monday, October 14, 2013

Kahan Summation

The Kahan summation algorithm is a "compensated summation" that "significantly reduces the numerical error in the total obtained by adding a sequence of finite precision floating point numbers, compared to the obvious approach".

Matt Adereth wrote a blog post a few days ago demonstrated some advantages of using Kahan, which I show below using Factor.

The Problem

To demonstrate the problem, let's consider the harmonic series (e.g., 1 + 1/2 + 1/3 + 1/4 + ...) as a series of rational numbers:

: harmonic-ratios ( n -- seq )
    [1,b] [ recip ] map ;

It gives the first n of the harmonic series:

IN: scratchpad 6 harmonic-ratios .
{ 1 1/2 1/3 1/4 1/5 1/6 }

We define a "simple sum" as just adding the numbers in order:

: simple-sum ( seq -- n )
    0 [ + ] reduce ;

The simple sum of the first 10,000 harmonic ratios as a floating point number is:

IN: scratchpad 10,000 harmonic-ratios simple-sum >float .
9.787606036044382

But, if we use floating points instead of ratios to represent the harmonic numbers (e.g., 1.0 + 0.5 + 0.33333333 + 0.25 + ...):

: harmonic-floats ( n -- seq )
    harmonic-ratios [ >float ] map! ;

You can see that an error has been introduced (ending in "48" instead of "82"):

IN: scratchpad 10,000 harmonic-floats simple-sum .
9.787606036044348

If we reverse the sequence and add them from smallest to largest, there is a slightly different error (ending in "86"):

IN: scratchpad 10,000 harmonic-floats reverse simple-sum .
9.787606036044386

The Solution

The pseudocode for the Kahan algorithm can be seen on the Wikipedia page, using a running compensation for lost low-order bits:

function KahanSum(input)
    var sum = 0.0
    var c = 0.0
    for i = 1 to input.length do
        var y = input[i] - c
        var t = sum + y
        c = (t - sum) - y
        sum = t
    return sum

We could translate this directly using local variables to hold state, but instead I tried to make it a bit more concatenative (and possibly harder to read in this case):

: kahan+ ( c sum elt -- c' sum' )
    rot - 2dup + [ -rot [ - ] bi@ ] keep ;

: kahan-sum ( seq -- n )
    [ 0.0 0.0 ] dip [ kahan+ ] each nip ;

You can see both forward and backward errors no longer exist:

IN: scratchpad 10,000 harmonic-floats kahan-sum .
9.787606036044382

IN: scratchpad 10,000 harmonic-floats reverse kahan-sum .
9.787606036044382

Nifty! Thanks, Matt!

Friday, October 11, 2013

Morse Palindromes?

There was a fun post today on Metafilter about the longest palindrome in Morse code being "intransigence" (along with other odd facts).

This jumped out to me for a few reasons:

So, maybe we can confirm that "intransigence" really is the longest palindrome!

palindrome?

The basic definition for a "palindrome" is a word that "reads the same backward as forward". Given that, it's easy to build a first version that checks this:

: palindrome? ( str -- ? ) dup reverse = ;

However, in our tutorial we suggest building a more robust version that normalizes the input to handle palindromes such as "A man, a plan, a canal: Panama.":

: normalize ( str -- str' ) [ Letter? ] filter >lower ;

: palindrome? ( str -- ? ) normalize dup reverse = ;

morse-palindrome?

For our morse code palindrome detector, we need to convert our string to morse code, removing extra spaces that the morse code vocabulary adds between letters, before checking for palindrome-ness:

: normalize-morse ( str -- str' )
    normalize >morse [ blank? not ] filter ;

: morse-palindrome? ( str -- ? )
    normalize-morse dup reverse = ;

The longest word in the dictionary that is a morse palindrome is:

IN: scratchpad "/usr/share/dict/words" ascii file-lines
               [ morse-palindrome? ] filter longest .
"incalescence"

Wait, that isn't "intransigence"!

Well, no, but "incalescence" has the same number of letters (13) and happens to be slightly longer in morse code. So maybe it's a tie, or maybe they should update their trivia!

In fact, there are several "longest" morse palindromes:

IN: scratchpad "/usr/share/dict/words" ascii file-lines
               [ morse-palindrome? ] filter all-longest .
{
    "incalescence"
    "intercentral"
    "predestinate"
    "predestitute"
    "protectorate"
    "Raphaelesque"
    "ultranatural"
}

P.S., it looks like "Raphaelesque" might be the longest morse palindrome by morse code length.

P.P.S., for some reason /usr/share/dict/words doesn't contain the word "intransigence".

Thursday, October 10, 2013

AppleScript

I find myself often developing on Mac OS X. Given all the new technologies and programming languages available, it is sometimes easy to forget about AppleScript, a scripting language that has been available for over twenty years, and reads and writes a bit like "plain English". AppleScript can be run from the command-line using the osascript tool, and edited and run using the "AppleScript Editor" application.

Wouldn't it be cool if we could run AppleScripts from Factor?

Using the NSAppleScript Class Reference, we can see that, basically, we need to create an NSAppleScript initialized from a source string, and then execute the script:

USING: cocoa cocoa.application cocoa.classes kernel ;

: run-apple-script ( str -- )
    [ NSAppleScript -> alloc ] dip
    <NSString> -> initWithSource: -> autorelease
    f -> executeAndReturnError: drop ;

Right now we are ignoring the return values, but supposedly we could figure out how to interpret the return value (which is a NSAppleEventDescriptor) and pass values back from the script to Factor.

You can see that it works by trying to display a simple dialog:

If you're curious to see what else you can do with AppleScript, check out Doug's AppleScripts for iTunes or the MacScripter.net forum.

This is available in the development version of Factor.

Sunday, October 6, 2013

Rock-paper-scissors

The ever-popular game of rock-paper-scissors is a fun game to play with friends and, as it turns out, a fun game to implement in Factor.

We have an open issue to finish some planned changes to multi-methods. While largely a syntax improvement, one of the challenges will be keeping the simple case fast while providing for enhanced dispatch ability. As a way of showing how the current syntax works, I thought I would implement the "rock-paper-scissors" game.

First, we define each type of object in our game:

SINGLETONS: rock paper scissors ;
Note: we are not playing rock-paper-scissors-lizard-spock, which is a bit more complicated and possibly a lot more fun.

Next, we need to determine a winner using multi-methods. You'll notice that our generic method dispatches off the types of two objects. We can use any type here, including predicate classes, but can simply dispatch off the singletons we defined above:

FROM: multi-methods => GENERIC: METHOD: ;

GENERIC: beats? ( obj1 obj2 -- ? )

METHOD: beats? { scissors paper } 2drop t ;
METHOD: beats? { rock scissors } 2drop t ;
METHOD: beats? { paper rock } 2drop t ;
METHOD: beats? { object object } 2drop f ;

Each play of the game will determine the outcome (win, lose, or tie) and print a summary:

: play. ( obj1 obj2 -- )
    {
        { [ 2dup beats? ] [ "WIN" ] }
        { [ 2dup = ] [ "TIE" ] }
        [ "LOSE" ]
    } cond "%s vs. %s: %s\n" printf ;

The computer will choose randomly amongst the possibilities:

: computer ( -- obj )
    { rock paper scissors } random ;

And for fun, we will define words to allow the user to play the computer at the listener:

: rock ( -- ) \ rock computer play. ;

: paper ( -- ) \ paper computer play. ;

: scissors ( -- ) \ scissors computer play. ;

A sample output from a few games:

The code for this is available on my GitHub.

Tuesday, October 1, 2013

Color Support

For a fairly long time, Factor has had support for RGB, HSV, and grayscale color spaces.

More recently, I got interested in colors, even having some fun by building color tab completion. We have now added support for CMYK, HSL, RYB, YIQ, and YUV color spaces. All colors support an alpha channel and conversion to and from RGB colors!

You could say two colors are equal if their RGB components are very, very, very close:

: color= ( color1 color2 -- ? )
    [ >rgba-components 4array ] bi@
    [ 0.00000000000001 ~ ] 2all? ;

Using that, you can write a test to confirm that our round trip through color spaces preserves the original color information:

{ t } [
    COLOR: sky-blue
    >hsva
    >yiqa
    >cmyka
    >yuva
    >hsla
    >ryba
    >rgba
    COLOR: sky-blue color=
] unit-test

This is now available in the development version of Factor!

Note: we also support basic color mixing, with the ability to do linear gradients between colors.