Monday, October 28, 2013

Text Summary

Several months ago, Shlomi Babluki wrote about how to build your own [text] summary tool in response to Yahoo's purchase of Summly. It turns out to be a nice introduction to basic automatic summarization techniques. You can download and read Shlomi's Python implementation on GitHub.

Below, I show the simple text summary implemented in Factor.

tokenization

We need a way to split text into sentences. I chose a simple regular expression, but there are many other approaches which might be useful for handling more complicated sentence patterns, including the Punkt sentence tokenizer used by NLTK.

: split-sentences ( content -- sentences )
    [ blank? ] split-when harvest " " join
    R/ (?<=[.!?]|[.!?][\'"])\s+/ re-split ;

We also need to split text into paragraphs. We simply look for an empty line between blocks of text, and then split each block of text into sentences.

: split-paragraphs ( content -- paragraphs )
    "\n\n" split-subseq [ split-sentences ] map ;

ranking

We score two sentences by a simple formula that counts the number of words they share divided by the total number of words in both sentences. This part could be improved in a variety of ways such as removing stop words or using stemming.

: sentence-score ( sentence1 sentence2 -- n )
    [ [ blank? ] split-when ] bi@
    2dup [ length ] bi@ + [ 2drop 0 ] [
        [ intersect length ] [ 2 / ] bi* /
    ] if-zero ;

We are going to build a map of each sentence to its total score against the other sentences. We iterate all-combinations, calculating the sentence scores and adding it to the score for both sentences.

: sentence-ranks ( paragraphs -- ranks )
    concat 2 all-combinations H{ } clone [
        dup '[
            [ sentence-score ] 2keep
            [ nip _ at+ ]
            [ drop _ at+ ] 3bi
        ] assoc-each
    ] keep ;

We use the sentence rankings to choose a best sentence for each paragraph (ignoring any paragraph with only one sentence):

: best-sentence ( paragraph ranks -- sentence )
    over length 2 < [ 2drop "" ] [
        '[ _ at 0 or ] supremum-by
    ] if ;

summarization

Calculating the summary is as simple as splitting the text into paragraphs, ranking sentences by their total scores, and then using that ranking to choose the best sentence for each paragraph:

: summary ( content -- summary )
    split-paragraphs dup sentence-ranks
    '[ _ best-sentence ] map harvest ;

And for convenience, we can make a word that builds a summary, wraps each paragraph to 72 characters, and prints it out:

: summary. ( content -- )
    summary [ "" like 72 wrap-string print nl ] each ;

For fun, we can try it out, using the wikipedia to summarize the already "simple" article for Programming (removing the links to footnotes that mess up our naive split-sentence algorithm):

IN: scratchpad "simple" [
                   [ "Programming" article. ] with-string-writer
                   R/ \[\d+\]/ "" re-replace summary.
               ] with-language
These instructions come in different languages; they are called
programming languages.

A program is a set of instructions for the computer to follow.

Once a program has been compiled, the instructions in "machine form"
are written into a file that contains a series of numbers that the
computer can understand.

This code for this is available on my GitHub.

No comments: