Thursday, October 28, 2010

Syntax Highlighting

Update: I modified the code examples below to use the new colors.hex vocabulary.

It's a sometimes overlooked feature that Factor contains a syntax highlighting vocabulary that supports a decent number of programming languages. The vocabulary is called xmode, and is used to highlight entries submitted to Factor's pastebin.

I thought it would be fun to use xmode to implement syntax highlighting in the listener. When we're done, it will look something like this:


First, we will need to use several vocabularies:

USING: accessors assocs colors io io.encodings.utf8 io.files
io.styles kernel literals locals math math.parser sequences
xmode.catalog xmode.marker ;

We define named styles that will apply to the tokens that are parsed by the syntax highlighter. For convenience, we will use the colors.hex vocabulary to convert hexadecimal values to color tuples to build the stylesheet. In the stylesheet below, we use the same colors that are used in the pastebin.

CONSTANT: STYLES H{
    { "NULL"     H{ { foreground HEXCOLOR: 000000 } } }
    { "COMMENT1" H{ { foreground HEXCOLOR: cc0000 } } }
    { "COMMENT2" H{ { foreground HEXCOLOR: ff8400 } } }
    { "COMMENT3" H{ { foreground HEXCOLOR: 6600cc } } }
    { "COMMENT4" H{ { foreground HEXCOLOR: cc6600 } } }
    { "DIGIT"    H{ { foreground HEXCOLOR: ff0000 } } }
    { "FUNCTION" H{ { foreground HEXCOLOR: 9966ff } } }
    { "INVALID"  H{ { background HEXCOLOR: ffffcc }
                    { foreground HEXCOLOR: ff0066 } } }
    { "KEYWORD1" H{ { foreground HEXCOLOR: 006699 }
                    { font-style bold } } }
    { "KEYWORD2" H{ { foreground HEXCOLOR: 009966 }
                    { font-style bold } } }
    { "KEYWORD3" H{ { foreground HEXCOLOR: 0099ff }
                    { font-style bold } } }
    { "KEYWORD4" H{ { foreground HEXCOLOR: 66ccff }
                    { font-style bold } } }
    { "LABEL"    H{ { foreground HEXCOLOR: 02b902 } } }
    { "LITERAL1" H{ { foreground HEXCOLOR: ff00cc } } }
    { "LITERAL2" H{ { foreground HEXCOLOR: cc00cc } } }
    { "LITERAL3" H{ { foreground HEXCOLOR: 9900cc } } }
    { "LITERAL4" H{ { foreground HEXCOLOR: 6600cc } } }
    { "MARKUP"   H{ { foreground HEXCOLOR: 0000ff } } }
    { "OPERATOR" H{ { foreground HEXCOLOR: 000000 }
                    { font-style bold } } }
}

The xmode.catalog vocabulary provides support for looking up the type (or "mode") of the file. The xmode.marker vocabulary then provides support for converting each line into a stream of tokens. Each token allows access to a named style. Once we have the name of the appropriate style, we can then look it up and format the output.

Putting that all together, we can implement the highlight. word.

: highlight-tokens ( tokens -- )
    [
        [ str>> ] [ id>> ] bi
        [ name>> STYLES at ] [ f ] if* format
    ] each nl ;

: highlight-lines ( lines mode -- )
    [ f ] 2dip load-mode [
        tokenize-line highlight-tokens
    ] curry each drop ;

:: highlight. ( path -- )
    path utf8 file-lines [
        path over first find-mode highlight-lines
    ] unless-empty ;

You should be able to paste the above code examples into your listener, to try it for yourself.

Monday, October 25, 2010

Trending Github Projects

This weekend someone posted an article describing using Python with YQL to parse Repopular to retrieve a list of popular Github projects. Since mashups are all the rage these days, I thought I would implement this in Factor.


The Yahoo! way

The YQL that was used in the original article is:

use 'http://yqlblog.net/samples/data.html.cssselect.xml'
  as data.html.cssselect;

select * from data.html.cssselect
where url="repopular.com"
  and css="div.pad a"

We can use the http.client to send this query to Yahoo!, parse the returned JSON data using json.reader, and extract the HREFs to the popular projects. We can then filter them for the links which point to Github.

USING: assocs http.client json.reader kernel sequences ;

: the-yahoo-way ( -- seq )
    "http://query.yahooapis.com/v1/public/yql?q=use%20'http%3A%2F%2Fyqlb
log.net%2Fsamples%2Fdata.html.cssselect.xml'%20as%20data.html.cssselect%
3B%20select%20*%20from%20data.html.cssselect%20where%20url%3D%22repopula
r.com%22%20and%20css%3D%22div.pad%20a%22&format=json&diagnostics=true&ca
llback="
    http-get nip json> { "query" "results" "results" "a" }
    [ swap at ] each [ "href" swap at ] map 
    [ "http://github.com" head? ] filter ;

We can run this to see what the current trending projects are on Github.

( scratchpad ) the-yahoo-way [ . ] each
"http://github.com/sinatra/sinatra"
"http://github.com/Sutto/barista"
"http://github.com/pypy/pypy"
"http://github.com/dysinger/apparatus"
"http://github.com/videlalvaro/Thumper"
"http://github.com/alunny/sleight"
"http://github.com/vimpr/vimperator-plugins"

The other way

We can use the html.parser vocabulary to do it another way. Given some knowledge of the HTML returned by Repopular, we can extract the HREFs directly:

USING: accessors assocs html.parser http.client kernel sequences ;

: the-other-way ( -- seq )
    "http://repopular.com" http-get nip parse-html
    [ [ name>> "aside" = ] find drop ]
    [ [ name>> "aside" = ] find-last drop ]
    [ <slice> ] tri
    [ name>> "a" = ] filter
    [ attributes>> "href" swap at ] map
    [ "http://github.com" head? ] filter ;

We can see it produces the same results:

( scratchpad ) the-other-way [ . ] each
"http://github.com/sinatra/sinatra"
"http://github.com/Sutto/barista"
"http://github.com/pypy/pypy"
"http://github.com/dysinger/apparatus"
"http://github.com/videlalvaro/Thumper"
"http://github.com/alunny/sleight"
"http://github.com/vimpr/vimperator-plugins"

Sunday, October 24, 2010

Longest Palindrome

Greplin issued a programming challenge recently, where the first question involved finding the longest palindrome substring. It was then posted as a challenge to the Programming Praxis blog, and I thought I would contribute a solution in Factor.

First, some vocabularies that we will be using:

USING: fry kernel locals make math.ranges sequences unicode.case
unicode.categories ;

As part of the Factor language tutorial, the first program many people write in Factor is a word for detecting if something is a palindrome. The implementation of palindrome? (extended to support case-insensitive comparisons using the unicode vocabulary) looks like this:

: normalize ( str -- str' ) [ Letter? ] filter >lower ;

: palindrome? ( str -- ? ) normalize dup reverse = ;

The "obvious" (but not that fast) way to solve the problem is to examine every possible substring, adding to a list if it is a palindrome. The list of palindrome substrings can then be used to answer the question. This is how we'll implement it.

I thought it would be useful to split the problem into two steps. First, we need a way to enumerate all possible substrings (not including the "empty" substring), applying a quotation to each in turn.

:: each-subseq ( ... seq quot: ( ... x -- ... ) -- ... )
    seq length [0,b] [
        :> from
        from seq length (a,b] [
            :> to
            from to seq subseq quot call( x -- )
        ] each
    ] each ;

You can try this out in the listener, to see how it works:

( scratchpad ) "abc" [ . ] each-subseq
"a"
"ab"
"abc"
"b"
"bc"
"c"

Once we have that, it's pretty easy to build a word to look for palindrome substrings:

: palindromes ( str -- seq )
    [
        [ dup palindrome? [ , ] [ drop ] if ] each-subseq
    ] { } make ;

We can use the longest word that I implemented for my anagrams post to find the longest palindrome substring:

: longest ( seq -- subseq )
    dup 0 [ length max ] reduce '[ length _ = ] filter ;
Using this on the 1169-character string from the original challenge, we find 52 unique palindromes of 2 or more characters. The longest palindrome substring I found was a 7-character sequence.

Friday, October 22, 2010

Presentations in Factor

Just a few days ago, Daniel Ehrenberg gave a presentation at the Dynamic Language Symposium 2010 in Reno, Nevada. It was quite good, and gave a nice overview of the different language features that Factor supports.

Like many of the talks about Factor, it was made (and presented!) using Factor. Specifically, the slides vocabulary, which is a simple DSL for presentations. To show how it works, let's just jump directly to an example:

( scratchpad ) USING: help.markup slides ;

( scratchpad ) {
                   { $slide "Factor!"
                       { $url "http://factorcode.org" }
                       "Development started in 2003"
                       "Open source (BSD license)"
                       "Influenced by Forth, Lisp, and Smalltalk"
                       "Blurs the line between language and library"
                       "Interactive development"
                   }
               } slides-window

When run, this pops up a window that can be used for presentations. It will look something like this:


You can control the slideshow window using the keyboard:

  • Press DOWN to move to the next slide.
  • Press UP to move to the previous slide.
  • Press F to toggle full-screen mode.

In addition, the slides vocabulary supports some other features, such as embedding code and clickable links to vocabularies or words. Here is an example which does both of these:

( scratchpad ) {
                   { $slide "Code!"
                       "Try clicking on these:"
                       { $code "2 2 +" }
                       { $vocab-link "sequences" }
                       { $link nth }
                   }
               } slides-window

If you want to customize the look-and-feel of the presentation, you need to modify the stylesheet used to render the presentation.

To see it in action, you can watch Slava Pestov give a Google Tech Talk on Factor. The slides he used are available in the Factor source tree under extra/google-tech-talk.

Slides for most of the other talks that have been given are also available (they are a nice way to get a quick overview of the "big features" in Factor):

It's so cool, even Zed Shaw gave a presentation using the slides vocabulary.

Friday, October 15, 2010

Text-to-PDF

In this article, we will be building step-by-step a program for converting text files into PDF. The PDF specification is at version 1.7 (approximately 750 pages plus some supplements) and is available for download from the Adobe website.

The entire solution listed below is approximately 140 lines of code, and compares favorably to a 600 line Python version and a 450 line C version.

First, we will need a list of imports and a namespace:

USING: assocs calendar combinators environment formatting
grouping io io.files kernel make math math.ranges sequences
splitting xml.entities ;

IN: text-to-pdf

PDF files are essentially text files that contain numbered objects with display instructions and a cross-reference table that allows random access to those objects. We will need a word to create an object from its contents and number (e.g., start with n 0 obj and end with endobj):

: pdf-object ( str n -- str' )
    "%d 0 obj\n" sprintf "\nendobj" surround ;

References (used to provide pointers to objects) are of the form n 0 R where n is the number of the object being referred to.

The PDF standard has support for various data types, one of which are text strings. These strings require escaping to be able to include certain characters (and are surrounded by parentheses):

: pdf-string ( str -- str' )
    H{
        { HEX: 08    "\\b"  }
        { HEX: 0c    "\\f"  }
        { CHAR: \n   "\\n"  }
        { CHAR: \r   "\\r"  }
        { CHAR: \t   "\\t"  }
        { CHAR: \\   "\\\\" }
        { CHAR: (    "\\("  }
        { CHAR: )    "\\)"  }
    } escape-string-by "(" ")" surround ;

Hashtables of key/value properties are used frequently within the specification and are essentially space-separated key/value pairs surrounded by << and >>.

Our PDF file will start with an "info" section that contains a creation timestamp, author, and software details:

: pdf-info ( -- str )
    [
        "<<" ,
        "/CreationDate D:" now "%Y%m%d%H%M%S" strftime append ,
        "/Producer (Factor)" ,
        "/Author " "USER" os-env "unknown" or pdf-string append ,
        "/Creator (created with Factor)" ,
        ">>" ,
    ] { } make "\n" join ;

We will follow it with a catalog indicating which object (by reference) contains the list of pages.

: pdf-catalog ( -- str )
    {
        "<<"
        "/Type /Catalog"
        "/Pages 4 0 R"
        ">>"
    } "\n" join ;

As is typical of "text2pdf" programs, we will use the Courier monospace font. We can just refer to it by name since it is one of the fonts directly supported by the PDF specification.

: pdf-font ( -- str )
    {
        "<<"
        "/Type /Font"
        "/Subtype /Type1"
        "/BaseFont /Courier"
        ">>"
    } "\n" join ;

Each page is composed of two objects: a resource object and a contents object. We convert the number of pages into an index specifying the canvas size (US Letter - "612 by 792 points" or "8.5 by 11 inches") and a list of pages (or "kids") specified as references to each pages' resource object:

: pdf-pages ( n -- str )
    [
        "<<" ,
        "/Type /Pages" ,
        "/MediaBox [ 0 0 612 792 ]" ,
        [ "/Count %d" sprintf , ]
        [
            5 swap 2 range boa
            [ "%d 0 R " sprintf ] map concat
            "/Kids [ " "]" surround ,
        ] bi
        ">>" ,
    ] { } make "\n" join ;

The resources for each page n includes a reference to the contents object (n + 1) and the font object:

: pdf-page ( n -- page )
    [
        "<<" ,
        "/Type /Page" ,
        "/Parent 4 0 R" ,
        1 + "/Contents %d 0 R" sprintf ,
        "/Resources << /Font << /F1 3 0 R >> >>" ,
        ">>" ,
    ] { } make "\n" join ;

Pages are essentially objects which contain a stream of text operations. The stream is prefixed with the length of its contents:

: pdf-stream ( str -- str' )
    [ length 1 + "<<\n/Length %d\n>>" sprintf ]
    [ "\nstream\n" "\nendstream" surround ] bi append ;

The text operations that we will use to draw lines of text on each page:

  • BT - begin text
  • Td - location (where 0,0 is the bottom left of the page)
  • Tf - font and size
  • TL - line height
  • ' - insert newline and draw a line text
  • ET - end text
: pdf-text ( lines -- str )
    [
        "BT" ,
        "54 738 Td" ,
        "/F1 10 Tf" ,
        "12 TL" ,
        [ pdf-string "'" append , ] each
        "ET" ,
    ] { } make "\n" join pdf-stream ;

Using 10-point Courier font (6 points wide by 10 points tall), 12-points of line spacing, and 3/4 inch left and right margins: each page supports 57 lines of text (where each line is 84 characters long). We use splitting and grouping words to convert a string into pages of text:

: string>lines ( str -- lines )
    "\t" split "    " join string-lines
    [ [ " " ] when-empty ] map ;

: lines>pages ( lines -- pages )
    [ 84 <groups> ] map concat 57 <groups> ;

We can then take these "pages" and assemble PDF objects (including the info, catalog, font, and page index objects):

: pages>objects ( pages -- objects )
    [
        pdf-info ,
        pdf-catalog ,
        pdf-font ,
        dup length pdf-pages ,
        dup length 5 swap 2 range boa zip
        [ pdf-page , pdf-text , ] assoc-each
    ] { } make
    dup length [1,b] zip [ first2 pdf-object ] map ;

Given a list of objects, and a 9-byte %PDF-1.4 version header in the file, we can write a PDF complete with cross-reference table trailer and %%EOF "end-of-file" marker:

: pdf-trailer ( objects -- str )
    [
        "xref" ,
        dup length 1 + "0 %d" sprintf ,
        "0000000000 65535 f" ,
        9 over [
            over "%010X 00000 n" sprintf , length 1 + +
        ] each drop
        "trailer" ,
        "<<" ,
        dup length 1 + "/Size %d" sprintf ,
        "/Info 1 0 R" ,
        "/Root 2 0 R" ,
        ">>" ,
        "startxref" ,
        [ length 1 + ] map-sum 9 + "%d" sprintf ,
        "%%EOF" ,
    ] { } make "\n" join ;

: objects>pdf ( objects -- str )
    [ "\n" join "\n" append "%PDF-1.4\n" ]
    [ pdf-trailer ] bi surround ;

Putting it all together, we can convert a text string to PDF, and a text file to PDF:

: text-to-pdf ( str -- str' )
    string>lines lines>pages pages>objects objects>pdf ;

: file-to-pdf ( path encoding -- )
    [ file-contents text-to-pdf ]
    [ [ ".pdf" append ] dip set-file-contents ] 2bi ;

Running this code on itself, produces this PDF file (so you can see what it will look like).

As usual, the code is available on my Github.

Wednesday, October 13, 2010

Is Rotation?

I noticed an "interview question" that was posted on StackOverflow awhile ago. It's not particularly complicated -- basically asking "given two strings, how to tell if one is the rotated version of the other?"

Some discussion in the question deals with various faster methods, but the simplest answer is a Python version:

def isrotation(s1, s2):
    return len(s1) == len(s2) and s1 in 2*s2

If we wanted to implement this in Factor, we might want to consider using "short circuit" combinators (which will apply a series of boolean tests and stop on the first test that fails). We will also use the convention that a word? (ending in a "?") returns a boolean.

: rotation? ( s1 s2 -- ? )
    { [ [ length ] bi@ = ] [ dup append subseq? ] } 2&& ;

We can test it, to make sure it works:

( scratchpad ) "stack" "tacks" rotation? .
t

( scratchpad ) "foo" "bar" rotation? .
f

Since strings are sequences of characters and this solution uses sequence operations (length, append, and subseq?), it is already generalized to operate on other types of sequences. For example, arrays:

( scratchpad ) { 1 2 3 } { 2 3 1 } rotation? .
t

So, the next time you get this question in an interview, maybe you can solve it in Factor!