Monday, April 21, 2014

Scraping Re: Factor

For today's post, I thought it would be fun to build a little interface to my blog, Re: Factor. In addition, you can use this to easily scrape any Blogger website.

Scraping

A simple way to make URLs that are relative to my blog's domain:

: re-factor-url ( str -- url )
    "http://re-factor.blogspot.com/" prepend ;

Using that, we can build a URL that returns all of the posts as JSON objects:

: posts-url ( -- url )
    "feeds/posts/default?alt=json&max-results=200" re-factor-url ;

Note: we limit the results to 200 posts. Since this is my 190th post, that will work for a little while longer, but the limit will need to be bumped up in the future. :-)
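
If the hard-coded limit becomes a nuisance, one option is to pass it in as an argument. This is a minimal sketch rather than part of the original code: posts-url* is a hypothetical name, and re-factor-url is repeated so the snippet loads on its own.

```factor
USING: math.parser sequences ;

! Repeated from above so this snippet is self-contained.
: re-factor-url ( str -- url )
    "http://re-factor.blogspot.com/" prepend ;

! Hypothetical variant of posts-url: the caller chooses max-results.
: posts-url* ( n -- url )
    number>string
    "feeds/posts/default?alt=json&max-results=" prepend
    re-factor-url ;
```

For example, 500 posts-url* would build a URL asking for up to 500 posts; whether Blogger honors a limit that large is not something checked here.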

Retrieving all of the posts is easy: we fetch the URL with our HTTP client and parse the JSON response. Since my posts don't change that frequently, for convenience we will memoize the resulting list.

MEMO: all-posts ( -- posts )
    posts-url http-get nip json> { "feed" "entry" } [ of ] each ;
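
The { "feed" "entry" } [ of ] each idiom walks a path of keys through nested assocs, one level per key. Here is a minimal sketch with made-up data standing in for the parsed JSON (at-path is a hypothetical helper name, not something from the code above):

```factor
USING: assocs prettyprint sequences ;

! Hypothetical helper: follow each key in turn through nested assocs.
: at-path ( assoc keys -- value )
    [ of ] each ;

! Made-up stand-in for the parsed feed.
H{ { "feed" H{ { "entry" { "first post" "second post" } } } } }
{ "feed" "entry" } at-path .
```

Because all-posts is memoized, the feed is fetched only once until the cache is cleared; \ all-posts reset-memoized (from the memoize vocabulary) should drop the cached result if a fresh copy is needed.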

Displaying

A simple way to display a list of posts is to print each post's title as a link to that post's URL (allowing us to right-click to open URLs in the listener).

CONSTANT: post-style H{ { foreground COLOR: blue } }

: posts. ( -- )
    all-posts [
        [ "title" of "$t" of ] [ "link" of ] bi
        over '[ "title" of _ = ] find nip "href" of
        >url post-style [ write-object ] with-style nl
    ] each ;
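
The link lookup in posts. relies on fry: '[ "title" of _ = ] builds a quotation with the post title filled into the _ slot, and find returns the first link whose "title" matches. Here is a minimal sketch with made-up link data (link-titled is a hypothetical name, and real feed entries may carry more keys than shown):

```factor
USING: assocs fry kernel prettyprint sequences ;

! Hypothetical helper: href of the first link whose "title" equals title.
: link-titled ( links title -- href )
    '[ "title" of _ = ] find nip "href" of ;

! Made-up stand-in for one entry's "link" array.
{
    H{ { "rel" "self" } { "href" "http://example.com/feed/1" } }
    H{
        { "rel" "alternate" }
        { "title" "Scraping Re: Factor" }
        { "href" "http://example.com/scraping" }
    }
}
"Scraping Re: Factor" link-titled .
```

The first link has no "title" key, so of yields f for it and find moves on to the second link.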

For individual posts, we will use the html.parser.printer vocabulary to parse the HTML content and display it as text. The conversion to text right now is not perfect, but works okay for most things.

We print the title of the post and a dashed line underneath:

: post-title. ( post -- )
    { "title" "$t" } [ of ] each
    [ print ] [ length CHAR: - <string> print ] bi nl ;

We print the content by rendering the HTML into a string of text, then cleaning up extra whitespace and HTML escapes (using the new html.entities vocabulary), and wrapping the paragraphs.

: post-content. ( post -- )
    { "content" "$t" } [ of ] each
    parse-html html-text html-unescape string-lines [
        [ blank? not ] cut-when
        [ write ] [ 70 wrap-string print ] bi*
    ] each ;

Putting those together, to display a post:

: post. ( n -- )
    all-posts nth [ post-title. ] [ post-content. ] bi ;

The code for this is on my GitHub.
