Monday, July 4, 2011

Fourth of July

Today is Independence Day in the United States. Several lists of fantastic facts are being shared, including that today is the "biggest beer-selling holiday of the year". However, while not as fantastic, here are a few fun facts about today that can be derived using Factor.

Using the calendar vocabulary, we can create a timestamp object for today's date:

( scratchpad ) 2011 7 4 <date> .
T{ timestamp
    { year 2011 }
    { month 7 }
    { day 4 }
    { gmt-offset T{ duration { hour -7 } } }
}

We can see that today is the 185th day of the year:

( scratchpad ) 2011 7 4 <date> day-of-year .
185

Today also falls in the 27th week of the year (using weeks starting on Sunday; there might be a different answer using week-of-year-monday):

( scratchpad ) 2011 7 4 <date> week-of-year-sunday .
27

This year, July 4th falls on a Monday, but it's a different day of the week each year. We can see which days of the week Independence Day has fallen on over the last several years:

( scratchpad ) 2011 2000 [a,b] [
                   dup 7 4 <date> day-name "%s: %s\n" printf
               ] each
2011: Monday
2010: Sunday
2009: Saturday
2008: Friday
2007: Wednesday
2006: Tuesday
2005: Monday
2004: Sunday
2003: Friday
2002: Thursday
2001: Wednesday
2000: Tuesday

And, finally, the most recent version of Factor (0.94) was released on September 18, 2010, which was 289 days ago.

( scratchpad ) 2011 7 4 <date> 2010 9 18 <date>
               time- duration>days .
289

Wednesday, June 29, 2011

TF-IDF

A few days ago, someone implemented a concise TF-IDF ("term frequency - inverse document frequency") search engine in Scala. That was followed by a similarly concise version in Clojure. I thought I would contribute a simple (and hopefully also concise) implementation using Factor.

USING: accessors arrays assocs combinators.short-circuit fry
io.encodings.utf8 io.files kernel math math.functions
math.statistics memoize sequences sets sorting splitting
unicode.case unicode.categories ;

IN: tf-idf

Tokenize

Since our inputs are going to be text files, our program will need to:

  1. Read the file into a string.
  2. Split the string into words.
  3. Eliminate common (or "stop") words.

First, let's build a word to split our text on any character that is not a letter or a digit.

: split-words ( string -- words )
    [ { [ Letter? ] [ digit? ] } 1|| not ] split-when harvest ;

We load a list of "stop words" that are to be ignored. These are commonly occurring words such as "and, in, or, of, the, is".

MEMO: stopwords ( -- words )
    "vocab:tf-idf/stopwords.txt" utf8 file-lines fast-set ;

We can then tokenize a piece of text, removing all of the "stop words".

: tokenize ( string -- words )
    >lower split-words [ stopwords in? not ] filter ;
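
As a quick check, tokenizing a short string in the listener should yield something like this (assuming common words such as "the" and "over" appear in the stop word list):

( scratchpad ) "The quick brown fox jumps over the lazy dog" tokenize .
{ "quick" "brown" "fox" "jumps" "lazy" "dog" }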

And, finally, we can create a word to tokenize a series of files into an assoc mapping path to words.

: tokenize-files ( paths -- assoc )
    [ dup utf8 file-contents tokenize ] H{ } map>assoc ;

Index

To implement our search engine, we need to build an index that maps each word to a list of (path, count) pairs.

: index1 ( path words -- path index )
    histogram [ pick swap 2array ] assoc-map ;

: index-all ( assoc -- index )
    [ index1 ] assoc-map values assoc-merge ;

The tokenized files and our index will form the basis for a "database":

TUPLE: db docs index ;

: <db> ( paths -- db )
    tokenize-files dup index-all db boa ;

TF-IDF

The "inverse document frequency" is calculated by the log of the total number of documents divided by number of documents where a particular word appears.

: idf ( term db -- idf )
    [ nip docs>> ] [ index>> at ] 2bi [ assoc-size ] bi@ / log ;

Using this, we can create "TF-IDF" scores by multiplying the number of times a term appears in each document by the previously calculated "inverse document frequency":

: tf-idf ( term db -- scores )
    [ index>> at ] [ idf ] 2bi '[ _ * ] assoc-map ;

Search

Our search engine is now just a matter of:

  • Splitting an input query into terms.
  • Calculating the "TF-IDF" score for each term.
  • Normalizing the scores across documents.
  • Sorting the scores from highest to lowest.

: scores ( query db -- scores )
    [ split-words ] dip '[ _ tf-idf ] map assoc-merge ;

: (normalize) ( path db -- value )
    [ docs>> at ] keep '[ _ idf 2 ^ ] map-sum sqrt ;

: normalize ( scores db -- scores' )
    '[ sum over _ (normalize) / ] assoc-map ;

: search ( query db -- scores )
    [ scores ] keep normalize sort-values reverse ;
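
Putting it all together, building a database from a handful of files and searching it looks something like this (the file paths here are just hypothetical):

( scratchpad ) "rocket" { "one.txt" "two.txt" "three.txt" } <db> search .

This prints an association list of documents and scores, sorted from the best match to the worst.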

Notes

The implementation above uses an assoc-merge utility word that performs a "union" of several assocs, collecting a list of all values for each key.

: (assoc-merge) ( assoc1 assoc2 -- assoc1 )
    over [ push-at ] with-assoc assoc-each ;

: assoc-merge ( seq -- merge )
    H{ } clone [ (assoc-merge) ] reduce ;

Also, searching for a word that doesn't appear in any document produces a "divide by zero" error. Making it more robust would require either catching and ignoring that error, or "adding 1" to the denominator of the "IDF" formula.
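
As a rough sketch of the second option (this is not part of the original code), the denominator can simply be incremented before dividing, so a term found in zero documents no longer divides by zero:

: idf ( term db -- idf )
    [ nip docs>> ] [ index>> at ] 2bi [ assoc-size ] bi@
    1 + / log ;    ! "add 1" to the document count in the denominator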

The code for this is on my Github.

Friday, June 10, 2011

Yahoo! Finance

Many programmers work for publicly traded companies, and probably spend more time than they realize watching their (or their friends') company's stock prices. To make that easier, I wanted to implement a simple wrapper to retrieve prices from Yahoo! Finance.

Quotes

You can use a "quotes.csv" interface to retrieve current price information. The way it works is to perform an HTTP request to a specially formatted URL that looks like:

http://finance.yahoo.com/d/quotes.csv?s=SYMBOLS&f=FIELDS

In the URL, SYMBOLS is a list of symbols separated by "+" and FIELDS is a list of letters and numbers (from the following table) representing fields to be requested.


a Ask | a2 Average Daily Volume | a5 Ask Size
b Bid | b2 Ask (Real-time) | b3 Bid (Real-time)
b4 Book Value | b6 Bid Size | c Change & Percent Change
c1 Change | c3 Commission | c6 Change (Real-time)
c8 After Hours Change (Real-time) | d Dividend/Share | d1 Last Trade Date
d2 Trade Date | e Earnings/Share | e1 Error Indication (returned for symbol changed / invalid)
e7 EPS Estimate Current Year | e8 EPS Estimate Next Year | e9 EPS Estimate Next Quarter
f6 Float Shares | g Day’s Low | h Day’s High
j 52-week Low | k 52-week High | g1 Holdings Gain Percent
g3 Annualized Gain | g4 Holdings Gain | g5 Holdings Gain Percent (Real-time)
g6 Holdings Gain (Real-time) | i More Info | i5 Order Book (Real-time)
j1 Market Capitalization | j3 Market Cap (Real-time) | j4 EBITDA
j5 Change From 52-week Low | j6 Percent Change From 52-week Low | k1 Last Trade (Real-time) With Time
k2 Change Percent (Real-time) | k3 Last Trade Size | k4 Change From 52-week High
k5 Percent Change From 52-week High | l Last Trade (With Time) | l1 Last Trade (Price Only)
l2 High Limit | l3 Low Limit | m Day’s Range
m2 Day’s Range (Real-time) | m3 50-day Moving Average | m4 200-day Moving Average
m5 Change From 200-day Moving Average | m6 Percent Change From 200-day Moving Average | m7 Change From 50-day Moving Average
m8 Percent Change From 50-day Moving Average | n Name | n4 Notes
o Open | p Previous Close | p1 Price Paid
p2 Change in Percent | p5 Price/Sales | p6 Price/Book
q Ex-Dividend Date | r P/E Ratio | r1 Dividend Pay Date
r2 P/E Ratio (Real-time) | r5 PEG Ratio | r6 Price/EPS Estimate Current Year
r7 Price/EPS Estimate Next Year | s Symbol | s1 Shares Owned
s7 Short Ratio | t1 Last Trade Time | t6 Trade Links
t7 Ticker Trend | t8 1 yr Target Price | v Volume
v1 Holdings Value | v7 Holdings Value (Real-time) | w 52-week Range
w1 Day’s Value Change | w4 Day’s Value Change (Real-time) | x Stock Exchange
y Dividend Yield

Using this URL format, we can build a word that retrieves current quotes for a list of symbols:

: quotes ( symbols -- csv )
    "http://finance.yahoo.com/d/quotes.csv" >url
        swap "+" join "s" set-query-param
        "sbal1v" "f" set-query-param
    http-get nip >string string>csv
    { "Symbol" "Bid" "Ask" "Last" "Volume" } prefix ;

With the strings.table vocabulary, we can format the response as a table:

( scratchpad ) { "MSFT" "GOOG" "AAPL" } quotes
               format-table [ print ] each
Symbol Bid    Ask    Last    Volume
MSFT   23.64  23.69  23.705  49327104
GOOG   505.00 509.05 509.505 2440475
AAPL   N/A    334.80 325.90  15504200

Historical Prices

You can also retrieve historical prices using a "table.csv" interface. Similar to retrieving quotes, you make an HTTP request to a special URL:

http://ichart.finance.yahoo.com/table.csv?s=SYMBOL

In the URL, SYMBOL is the symbol that you are requesting prices for, and you can further limit the response using additional parameters:

Start date for historical prices:

  • a - Month number, starting with 0 for January.
  • b - Day number, e.g., 1 for the first of the month.
  • c - Year.

End date for historical prices (default is the most current available closing price):

  • d - Month number, starting with 0 for January.
  • e - Day number, e.g., 1 for the first of the month.
  • f - Year.

And finally, the frequency of historical prices:

  • g - Possible values are 'd' for daily (the default), 'w' for weekly, and 'm' for monthly.

With this knowledge, we can build a word to retrieve historical prices from January 1, 2009 until the current day.

: historical-prices ( symbol -- csv )
    "http://ichart.finance.yahoo.com/table.csv" >url
       swap "s" set-query-param
       "0" "a" set-query-param
       "1" "b" set-query-param
       "2009" "c" set-query-param
    http-get nip string>csv ;

To use it, and demonstrate the impact that AAPL has had over the last few years, we can chart the daily closing prices (remembering to reverse the order of the prices, oldest to newest):
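
The chart isn't reproduced here, but as a rough sketch of pulling out those closing prices (assuming the usual Date, Open, High, Low, Close, Volume, Adj Close column order in the CSV), something like this would work:

USING: kernel math.parser sequences ;

: closing-prices ( symbol -- prices )
    historical-prices rest            ! drop the CSV header row
    [ 4 swap nth string>number ] map  ! keep the "Close" column
    reverse ;                         ! oldest to newest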


The code for this is on my Github.

Tuesday, May 17, 2011

Fibonacci Wars

Inspired by a recent article about the "worst algorithm in the world" for calculating Fibonacci numbers using Python, I wanted to show a few implementations in Factor.

Recursive

Most people should be familiar with the simple recursive definition of the Fibonacci numbers:

    F(n) = F(n-1) + F(n-2), with F(0) = 0 and F(1) = 1

A naive recursive implementation of this definition has exponential running time, since the same subproblems are recomputed over and over, making it unsuitable for larger Fibonacci numbers. Thankfully, it is easy to add caching using the memoize vocabulary:

MEMO: slow-fib ( m -- n )
    dup 0 >= [ throw ] unless
    dup 2 >= [
        [ 2 - slow-fib ] [ 1 - slow-fib ] bi +
    ] when ;

Iterative

A better way is an iterative solution that simply keeps the current and previous Fibonacci numbers on the stack, adding them the specified number of times:

: okay-fib ( m -- n )
    dup 0 >= [ throw ] unless
    [ 0 1 ] dip [ [ + ] [ drop ] 2bi ] times drop ;

Magical

A much better way (introduced by the original article) takes advantage of a matrix expression for the Fibonacci numbers:

    [[1, 1], [1, 0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]

Raising this matrix to the n-th power by repeated squaring (following the binary expansion of n) yields F(n) in O(log n) steps. Using this, we can implement a faster version in Factor (using locals to make it a bit more readable):

:: fast-fib ( m -- n )
    m 0 >= [ m throw ] unless
    m 2 >base [ CHAR: 1 = ] { } map-as :> bits
    1 :> a! 0 :> b! 1 :> c!
    bits [
        [
            a c + b *
            b sq c sq +
        ] [
            a sq b sq +
            a c + b *
        ] if b! a! a b + c!
    ] each b ;
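
All three words should agree, which is easy to spot-check in the listener:

( scratchpad ) 25 [ slow-fib ] [ okay-fib ] [ fast-fib ] tri 3array .
{ 75025 75025 75025 }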

Performance

Some simple performance numbers show how much faster the "magical" version is, particularly for calculating large Fibonacci numbers (such as the one-millionth number in the sequence).

           10,000   100,000   1,000,000
slow-fib   0.0097   --        --
okay-fib   0.0053   0.2560    25.9608
fast-fib   0.0001   0.0076    0.4851

It's worth pointing out that Factor is about as fast as Python 2.6 for the "magical" calculation. In Python 2.7 and later, certain improvements were made to its large number support, resulting in Python being about 3x faster.

However, if you want real performance, you can use C and the GNU Multiple Precision Arithmetic Library to implement the iterative algorithm (about 3 times faster than Factor), the magical algorithm (about 28 times faster than Factor), or use the builtin mpz_fib_ui function (about 44 times faster than Factor).

The code for this (as well as several versions in C) is on my Github.

Sunday, May 15, 2011

Google Search

A couple months ago, I implemented wrappers for the Google Charts and Google Translate APIs. Today, I'd like to implement a wrapper for Google Search.

Search

First, we should build a word that, given a query, returns a URL to retrieve Google Search results (formatted as JSON):

: search-url ( query -- url )
    URL" http://ajax.googleapis.com/ajax/services/search/web"
        "1.0" "v" set-query-param
        swap "q" set-query-param
        "8" "rsz" set-query-param
        "0" "start" set-query-param ;

We can define a tuple class to hold the attributes every search result should have:

TUPLE: search-result cacheUrl GsearchResultClass visibleUrl
title content unescapedUrl url titleNoFormatting ;

Using some code to set attributes dynamically, we can perform a Google Search and parse the results into a sequence of search-result objects.

: http-search ( query -- results )
    search-url http-get nip json>
    { "responseData" "results" } [ swap at ] each
    [ \ search-result from-slots ] map ;
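
The from-slots word used above is a small helper that isn't shown in this post. A minimal sketch of one way to write it, treating the new tuple as an assoc of slot names via the mirrors vocabulary, might look like this (the actual helper may differ):

USING: assocs fry kernel mirrors ;

: from-slots ( assoc class -- obj )
    new [ <mirror> '[ swap _ set-at ] assoc-each ] keep ;

Each key in the JSON result (e.g., "title") is then stored into the tuple slot of the same name.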

Display

We can build some simple words that can be used to format the output of a search:

: write-heading ( str -- )
    H{ 
        { font-size 14 }
        { background COLOR: light-gray }
    } format nl ;

: write-title ( str -- )
    H{ 
        { foreground COLOR: blue }
    } format nl ;

: write-content ( str -- )
    60 wrap-string print ;

: write-url ( str -- )
    dup >url H{ 
        { font-name "monospace" }
        { foreground COLOR: dark-green }
    } [ write-object ] with-style nl ;

And then create a word to perform a search and display the results. If you are using my webbrowser vocabulary, you can open the URLs in a web browser directly from Factor.

: http-search. ( query -- )
    [ "Search results for '%s'" sprintf write-heading nl ]
    [ http-search ] bi [
        {
            [ titleNoFormatting>> write-title ]
            [ content>> write-content ]
            [ unescapedUrl>> write-url ]
        } cleave nl
    ] each ;

Try it out

If everything works correctly, you should be able to perform a search in the Listener (e.g., searching for "factor"):


Some things we might do to improve this:

  • add paging to the next or previous page of search results
  • highlight in bold the searched words in the content
  • unescape the HTML entities in the content
  • build a lightweight GUI to render the results

The code for this is on my Github.

Thursday, May 5, 2011

Open URL (in Listener)

In January, I implemented a cross-platform open-url feature to allow opening URLs from Factor. One major problem was that I couldn't figure out how to make URLs "clickable" in the Listener.

Then, I learned that the ui.operations vocabulary allows us to add "capabilities" that get triggered when certain types are rendered in the Listener. It turns out to be pretty easy, actually:

[ url? ] \ open-url H{ } define-operation

Using that, you can right-click on URL objects, and choose to open them in your web browser:


I also made a small patch to add this support to URLs that are rendered in the help system or in the slides vocabulary.

Note: if you want this operation to be the "primary operation" triggered when the URL is left-clicked, you can add { +primary+ t } to the hashtable used in the operation's definition.
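
That is, the definition above would become:

[ url? ] \ open-url H{ { +primary+ t } } define-operation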

Sunday, May 1, 2011

Reddit Stats

I thought it would be fun to track the development of the Factor programming language using Reddit. Of course, to do this, I wanted to use Factor.

A few months ago, I implemented a Reddit "Top" program. We can use that code as a basis for showing, using Reddit scores, how the activity (or popularity?) of a couple of Factor blogs has changed over the last few years:


Stories

We start by creating a story tuple that will hold all of the properties that Reddit returns for each posted link (including the score -- up votes minus down votes -- which we will be using).

TUPLE: story author clicked created created_utc domain downs
hidden id is_self levenshtein likes media media_embed name
num_comments over_18 permalink saved score selftext
selftext_html subreddit subreddit_id thumbnail title ups url ;

And a word to parse a "data" result into a story object:

: parse-story ( assoc -- obj )
    "data" swap at \ story from-slots ;

Paging

The Reddit API provides results as a series of "pages" (until all results are exhausted). This is pretty similar to how the website works, so it should be fairly easy to understand the mechanics. We will define a page object that holds the current URL, the results from the current page, and links to the pages before and after the current page:

TUPLE: page url results before after ;

We can then build a word to fetch the page (as a JSON response), parse it, and construct a page object:

: json-page ( url -- page )
    >url dup http-get nip json> "data" swap at {
        [ "children" swap at [ parse-story ] map ]
        [ "before" swap at [ f ] when-json-null ]
        [ "after" swap at [ f ] when-json-null ]
    } cleave \ page boa ;

An easy "paging" mechanism that, given a page of results, can return the next page:

: next-page ( page -- page' )
    [ url>> ] [ after>> "after" set-query-param ] bi json-page ;

And, using the make vocabulary, we can collect all of the results into a single sequence:

: all-pages ( page -- results )
    [
        [ [ results>> , ] [ dup after>> ] bi ]
        [ next-page ] while drop
    ] { } make concat ;

Domain Stats

We can retrieve all of the results and produce a map from the year each link was posted to the sum of the scores for that year's links:

: domain-stats ( domain -- stats )
    "http://api.reddit.com/domain/%s" sprintf json-page all-pages
    [ created>> 1000 * millis>timestamp year>> ] group-by
    [ [ score>> ] map-sum ] assoc-map ;

Finally, a word to display a chart of the scores across a given list of domains:

: domains. ( domains -- )
    H{ } clone [
        '[ domain-stats [ swap _ at+ ] assoc-each ] each
    ] keep >alist sort-keys <bar> 160 >>height chart. ;
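
For example, a chart like the one above could be produced by passing a couple of blog domains (shown here purely for illustration):

( scratchpad ) { "re-factor.blogspot.com" "factor-language.blogspot.com" } domains.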

The code for this is available on my Github.