Tuesday, April 22, 2014

Checksum Improvements

Just a quick update, some checksum improvements have been contributed to Factor.

Some new checksums have been implemented:

And some checksum performance has been improved:

  • checksums.md5 is a lot faster (benchmark is 0.080 vs 0.583 seconds)
  • checksums.sha is a bit faster (benchmark is 0.418 vs 0.686 seconds)

You can find these changes (and more!) in the development version of Factor.

Monday, April 21, 2014

Scraping Re: Factor

For today's post, I thought it would be fun to build a little interface to my blog, Re: Factor. In addition, you can use this to easily scrape any Blogger website.

Scraping

A simple way to make URLs that are relative to my blogs domain:

: re-factor-url ( str -- url )
    "http://re-factor.blogspot.com/" prepend ;

Using that to get a URL that returns all of the posts as JSON objects.

: posts-url ( -- url )
    "feeds/posts/default?alt=json&max-results=200" re-factor-url ;
Note: we limit the results to 200 posts. Since this is my 190th post, that will work for a little while longer but that limit might need to be bumped up in the future. :-)

Retrieving all of the posts is easy using our HTTP client and parsing the response. Since my posts don't change that frequently, for convenience we will memoize the list result.

MEMO: all-posts ( -- posts )
    posts-url http-get nip json> { "feed" "entry" } [ of ] each ;

Displaying

A simple way to display a list of posts is to display the title of each post and link it to the URL of each post (allowing us to right-click open URLs in the listener).

CONSTANT: post-style H{ { foreground COLOR: blue } }

: posts. ( -- )
    all-posts [
        [ "title" of "$t" of ] [ "link" of ] bi
        over '[ "title" of _ = ] find nip "href" of
        >url post-style [ write-object ] with-style nl
    ] each ;

For individual posts, we will use the html.parser.printer vocabulary to parse the HTML content and display it as text. The conversion to text right now is not perfect, but works okay for most things.

We print the title of the post and a dashed line underneath:

: post-title. ( post -- )
    { "title" "$t" } [ of ] each
    [ print ] [ length CHAR: - <string> print ] bi nl ;

We print the content by rendering the HTML into a string of text, then cleaning up extra whitespace and HTML escapes (using the new html.entities vocabulary), and wrapping the paragraphs.

: post-content. ( post -- )
    { "content" "$t" } [ of ] each
    parse-html [ html-text. ] with-string-writer
    html-unescape string-lines [
        [ blank? not ] cut-when
        [ write ] [ 70 wrap-string print ] bi*
    ] each ;

Putting those together, to display a post:

: post. ( n -- )
    all-posts nth [ post-title. ] [ post-content. ] bi ;

The code for this is on my GitHub.

Saturday, April 5, 2014

Speedtest

Many people are familiar with Speedtest.net, which is used to test a network connection, displaying download speeds, upload speeds, and server latency. Implemented as a Flash-based interface, it can be used from a web browser to verify your internet provider is giving you what you pay for.

You might not be aware that the speedtest-cli project provides a way to check internet speed from the command line in a similar manner.

I thought it might be fun to implement an interface to Speedtest.net using Factor:

Closest Servers

Speedtest provides a list of available servers all over the world that can be used for testing, returned as XML. After parsing the XML document, we use a utility method to extract attributes for each server into an array:

: attr-map ( tag -- attrs )
    attrs>> [ [ main>> ] dip ] H{ } assoc-map-as ;

: speedtest-servers ( -- servers )
    "http://www.speedtest.net/speedtest-servers.php"
    http-get nip string>xml
    "server" deep-tags-named [ attr-map ] map ;

Calculating the geographical distance between two points, specified by latitude and longitude:

: radians ( degrees -- radians ) pi * 180 /f ; inline

:: geo-distance ( lat1 lon1 lat2 lon2 -- distance )
    6371 :> radius ! km
    lat2 lat1 - radians :> dlat
    lon2 lon1 - radians :> dlon
    dlat 2 / sin sq dlon 2 / sin sq
    lat1 radians cos lat2 radians cos * * + :> a
    a sqrt 1 a - sqrt fatan2 2 * :> c
    radius c * ;

This lets us find the closest server to a given geographic location:

: lat/lon ( assoc -- lat lon )
    [ "lat" of ] [ "lon" of ] bi [ string>number ] bi@ ;

: server-distance ( server lat lon -- server )
    '[ lat/lon _ _ geo-distance "distance" ] keep
    [ set-at ] keep ;

: closest-servers-to ( lat lon -- servers )
    [ speedtest-servers ] 2dip '[ _ _ server-distance ] map
    [ "distance" of ] sort-with ;

The available Speedtest configuration provides our latitude and longitude, allowing us to sort the server list by geographic distance:

TUPLE: config client times download upload ;

C: <config> config

: speedtest-config ( -- config )
    "http://www.speedtest.net/speedtest-config.php"
    http-get nip string>xml {
        [ "client" deep-tag-named attr-map ]
        [ "times" deep-tag-named attr-map ]
        [ "download" deep-tag-named attr-map ]
        [ "upload" deep-tag-named attr-map ]
    } cleave <config> ;

: closest-servers ( -- servers )
    speedtest-config client>> lat/lon closest-servers-to ;

Best Server

We can calculate latency by downloading a small latency.txt file and timing how long it takes:

: (server-latency) ( server -- ms )
    "url" of >url URL" latency.txt" derive-url
    [ http-get nip "test=test\n" = ] benchmark 1,000,000 /f
    3,600,000 ? ;

After calculating latency, we save it for later use:

: server-latency ( server -- server )
    [ (server-latency) "latency" ] keep [ set-at ] keep ;

The "best" server that we will use for testing is the one with the lowest latency, checking the five closest servers to our location:

: best-server ( -- server )
    closest-servers 5 short head
    [ server-latency ] parallel-map
    [ "latency" of ] sort-with first ;

Upload Speed

To calculate upload speed, we upload several document sizes (filling the content with zeroes) and time how long it takes:

: upload-data ( size -- data )
    9 - CHAR: 0 <string> "content1=" prepend ;

: (upload-speed) ( server -- Mbps )
    "url" of >url { 250,000 500,000 } [
        [
            upload-data [ swap http-put 2drop ] keep length
        ] with map-sum
    ] benchmark 1,000,000,000 /f / 8 * 1,000,000 / ;

After calculating upload speed, we save it for later use:

: upload-speed ( server -- server )
    [ (upload-speed) "upload" ] keep [ set-at ] keep ;

Download Speed

To calculate download speed, we download several files with varying sizes in parallel and time how long it takes:

: download-urls ( server -- urls )
    "url" { 350 500 750 1000 } 
    [ dup "random%sx%s.jpg" sprintf >url derive-url ] with map ;

: (download-speed) ( server -- Mbps )
    download-urls 4 swap <array> [
        [ [ http-get nip length ] map-sum ] parallel-map sum
    ] benchmark 1,000,000,000 /f / 8 * 1,000,000 / ;

After calculating download speed, we save it for later use:

: download-speed ( server -- server )
    [ (download-speed) "download" ] keep [ set-at ] keep ;

Text Results

With all of that built, we can build a word to run a Speedtest, printing out the results as text:

: run-speedtest ( -- server )
    "Selecting best server based on ping..." print flush
    best-server dup {
        [ "sponsor" of ]
        [ "name" of ]
        [ "distance" of ]
        [ "latency" of ]
    } cleave "Hosted by %s (%s) [%0.2f km]: %s ms\n" printf
    "Testing download speed" print flush download-speed
    dup "download" of "Download: %0.2f Mbit/s\n" printf
    "Testing upload speed" print flush upload-speed
    dup "upload" of "Upload: %0.2f Mbit/s\n" printf ;

Graphic Results

It would be nice if we could show the reports graphically, and as it turns out, its not too hard. We just have to upload the results to speedtest.net in the same way their Flash application does, and then display the image that is created for you.

: make-result ( server -- result )
    [
        {
            [ "download" of 1,000 * >integer "download" ,, ]
            [ "latency" of >integer "ping" ,, ]
            [ "upload" of 1,000 * >integer "upload" ,, ]
            [ drop "" "promo" ,, ]
            [ drop "pingselect" "startmode" ,, ]
            [ "id" of "recommendedserverid" ,, ]
            [ drop "1" "accuracy" ,, ]
            [ "id" of "serverid" ,, ]
            [
                [ "latency" of ]
                [ "upload" of 1,000 * ]
                [ "download" of 1,000 * ] tri
                "%d-%d-%d-297aae72" sprintf md5 checksum-bytes
                hex-string "hash" ,,
            ]
        } cleave
    ] { } make ;

: submit-result ( server -- result-id )
    make-result "http://www.speedtest.net/api/api.php"
    <post-request> [
        [
            "http://c.speedtest.net/flash/speedtest.swf"
            "referer"
        ] dip header>> set-at
    ] keep http-request nip query>assoc "resultid" of ;

Speedtest

Putting this all together, we can run the Speedtest, submit the results, then display the test results as an image.

: speedtest ( -- )
    run-speedtest submit-result "Share results: " write
    "http://www.speedtest.net/result/%s.png" sprintf
    [ dup >url write-object nl ] [ http-image. ] bi ;

Some things that I would like to improve:

  • The Speedtest configuration actually specifies the details of download and upload sizes, the amount of parallelism, and the duration of the test, we should use it.
  • The http-get word needs an overall timeout so we can scale between very slow and very fast connection speeds.
  • The Speedtest graphical result images are "retina" when viewed in the web browser, but are not when downloaded from Factor or wget.
  • Factor needs an easier way to create a queue of work that is processed by several worker threads, for convenience I just used one of the concurrent combinators.

The code for this is on my GitHub.

Sunday, February 9, 2014

inet_ntoa and inet_aton

I was reading an article about micro optimizing int to IP address conversions. The author was trying to convert a 32-bit integer representation of an IP address into the more typical string representation using Common Lisp.

This is basically what the standard C library functions inet_ntoa and inet_aton do. I thought it might be fun to implement this in Factor and compare performance with the C versions.

alien

First, lets use the alien FFI vocabulary to allow the C functions to be called from Factor:

FUNCTION: c-string inet_ntoa ( uint32_t addr ) ;

FUNCTION: int inet_aton ( c-string s, uint32_t *addr ) ;

We can call inet_ntoa directly, but to call inet_aton, we need a simple wrapper that calls it, preserves the result, and checks for success or failure:

: inet-aton ( x -- y )
    { uint32_t } [ inet_aton 1 assert= ] with-out-parameters ;

We can test to see that it works:

IN: scratchpad 81952074 inet_ntoa .
"74.125.226.4"

IN: scratchpad "74.125.226.4" inet-aton .
81952074

Parsing 1 million integers with inet_ntoa takes 1.016 seconds.

Parsing 1 million IP addresses with inet_aton takes 0.626 seconds.

simple

Okay, what if we want to implement these ourselves?

Note: Unlike inet_ntoa (which is in network byte order), we will assume little endian like the original author that spawned this adventure.

Converting an integer into an IP address string by taking each octet of the 32-bit number (based on the pseudocode used in the original article):

: ipv4-ntoa ( integer -- ip )
    { 0x1000000 0x10000 0x100 0x1 }
    [ /i 8 bits number>string ] with map "." join ;

Converting an IP address into an integer is as easy as splitting on the dots and performing the reverse operation:

: ipv4-aton ( ip -- integer )
    "." split [ string>number ] map
    { 0x1000000 0x10000 0x100 0x1 } v. ;

We can test that it works:

IN: scratchpad 1249763844 ipv4-ntoa .
"74.125.226.4"

IN: scratchpad "74.125.226.4" ipv4-aton .
1249763844

Parsing 1 million integers with ipv4-ntoa takes 0.653 seconds.

Parsing 1 million IP addresses with ipv4-aton takes 0.738 seconds.

faster

In the spirit of the original article, we will try some micro-optimizations (with some corresponding loss in readability) including type annotations.

Our versions use the generalized number parsing words, string>number and number>string. Specialized (less general) versions can give us additional performance:

TYPED: byte>string ( byte: fixnum -- str )
    $[ 256 iota [ number>string ] map ] nth ;

TYPED: string>byte ( str: string -- byte )
    0 [ [ 10 * ] dip CHAR: 0 - + ] reduce ;

We make a few other changes to use shifting and a slightly different approach:

TYPED: ipv4-ntoa2 ( integer: fixnum -- ip )
    { -24 -16 -8 0 } [ 8 shift-mod byte>string ] with map
    "." join ;

TYPED: ipv4-aton2 ( ip: string -- integer )
    "." split { 24 16 8 0 }
    [ [ string>byte ] dip shift ] [ + ] 2map-reduce ;

Parsing 1 million integers with ipv4-ntoa2 takes 0.436 seconds!

Parsing 1 million IP addresses with ipv4-aton2 takes 0.496 seconds!

fastest

If we really want to do more micro-optimizations, and produce some ugly but fast code, then we can change the code to ensure more fixnum operations:

TYPED: byte>string2 ( byte: fixnum -- str )
    $[ 256 iota [ number>string ] map ] nth-unsafe ;

TYPED: string>byte2 ( str: string -- byte )
    [ length iota 0 ] keep [
        string-nth-fast
        [ 10 fixnum*fast ] dip CHAR: 0 fixnum-fast fixnum+fast
    ] curry reduce ;

TYPED: ipv4-ntoa3 ( integer: fixnum -- ip )
    $[ { -24 -16 -8 0 } [ [ 8 shift-mod ] curry ] map ] cleave
    [ byte>string2 ] 4 napply 4array "." join ;

TYPED: ipv4-aton3 ( ip: string -- integer )
    "." split first4 [ string>byte2 ] 4 napply
    [ 24 16 8 [ fixnum-shift-fast ] tri-curry@ tri* ] dip
    [ fixnum+fast ] tri@ ;

Parsing 1 million integers with ipv4-ntoa3 takes 0.285 seconds!

Parsing 1 million IP addresses with ipv4-aton3 takes 0.355 seconds!

I have committed more complete versions of ipv4-ntoa and ipv4-aton (with support for parsing IPv4 addresses) to the development version of Factor.

Wednesday, January 22, 2014

Caesar Cipher

A Caesar cipher is a very simple encryption technique where each letter is shifted a fixed number of characters in the alphabet. It is named after Julius Caesar, who apparently used this technique in some of his letters.

For example, if we were to encode "FACTOR" by shifting each character to the right by three letters, we would get "IDFWRU". The "F" shifts to "I", the "A" shifts to "D", the "C" shifts to "F", etc.

Let's implement this in Factor!

First, we implement a word to shift a character (uppercase by convention) a specified number of letters. Using "A" as our "zero" point by subtracting, shifting modulo 26 character ascii alphabet, then re-adding the ascii value for "A" (65):

: caesar-shift ( ch n -- ch' )
    [ CHAR: A - ] dip + 26 rem CHAR: A + ;

Next, a word for shifting every letter in a string (preserving numbers and punctuation):

: caesar-map ( str n -- str' )
    '[ dup CHAR: A CHAR: Z between? [ _ caesar-shift ] when ] map ;

This lets us implement the encrypt and decrypt words. Encrypting is simple and decrypting is mapping with a negative shift number:

: caesar-encrypt ( plain n -- encrypted ) caesar-map ;

: caesar-decrypt ( encrypted n -- plain ) neg caesar-map ;

Trying it out:

IN: scratchpad "HELLO, WORLD!" 3 caesar-encrypt .
"KHOOR, ZRUOG!"

IN: scratchpad "KHOOR, ZRUOG!" 3 caesar-decrypt .
"HELLO, WORLD!"

Friday, December 13, 2013

Monte Carlo

The Monte Carlo method is a method of estimation that uses random simulation to solve physical or mathematical problems. Named after the Monte Carlo Casino, you can think of the method as playing a game of chance many times and recording the outcomes (such as how frequently one wins or loses).

One classic example from the Wikipedia article is estimating the value of π (although there are many other ways to approximate the value of π):

For example, consider a circle inscribed in a unit square. Given that the circle and the square have a ratio of areas that is π/4, the value of π can be approximated using a Monte Carlo method:

  1. Draw a square on the ground, then inscribe a circle within it.
  2. Uniformly scatter some objects of uniform size (grains of rice or sand) over the square.
  3. Count the number of objects inside the circle and the total number of objects.
  4. The ratio of the two counts is an estimate of the ratio of the two areas, which is π/4. Multiply the result by 4 to estimate π.

You can visualize how this works by seeing the estimate get more correct as you increase the number of points:

We can generate a random point inside the unit circle by generating two random-unit values:

: random-point ( -- x y )
    random-unit random-unit ;

Using the Pythagorean theorem, we can calculate the distance from the zero point. If the distance is less than or equal to 1.0, then it is inside the circle:

: inside-circle? ( x y -- ? )
    [ sq ] bi@ + sqrt 1.0 <= ;

We can then estimate the value of π by computing a number of points, taking the percentage that are inside the circle and multiplying by 4:

: estimate-pi ( points -- pi-estimate )
    0 swap [
        [ random-point inside-circle? [ 1 + ] when ] times
    ] keep /f 4 * ;

We can run this for varying numbers of points and see how it gets more accurate:

IN: scratchpad { 100 1,000 10,000 100,000 1,000,000 10,000,000 }
               [ estimate-pi . ] each
3.2
3.168
3.162
3.15176
3.14288
3.1418212

Tuesday, December 10, 2013

UU Encoding

Just a quick note, as of a couple months ago, Factor has support for uuencoding (and uudecoding)!

You can perform a uuencode:

IN: scratchpad "Factor" string>uu print
begin
&1F%C=&]R
end

...and also a uudecode:

IN: scratchpad """
               begin 644 factor.txt
               &1F%C=&]R
               end
               """ uu>string .
"Factor"

Right now, it operates on text directly and doesn't preserve the file name and permissions in the begin header, but that would be an easy improvement.

The code for this is available in the development version of Factor.