Wednesday, February 3, 2010

Working with CGI: Part 3

In Part 2, we implemented simple parsing of QUERY_STRING and handling of the GET request method.

In getting to the conclusion, I skipped an important convention of HTTP application development: GET requests should be idempotent, meaning that repeating a request has the same effect as making it once. Because of this, and because form values sent with GET are visible in the URL, it is common practice to submit HTML forms with a POST request.

According to the POST convention, the request data is placed in the message body and usually "URL-encoded" (as described in RFC 2396). This is similar to how certain characters are "escaped" when included in a URL (for example, spaces are represented by %20).
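As a quick illustration of the escaping described above, here is a small sketch using the url-encode and url-decode words from the urls.encoding vocabulary. The sample string is only an example, and the outputs in the comments are what I would expect to see in the listener:

```factor
USING: prettyprint urls.encoding ;

! Percent-encode a string containing a space, then decode it again.
"hello world" url-encode .    ! "hello%20world"
"hello%20world" url-decode .  ! "hello world"
```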

To properly parse these types of POST requests, we will need to read a few other environment variables that are provided to the CGI script. First, we need a way to parse "content types". As described in RFC 2616 (the specification for HTTP/1.1), a content type consists of a media type (MIME type) and optional parameters.

For example, a server can specify that the response is HTML encoded as UTF-8 by including the following in the HTTP response headers:

Content-Type: text/html; charset=utf-8

To parse this, we can simply separate the mime-type and parse the parameters:

: (content-type) ( string -- params media/type )
    ";" split unclip [
        [ H{ } clone ] [ first (query-string) ] if-empty
    ] dip ;
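To check the stack effect, here is a small sketch written as unit tests. It assumes that (content-type) and (query-string) from Part 2 are already loaded in the listener; the input strings are just examples.

```factor
USING: tools.test ;

! No parameters: the params assoc comes back empty.
{ H{ } "text/html" } [ "text/html" (content-type) ] unit-test

! One parameter: (query-string) wraps its value in an array.
{ H{ { "charset" { "utf-8" } } } "text/html" }
[ "text/html;charset=utf-8" (content-type) ] unit-test
```

Note that the second input has no space after the semicolon. As far as I can tell, nothing in this word trims whitespace, so a header written with a space there would likely produce a key with a leading space.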

When we submit a form, the values are usually sent with a mime-type of "application/x-www-form-urlencoded": the parameters are URL-encoded in the message body, which the CGI script reads from its standard input. (Technically, some of the parameters could also be specified in the URL.)

We can define a function that will parse the CONTENT_LENGTH, read the specified number of bytes from the stream, and then assemble and parse the URL-encoded query string:

: (urlencoded) ( -- assoc )
    "CONTENT_LENGTH" os-env "0" or string>number
    read [ "" ] [ "&" append ] if-empty
    "QUERY_STRING" os-env [ append ] when* (query-string) ;

These two words are sufficient to parse POST requests. However, it's worth noting that some forms can be submitted with a mime-type of "multipart/form-data", which is used for uploading files to servers. We will put a placeholder word that can remind us to come back to this:

: (multipart) ( -- assoc )
    "multipart unsupported" throw ;

Now that we have that, we can implement the parsing routine:

: parse-post ( -- assoc )
    "CONTENT_TYPE" os-env "" or (content-type) {
        { "multipart/form-data"               [ (multipart) ] }
        { "application/x-www-form-urlencoded" [ (urlencoded) ] }
        [ drop parse-get ]
    } case nip ;

And then extend our <cgi-form> word to handle POST requests:

: <cgi-form> ( -- assoc )
    "REQUEST_METHOD" os-env "GET" or >upper {
        { "GET"  [ parse-get ] }
        { "POST" [ parse-post ] }
        [ "Unknown request method" throw ]
    } case ;
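One way to try this without a web server is to fake the CGI environment in the listener. The sketch below assumes the words defined so far are loaded, and uses set-os-env and unset-os-env from the environment vocabulary together with with-string-reader to stand in for the request body. The form data is made up for the example.

```factor
USING: environment io.streams.string prettyprint ;

! Simulate the environment a web server would set up for a form POST.
"POST" "REQUEST_METHOD" set-os-env
"application/x-www-form-urlencoded" "CONTENT_TYPE" set-os-env
"3" "CONTENT_LENGTH" set-os-env
"QUERY_STRING" unset-os-env

! Feed a 3-character body to the parser through the input stream.
"x=2" [ <cgi-form> ] with-string-reader .
! H{ { "x" { "2" } } }
```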

And it's as simple as that: we can now change our form method from "get" to "post", and our CGI scripts will continue to work.

Tuesday, February 2, 2010

Working with CGI: Part 2

In Part 1, we created a simple debugging script to print out the environment the CGI script is being executed within.

Many CGI scripts are simple "form handlers". These scripts take input via an HTML form and generate a dynamic response. We are going to write a CGI script that will be able to parse input from an HTML form submission.

The most common type of HTTP request is the GET method. The web browser sends a URL to the server and requests the server "get" the contents of the URL and send it back. When HTML forms are submitted using the GET method, the form elements are "URL encoded" and passed to the server as the "query string" part of the URL.

For example, if I had a "calculator" application to add two numbers (e.g., "x+y"), you could imagine getting the result of 2+3 by calling:


We need a word that will parse the QUERY_STRING and return a map of submitted parameters. Luckily, Factor has such a word in the urls.encoding vocabulary:

( scratchpad ) "x=2&y=3" query>assoc .
H{ { "x" "2" } { "y" "3" } }

For our use case, the query>assoc word isn't quite what we need. For one thing, it handles empty strings in an odd way:

( scratchpad ) "" query>assoc .
H{ { "" f } }

Also, it doesn't handle multiple inputs with the same name consistently with single inputs. If a parameter is represented in the query string multiple times, it will appear in the result as a list of values.

( scratchpad ) "a=2&a=3" query>assoc .
H{ { "a" { "2" "3" } } }

So to "fix" this, we will develop a word that filters out f values, and returns both single and multiple parameters as sequences.

: (query-string) ( string -- assoc )
    query>assoc [ nip ] assoc-filter
    [ dup string? [ 1array ] when ] assoc-map ;
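A few unit tests make the intended behavior concrete. This is only a sketch, assuming (query-string) is loaded in the listener and that query>assoc behaves as shown in the examples above:

```factor
USING: tools.test ;

! A single value is wrapped in a one-element array.
{ H{ { "x" { "2" } } } } [ "x=2" (query-string) ] unit-test

! Repeated names keep all of their values.
{ H{ { "a" { "2" "3" } } } } [ "a=2&a=3" (query-string) ] unit-test

! An empty query string yields an empty assoc.
{ H{ } } [ "" (query-string) ] unit-test
```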

Now that we have our building blocks, we can begin supporting the GET request. Let's start by designing the API. We want to parse the request method, handle the GET method, and return the parameters submitted. The REQUEST_METHOD and QUERY_STRING are available as environment variables:

: parse-get ( -- assoc )
    "QUERY_STRING" os-env "" or (query-string) ;

: <cgi-form> ( -- assoc )
    "REQUEST_METHOD" os-env "GET" or >upper {
        { "GET"  [ parse-get ] }
        [ "Unknown request method" throw ]
    } case ;

Since we will frequently need only the first value of each parameter (ignoring subsequent values if present), we can make a simpler version that can be used instead:

: <cgi-simple-form> ( -- assoc )
    <cgi-form> [ first ] assoc-map ;
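To see the difference between the two words, we can fake a GET request in the listener. This sketch assumes the definitions above are loaded and uses set-os-env from the environment vocabulary; the query string is made up for the example.

```factor
USING: environment prettyprint ;

! Pretend the web server handed us a GET request with a repeated parameter.
"GET" "REQUEST_METHOD" set-os-env
"a=2&a=3" "QUERY_STRING" set-os-env

<cgi-form> .         ! H{ { "a" { "2" "3" } } }
<cgi-simple-form> .  ! H{ { "a" "2" } }
```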

Putting all of this together, we can build something useful: a Brainfuck interpreter accessible from a web page!

USING: assocs brainfuck cgi formatting io kernel ;

"Content-type: text/html\n\n" write

"code" <cgi-simple-form> at
""" or dup get-brainfuck

<form method='get'>
<textarea id="text" name="code" cols="80" rows="15">
<input type="submit" value="Submit"> 
<input type="reset" value="Reset">
""" printf

If nothing is specified, this will happily calculate and then print "Hello World!"; otherwise, it will compute the result of the code provided.