Wednesday, February 3, 2010

Working with CGI: Part 3

In Part 2, we implemented simple parsing of QUERY_STRING and handling of the GET request method.

In getting to the conclusion, I skipped describing an important convention of HTTP application development: GET requests should be safe and idempotent, meaning they only retrieve data and can be repeated without side effects. Because of this, as well as privacy concerns (GET parameters are visible in the URL), it is common practice to submit HTML forms with a POST request.

According to the POST convention, the request data is placed in the message body and usually "URL-encoded" (as described in RFC 2396). This is similar to how certain characters are "escaped" when included in a URL (for example, spaces are represented by %20).
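To make the encoding concrete, here is a small sketch that decodes an escaped string and a complete URL-encoded body. It assumes the urls.encoding vocabulary that ships with Factor is available; it is not the parser built in this series, which relies on the (query-string) word from Part 2.

```factor
USING: prettyprint urls.encoding ;

! Percent-escapes are decoded back to the characters they stand for.
"Hello%20World%21" url-decode .

! A whole URL-encoded body parsed into an assoc of names and values.
"name=John%20Doe&lang=factor" query>assoc .
```

Running this in the listener prints the decoded string followed by a hashtable that maps each field name to its decoded value.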

To properly parse these types of POST requests, we will need to parse a few other environment variables that are provided to the CGI script. First, we need a way to parse "content types". As described in RFC 2616 (the specification for HTTP/1.1), a content type consists of a mime-type and optional parameters.

For example, a server can specify that the response will be HTML in the UTF-8 character encoding by including the following in the HTTP response headers:

Content-Type: text/html; charset=utf-8

To parse this, we can simply separate the mime-type from the rest and parse the parameters:

: (content-type) ( string -- params media/type )
    ";" split unclip [
        [ H{ } clone ] [ first (query-string) ] if-empty
    ] dip ;
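Tracing the first half of (content-type) on the example header shows what the split produces. This sketch uses only standard words and deliberately leaves out (query-string), so it can be pasted into the listener on its own:

```factor
USING: kernel prettyprint sequences splitting ;

! Split on ";" and peel off the first piece (the mime-type).
! unclip leaves the remaining pieces underneath it on the stack.
"text/html; charset=utf-8" ";" split unclip

.   ! prints the mime-type
.   ! prints the array of remaining parameter strings
```

The first line printed is "text/html" and the second is { " charset=utf-8" }. Note that the parameter keeps its leading space, so a stricter parser might want to trim whitespace before splitting on "=".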

When we submit a form, the values are sent in the message body, and the CGI script is told that the body has a mime-type of "application/x-www-form-urlencoded". (Technically, some of the parameters could also be specified in the URL.)

We can define a function that will parse the CONTENT_LENGTH, read the specified number of bytes from the stream, and then assemble and parse the URL-encoded query string:

: (urlencoded) ( -- assoc )
    "CONTENT_LENGTH" os-env "0" or string>number
    read [ "" ] [ "&" append ] if-empty
    "QUERY_STRING" os-env [ append ] when* (query-string) ;
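The way the body and the URL parameters are joined is easier to see with literal values. The sketch below uses a hypothetical helper, join-params, that applies the same if-empty and when* logic as (urlencoded) but takes the body and the query string as arguments instead of reading them from the stream and the environment:

```factor
USING: kernel prettyprint sequences ;
IN: scratchpad

! Hypothetical helper, not part of the original script: joins a request
! body and an optional query string the same way (urlencoded) does.
: join-params ( body query/f -- string )
    [ [ "" ] [ "&" append ] if-empty ] dip
    [ append ] when* ;

"a=1&b=2" "c=3" join-params .   ! "a=1&b=2&c=3"
"a=1&b=2" f join-params .       ! "a=1&b=2&"
"" "c=3" join-params .          ! "c=3"
```

The second case shows that a trailing "&" is left behind when there is no query string, so this approach relies on the query-string parser tolerating an empty final pair.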

These two words are sufficient to parse POST requests. However, it's worth noting that some forms can be submitted with a mime-type of "multipart/form-data", which is used for uploading files to servers. We will add a placeholder word to remind us to come back to this:

: (multipart) ( -- assoc )
    "multipart unsupported" throw ;

Now that we have that, we can implement the parsing routine:

: parse-post ( -- assoc )
    "CONTENT_TYPE" os-env "" or (content-type) {
        { "multipart/form-data"               [ (multipart) ] }
        { "application/x-www-form-urlencoded" [ (urlencoded) ] }
        [ drop parse-get ]
    } case nip ;
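The shape of this dispatch may be clearer in isolation. In the sketch below, describe is a hypothetical word that returns a label instead of calling the real parsers. A matching branch runs with the tested value already removed from the stack, while the fallback quotation still receives it, which is why the default branch in parse-post begins with drop:

```factor
USING: combinators kernel prettyprint ;
IN: scratchpad

! Hypothetical word for illustration: same dispatch shape as parse-post,
! but each branch just pushes a label.
: describe ( media/type -- string )
    {
        { "multipart/form-data"               [ "multipart" ] }
        { "application/x-www-form-urlencoded" [ "urlencoded" ] }
        [ drop "fallback" ]
    } case ;

"application/x-www-form-urlencoded" describe .   ! "urlencoded"
"text/plain" describe .                          ! "fallback"
```

In parse-post, the params assoc returned by (content-type) stays underneath the result of the chosen branch, and the trailing nip discards it.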

And then extend our <cgi-form> word to handle POST requests:

: <cgi-form> ( -- assoc )
    "REQUEST_METHOD" os-env "GET" or >upper {
        { "GET"  [ parse-get ] }
        { "POST" [ parse-post ] }
        [ "Unknown request method" throw ]
    } case ;

It is as simple as that: we can now change our form method from "get" to "post", and our CGI scripts will continue to work.
