Friday, July 16, 2010

Parsing Configuration Files

The other day I needed a parser for INI-style configuration files. When I couldn't find a convenient Factor vocabulary to do this, I decided to write one.

A basic configuration file could look like this:

[owner]
name=John Doe
e-mail=john.doe@example.com

[database]
host=127.0.0.1  # change to production when ready
port=1234
username=test
password="a really long string"

These configurations are essentially named groups of name/value pairs, so they can be naturally expressed as an assoc that maps each section name to an assoc of its properties. We will implement a simple API for reading and writing:

: read-ini ( -- assoc )

: write-ini ( assoc -- )

: string>ini ( str -- assoc )

: ini>string ( assoc -- str )
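
For the sample file above, the assoc returned by the basic parser developed in this post would look roughly like the literal below. This is an illustrative sketch rather than output copied from a session: key order in a hashtable is not significant, and the quotes around the password are kept because the basic version does not handle quoted strings.

```factor
H{
    { "owner" H{
        { "name" "John Doe" }
        { "e-mail" "john.doe@example.com" }
    } }
    { "database" H{
        { "host" "127.0.0.1" }   ! trailing comment and spaces removed
        { "port" "1234" }        ! values stay strings
        { "username" "test" }
        { "password" "\"a really long string\"" }
    } }
}
```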

This implementation uses these vocabularies:

USING: arrays assocs combinators formatting hashtables io
io.streams.string kernel make math sequences strings
strings.parser ;

Some utility words are used to trim spaces from tokens, extract strings from section names (e.g., "[database]"), and remove comments from lines:

: unspace ( str -- str' )
    [ " \t\n\r" member? ] trim ;

: unwrap ( str -- str' )
    1 swap [ length 1 - ] keep subseq ;

: uncomment ( str -- str' )
    CHAR: # over index [ head ] when* ;
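
A quick way to see what these words do is to try them in the listener. The snippet below assumes the three definitions above are already loaded; the input strings are made up for illustration.

```factor
USING: prettyprint ;

"  port = 1234 \t" unspace .               ! "port = 1234"
"[database]" unwrap .                      ! "database"
"host=127.0.0.1  # change me" uncomment .  ! "host=127.0.0.1  "
"port=1234" uncomment .                    ! "port=1234"
```

Note that uncomment leaves the spaces that preceded the '#', which is why the parser applies unspace after it.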

There are a variety of parsing strategies we could use here. To keep things simple, we will parse the configuration file line by line. We will also assume that each non-blank line contains either a "[section]" or a "name=value" (but not both).

We know a line is a section if it starts with '[' and ends with ']':

: section? ( line -- ? )
    [ first CHAR: [ = ] [ last CHAR: ] = ] bi and ;

The current section is parsed and stored as a two-element array containing the name of the section and a vector of name/value pairs:

: [section] ( line -- section )
    unwrap unspace V{ } clone 2array ;

Each name/value pair is parsed and added to the vector of name/value pairs in the current section:

: name=value ( section line -- section' )
    CHAR: = over index cut rest [ unspace ] bi@
    2array over second push ;
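
To make the intermediate representation concrete, here is a short listener sketch. It assumes the words defined so far are loaded; the sample lines are illustrative.

```factor
USING: prettyprint ;

"[database]" section? .     ! t
"port=1234" section? .      ! f

! A fresh section is a name paired with an empty vector.
"[database]" [section] .    ! { "database" V{ } }

! Each name=value line pushes a { name value } pair onto that vector.
"[database]" [section] "host = 127.0.0.1" name=value .
! { "database" V{ { "host" "127.0.0.1" } } }
```

Spaces around the '=' are dropped because both halves go through unspace.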

We will use the make vocabulary to accumulate the parsed sections. When we encounter a new section, or the end of the file, we append the current section to the sequence being built by make:

: section, ( section/f -- )
    [ first2 >hashtable 2array , ] when* ;

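: parse-line ( section line -- section' )
    ! Strip the comment and surrounding whitespace; skip blank lines.
    uncomment unspace [
        dup section?
        ! A header flushes the current section and starts a new one.
        [ swap section, [section] ] [ name=value ] if
    ] unless-empty ;

: read-ini ( -- assoc )
    [
        ! Start with no section (f) and flush the last one at end of input.
        f [ parse-line ] each-line section,
    ] { } make >hashtable ;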

Implementing write-ini is pretty easy. It's just a matter of iterating over each section in the specified assoc, and printing its name and its name/value pairs with some minor structure:

: write-ini ( assoc -- )
    [
        [ "[%s]\n" printf ] dip
        [ "%s=%s\n" printf ] assoc-each
        nl
    ] assoc-each ;
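
As a small illustration, assuming write-ini is loaded as defined above, a one-section assoc is written like this (a single section and a single key are used so that hashtable ordering does not affect the output):

```factor
H{ { "owner" H{ { "name" "John Doe" } } } } write-ini
! Prints:
! [owner]
! name=John Doe
! (followed by a blank line)
```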

The string>ini and ini>string words are easy too. Both the read-ini and write-ini words operate on input and output streams, so we can use string streams:

: string>ini ( str -- assoc )
    [ read-ini ] with-string-reader ;

: ini>string ( assoc -- str )
    [ write-ini ] with-string-writer ;
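
A short usage sketch, assuming all of the words in this post are loaded. The input string is made up for illustration, and values are looked up with at so the example does not depend on how a hashtable is printed.

```factor
USING: assocs kernel prettyprint ;

! Parse a string and look up database/port.
"[database]\nhost=127.0.0.1  # local\nport=1234\n" string>ini
"database" swap at "port" swap at .    ! "1234"

! Writing and re-reading a simple assoc gives back an equal assoc.
H{ { "owner" H{ { "name" "John Doe" } } } }
dup ini>string string>ini = .          ! t
```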

This was a really simple implementation. In addition to the basics, I wanted to be able to support:

  • Embedded escape characters (e.g., "\t", "\n", etc.).
  • Line continuations (e.g., multi-line values).
  • Java .properties files.
  • Liberal parsing of minor formatting errors.
  • Both '#' and ';' comment characters.
  • Quoted strings (e.g., name="value").
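
As one example of how small these extensions can be, here is a minimal sketch of comment stripping that accepts both '#' and ';'. The word name uncomment* is hypothetical and this is not the code from the full version; it also ignores quoting, so a '#' or ';' inside a quoted value would be treated as a comment.

```factor
USING: kernel sequences ;

! Cut the line at the first '#' or ';', if there is one.
: uncomment* ( str -- str' )
    dup [ "#;" member? ] find drop [ head ] when* ;
```

With this definition, "port=1234 ; default" uncomment* leaves "port=1234 " on the stack, and a line without a comment character is returned unchanged.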

You can find all that and more (along with tests and some documentation) on my GitHub. I hope to contribute it to the main repository soon.

2 comments:

Anonymous said...

I've been puzzling over this for a while but don't understand why you used a vector sequence type below:

: [section] ( line -- section )
    unwrap unspace V{ } clone 2array ;

... when in the 'section,' word definition, you unroll the section array and convert the vector into a hashtable.

Why not define [section] as:

: [section] ( line -- section )
    unwrap unspace H{ } clone 2array ;

... and save the conversion?

mrjbq7 said...

Good idea. I had originally thought to keep the properties in insertion order (maybe using an association list), but then changed to use hashing for random access.

I updated my code on GitHub to remove the intermediate vector.