The other day I needed a parser for INI-style configuration files. When I couldn't find a convenient Factor vocabulary to do this, I decided to write one.
A basic configuration file could look like this:
[owner]
name=John Doe
e-mail=john.doe@example.com
[database]
host=127.0.0.1 # change to production when ready
port=1234
username=test
password="a really long string"
These configurations are essentially groups of name/value pairs, and can be naturally expressed as an assoc. We will be implementing a simple API for reading and writing:
: read-ini ( -- assoc )
: write-ini ( assoc -- )
: string>ini ( str -- assoc )
: ini>string ( assoc -- str )
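To make the target data structure concrete, here is a hand-written sketch of the nested assoc that the sample file above should map to. It is based on tracing the simple parser developed in this post, not on captured output. The constant name example-ini is made up for illustration. Note that every value stays a string, and the quotes around the password are kept because this basic version does not handle quoted strings.

```factor
USING: assocs prettyprint ;
IN: scratchpad

! Hypothetical constant: the assoc expected for the sample file,
! written by hand (section name -> assoc of name/value pairs).
CONSTANT: example-ini H{
    { "owner" H{
        { "name" "John Doe" }
        { "e-mail" "john.doe@example.com" }
    } }
    { "database" H{
        { "host" "127.0.0.1" }
        { "port" "1234" }
        { "username" "test" }
        { "password" "\"a really long string\"" }
    } }
}

! Look up a section first, then a name within that section.
"port" "database" example-ini at at .   ! "1234"
```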
This implementation uses these vocabularies:
USING: arrays assocs combinators formatting hashtables io io.streams.string kernel make math sequences strings strings.parser ;
Some utility words are used to trim spaces from tokens, extract strings from section names (e.g., "[database]"), and remove comments from lines:
: unspace ( str -- str' )
    [ " \t\n\r" member? ] trim ;

: unwrap ( str -- str' )
    1 swap [ length 1 - ] keep subseq ;

: uncomment ( str -- str' )
    CHAR: # over index [ head ] when* ;
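As a quick check of what these helpers do, the following sketch can be pasted into the listener. The three definitions are repeated from above only so that the snippet loads on its own, and the sample strings are made up for illustration.

```factor
USING: kernel math prettyprint sequences ;
IN: scratchpad

! Definitions repeated from the post so this snippet loads on its own.
: unspace ( str -- str' ) [ " \t\n\r" member? ] trim ;
: unwrap ( str -- str' ) 1 swap [ length 1 - ] keep subseq ;
: uncomment ( str -- str' ) CHAR: # over index [ head ] when* ;

"  name \t" unspace .                        ! "name"
"[database]" unwrap .                        ! "database"
"port=1234 # dev only" uncomment .           ! "port=1234 " (trailing space kept)
"port=1234 # dev only" uncomment unspace .   ! "port=1234"
```

One consequence of this design is that uncomment cuts at the first '#' wherever it appears, so a value that legitimately contains a '#' would be truncated by this simple version.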
There are a variety of parsing strategies we could use here. To keep things simple, we will be parsing the configuration file line-by-line. Also, we will make the assumption that each line contains either a "[section]" or a "name=value" (but not both).
We know a line is a section if it starts with '[' and ends with ']':
: section? ( line -- ? )
    [ first CHAR: [ = ] [ last CHAR: ] = ] bi and ;
The current section is parsed and stored as a two-element array containing the name of the section and a vector of name/value pairs:
: [section] ( line -- section )
    unwrap unspace V{ } clone 2array ;
Each name/value is parsed and added to the vector of name/value pairs in the current section:
: name=value ( section line -- section' )
    CHAR: = over index cut rest [ unspace ] bi@ 2array over second push ;
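The splitting step inside name=value can be tried on its own using only library words. This is a small sketch with a made-up input line; it stops before the unspace calls so that the raw pieces are visible.

```factor
USING: arrays kernel prettyprint sequences ;

"port = 1234"
CHAR: = over index   ! position of the first '=' (5 for this string)
cut                  ! split into "port " and "= 1234"
rest                 ! drop the leading '=' from the second piece
2array .             ! { "port " " 1234" }
```

Because index finds only the first '=', any later '=' characters stay in the value. A line with no '=' at all makes index return f, which cut cannot use as a position, so this basic version expects well-formed lines.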
We will be using the make words. When we encounter a new section, or the end of the file, we will append the current section to the sequence of sections being built by make:
: section, ( section/f -- )
    [ first2 >hashtable 2array , ] when* ;

: parse-line ( section line -- section' )
    uncomment unspace [
        dup section?
        [ swap section, [section] ]
        [ name=value ] if
    ] unless-empty ;

: read-ini ( -- assoc )
    [
        f [ parse-line ] each-line section,
    ] { } make >hashtable ;
Implementing write-ini is pretty easy. It's just a matter of iterating over the sections in the specified assoc, and printing each section name followed by its name/value pairs:
: write-ini ( assoc -- )
    [
        [ "[%s]\n" printf ] dip
        [ "%s=%s\n" printf ] assoc-each nl
    ] assoc-each ;
The string>ini and ini>string words are easy too. Both the read-ini and write-ini words operate on input and output streams, so we can use string streams:
: string>ini ( str -- assoc )
    [ read-ini ] with-string-reader ;

: ini>string ( assoc -- str )
    [ write-ini ] with-string-writer ;
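For readers unfamiliar with string streams, here is a minimal sketch of the mechanism these two words rely on, using only standard stream words and made-up sample strings. Inside with-string-reader, words that read from the current input stream read from the string instead. Inside with-string-writer, anything printed is collected into a string.

```factor
USING: arrays io io.streams.string prettyprint ;

! readln reads from the current input stream; here that is the string.
"one\ntwo\n" [ readln readln 2array ] with-string-reader .   ! { "one" "two" }

! print writes to the current output stream; here the output is captured.
[ "hello" print ] with-string-writer .   ! "hello\n"
```

This is why read-ini and write-ini need no changes to work on strings: they never name a stream explicitly.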
This was a really simple implementation. In addition to the basics, I wanted to be able to support:
- Embedded escape characters (e.g., "\t", "\n", etc.).
- Line continuations (e.g., multi-line values).
- Java .properties files.
- Liberal parsing of minor formatting errors.
- Both '#' and ';' comment characters.
- Quoted strings (e.g., name="value").
You can find all that and more (along with tests and some documentation) on my Github. I hope to contribute it to the main repository soon.
I've been puzzling over this for a while but don't understand why you used a vector sequence type below:
: [section] ( line -- section )
unwrap unspace V{ } clone 2array ;
... when in the 'section,' word definition, you unroll the section array and convert the vector into a hashtable.
Why not define [section] as:
: [section] ( line -- section )
unwrap unspace H{ } clone 2array ;
... and save the conversion?
Good idea. I had originally thought to keep the properties in insertion order (maybe using an association list), but then changed to use hashing for random access.
I updated my code on Github to remove the intermediate vector.