The TXON, also known as "Text Object Notation", is a proposed format for structured data.
Much less popular than other formats such as JSON, XML, or even INI files - I thought it would still be fun to implement encode and decode words in Factor.
An example TXON
might look something like this:
Factor:` url:`http://factorcode.org` development:`Started in 2003` license:`Open source (BSD license)` influences:`Forth, Lisp, and Smalltalk` `
Encoding
Since TXON uses "`" characters to delimit values, we need to escape them:
: encode-value ( string -- string' ) R" `" "\\`" re-replace ;
To implement encoding in a generic way, we dispatch on the type of object being encoded:
GENERIC: >txon ( object -- string ) M: sequence >txon [ >txon ] map "\n" join ; M: assoc >txon >alist [ first2 [ encode-value ] [ >txon ] bi* "%s:`%s`" sprintf ] map "\n" join ; M: string >txon encode-value ; M: number >txon number>string >txon ;
Decoding
Although the TXON specification includes an EBNF grammar, I am going to show one way to build a parser from scratch. In the tradition of concatenative languages, we will build our decoder from several smaller words.
For symmetry with the encode-value
word, we need a way to unescape the `
characters:
: decode-value ( string -- string' ) R" \\`" "`" re-replace ;
Since the TXON format is a series of name:`value`
pairs, we can parse the name by finding the separator and then decoding the name (which might contain escaped characters):
: parse-name ( string -- remain name ) ":`" split1 swap decode-value ;
To build a word that finds the first (unescaped) `
character, we will first make a word that looks at adjacent characters, returning true if the second character is an unescaped `
:
: `? ( ch1 ch2 -- ? ) [ CHAR: \ = not ] [ CHAR: ` = ] bi* and ;
By grouping the string into adjacent characters, we can find the first unescaped `
(specially handling the case where the first character is an `
):
: (find-`) ( string -- n/f ) 2 clump [ first2 `? ] find drop [ 1 + ] [ f ] if* ; : find-` ( string -- n/f ) dup ?first CHAR: ` = [ drop 0 ] [ (find-`) ] if ;
Parsing the value is slightly complicated by the fact that TXON supports values which might themselves be a single value, a sequence of values, or a series of name/value pairs. Basically, that means we need to:
- find the first
`
character - checks if the previous character is a
:
(indicating a name/value) - parse all name/values if so, otherwise decode the value(s)
That algorithm can be translated into this code:
DEFER: name/values : (parse-value) ( string -- values ) decode-value string-lines dup length 1 = [ first ] when ; : parse-value ( string -- remain value ) dup find-` [ dup 1 - pick ?nth CHAR: : = [ drop name/values ] [ cut swap (parse-value) ] if [ rest [ blank? ] trim-head ] dip ] [ f swap ] if* ;
We want to parse a "name=value" pair, which should be as easy as parsing the name, then the value, then associating into a hashtable:
: (name=value) ( string -- remain term ) parse-name [ parse-value ] dip associate ;
The string might contain a "name=value" pair, or just a single value:
: name=value ( string -- remain term ) [ blank? ] trim ":`" over subseq? [ (name=value) ] [ f swap ] if ;
We finish by building a word to produce all "name=value" pairs, used in the parse-value
word earlier.
: name/values ( string -- remain terms ) [ dup { [ empty? not ] [ first CHAR: ` = not ] } 1&& ] [ name=value ] produce assoc-combine ;
Putting all of that together, we can make a word to parse a TXON string, producing "name=value" pairs until exhausted:
: parse-txon ( string -- objects ) [ dup empty? not ] [ name=value ] produce nip ; : txon> ( string -- object ) parse-txon dup length 1 = [ first ] when ;
Try It
You can try this out in the listener:
IN: scratchpad H{ { "a" "123" } } >txon . "a:`123`" IN: scratchpad "a:`123`" txon> . H{ { "a" "123" } }
Can you improve on this? Maybe by using the peg.ebnf
vocabulary to create an EBNF parsing word?
The code for this (and a bunch of tests) are on my Github.