Wednesday, April 29, 2015

Burrows-Wheeler Transform

The Burrows–Wheeler transform is a reversible method of rearranging text used to improve the performance of compression algorithms, such as bzip2.

We will implement transform, bwt, and inverse transform, ibwt, in both Python and Factor. First with a slow and simple algorithm, and then second with a faster version.

Version 1

We start with the pseudocode suggested in the Wikipedia article:

function BWT (string s)
   append an 'EOF' character to s
   create a table, rows are all possible rotations of s
   sort rows alphabetically
   return (last column of the table)

In Python, this might look like:

def bwt(s):
    s = s + '\0'
    n = len(s)
    m = sorted(s[i:] + s[:i] for i in range(n))
    return ''.join(x[-1] for x in m)

In Factor, using all-rotations, it might look like this:

: bwt ( seq -- seq' )
    0 suffix all-rotations natural-sort [ last ] map ;

The pseudocode to perform the inverse transform:

function inverseBWT (string s)
   create empty table 
       
   repeat length(s) times
       // first insert creates first column
       insert s as a column of table before first column
       sort rows of the table alphabetically
   return (row that ends with the 'EOF' character)

In Python, this might look like:

def ibwt(s):
    n = len(s)
    m = [''] * n
    for _ in range(n):
        m = sorted(s[i] + m[i] for i in range(n))
    return [x for x in m if x.endswith('\0')][0][:-1]

In Factor, we could implement it like this:

: ibwt ( seq -- seq' )
    [ length [ "" <array> ] keep ] keep
    '[ _ [ prefix ] 2map natural-sort ] times
    [ last 0 = ] find nip but-last ;

Unfortunately, this is very slow, with most of the performance loss in the invert transform.

Version 2

Another way to increase the speed of BWT inverse is to use an algorithm that returns an index into the sorted rotations along with the transform.

In Python, it looks like this:

def bwt(s):
    n = len(s)
    m = sorted(s[i:] + s[:i] for i in range(n))
    return m.index(s), ''.join(x[-1] for x in m)

In Factor, it might look like this:

: bwt ( seq -- i seq' )
    dup all-rotations natural-sort
    [ index ] [ [ last ] map ] bi ;

In Python, the inverse transform looks like this:

def ibwt(k, s):
    def row(k):
        permutation = sorted((t, i) for i, t in enumerate(s))
        for _ in s:
            t, k = permutation[k]
            yield t
    return ''.join(row(k))

In Factor, that roughly translates to:

: ibwt ( i seq -- seq' )
    [ length ] [ <enum> sort-values ] bi
    '[ _ nth first2 ] replicate nip ;

An improved version 2 is available in the development version. In particular, it uses rotated virtual sequences for increased performance and returns transformations that match the type of the input sequence.

Tuesday, April 28, 2015

Writing MIDI Files

Previously, I wrote about Reading MIDI Files using Factor.

Now, we are going to create a writer for MIDI files in less than 180 lines of additional code.

Variable-Length Quantity

To write a variable-length integer, we first "reverse" it, tagging the 8th bit of each additional byte. Then, we write each byte out to the output-stream.

: write-number ( n -- )
    [ 0x7f bitand ] keep

    [ -7 shift dup zero? ] [
        [ 8 shift ] dip
        [ 0x7f bitand 0x80 bitor + ] keep
    ] until drop

    [ [ -8 shift ] [ 7 bit? ] bi ]
    [ dup 0xff bitand write1 ] do while drop ;
Note: there is probably a cleaner way to do this. Patches are welcome! ☺

Text

Strings are encoded in UTF-8, prefixed with their encoded length in bytes (as a variable-length quantity).

: write-string ( str -- )
    utf8 encode [ length write-number ] [ write ] bi ;

Writing Events

The three types of events will each have to be handled differently. To do this, we will make a generic method that is given the previous status byte (to enable "running status" for MIDI events) and returns the new status byte.

GENERIC: write-event ( prev-status event -- status )

First, we write MIDI events, implementing the "running status".

: write-status ( prev-status status -- )
    dup 0xf0 < [
        [ = ] keep swap [ drop ] [ write1 ] if
    ] [
        nip write1
    ] if ;

: write-channel ( prev-status value status quot -- status )
    [
        swap [
            "channel" of + [ write-status ] keep
        ] keep
    ] dip call ; inline

M: midi-event write-event
    [ delta>> write-number ] [ value>> ] [ name>> ] tri {

        { "note-off" [
            0x80 [
                [ "note" of write1 ]
                [ "velocity" of write1 ] bi
            ] write-channel ] }
        { "note-on" [
            0x90 [
                [ "note" of write1 ]
                [ "velocity" of write1 ] bi
            ] write-channel ] }
        { "polytouch" [
            0xa0 [
                [ "note" of write1 ]
                [ "value" of write1 ] bi
            ] write-channel ] }
        { "control-change" [
            0xb0 [
                [ "control" of write1 ]
                [ "value" of write1 ] bi
            ] write-channel ] }
        { "program-change" [
            0xc0 [ "program" of write1 ] write-channel ] }
        { "aftertouch" [
            0xd0 [ "value" of write1 ] write-channel ] }
        { "pitchwheel" [
            0xe0 [
                "pitch" of min-pitchwheel -
                [ 0x7f bitand write1 ]
                [ -7 shift write1 ] bi
            ] write-channel ] }

        ! system common messages
        { "sysex" [
            [ drop 0xf0 dup write1 ] dip
            write 0xf7 write1 ] }
        { "quarter-made" [
            [ drop 0xf1 dup write1 ] dip
            [ "frame-type" of 4 shift ]
            [ "frame-value" of + ] bi write1 ] }
        { "songpos" [
            [ drop 0xf2 dup write1 ] dip
            [ 0x7f bitand write1 ]
            [ -7 shift write1 ] bi ] }
        { "song-select" [
            [ drop 0xf3 dup write1 ] dip write1 ] }
        { "tune-request" [ 2drop 0xf6 dup write1 ] }

        ! real-time messages
        { "clock" [ 2drop 0xf8 dup write1 ] }
        { "start" [ 2drop 0xfa dup write1 ] }
        { "continue" [ 2drop 0xfb dup write1 ] }
        { "stop" [ 2drop 0xfc dup write1 ] }
        { "active-sensing" [ 2drop 0xfe dup write1 ] }
        { "reset" [ 2drop 0xff dup write1 ] }
    } case ;

Next, we write meta events:

M: meta-event write-event
    [ delta>> write-number ] [ value>> ] [ name>> ] tri 
    0xff write1 {
        { "sequence-number" [
            B{ 0x00 0x02 } write 2 >be write ] }
        { "text" [ 0x01 write1 write-string ] }
        { "copyright" [ 0x02 write1 write-string ] }
        { "track-name" [ 0x03 write1 write-string ] }
        { "instrument-name" [ 0x04 write1 write-string ] }
        { "lyrics" [ 0x05 write1 write-string ] }
        { "marker" [ 0x06 write1 write-string ] }
        { "cue-point" [ 0x07 write1 write-string ] }
        { "device-name" [ 0x09 write1 write-string ] }
        { "channel-prefix" [ B{ 0x20 0x01 } write write1 ] }
        { "midi-port" [ B{ 0x21 0x01 } write write1 ] }
        { "end-of-track" [ B{ 0x2f 0x00 } write drop ] }
        { "set-tempo" [ B{ 0x51 0x03 } write 3 >be write ] }
        { "smpte-offset" [
            B{ 0x54 0x05 } write {
                [ "frame-rate" of 6 shift ]
                [ "hours" of + write1 ]
                [ "minutes" of write1 ]
                [ "seconds" of write1 ]
                [ "frames" of write1 ]
                [ "subframes" of write1 ]
            } cleave ] }
        { "time-signature" [
            B{ 0x58 0x04 } write {
                [ "numerator" of write1 ]
                [ "denominator" of 2 /i write1 ]
                [ "clocks-per-tick" of write1 ]
                [ "notated-32nd-notes-per-beat" of write1 ]
            } cleave ] }
        { "key-signature" [
            B{ 0x59 0x02 } write
            key-signatures value-at write ] }
        { "sequencer-specific" [
            0x7f write1
            [ length write-number ] [ write ] bi ] }
    } case drop f ;

Finally, we write system-exclusive events:

M: sysex-event write-event
    drop
    [ delta>> write-number ]
    [ type>> write1 ]
    [ bytes>> write ] tri f ;

Writing a MIDI header and tracks, generically as "chunks":

GENERIC: write-chunk ( chunk -- )

M: midi-header write-chunk
    $[ "MThd" >byte-array ] write
    $[ 6 4 >be ] write
    [ format>> ] [ #chunks>> ] [ division>> ] tri
    [ 2 >be write ] tri@ ;

M: midi-track write-chunk
    $[ "MTrk" >byte-array ] write
    binary [
        events>> f swap [ write-event ] each drop
    ] with-byte-writer
    [ length 4 >be write ] [ write ] bi ;

Finally, words to write MIDI objects, either to a byte-array, or to a file.

: write-midi ( midi -- )
    [ header>> write-chunk ]
    [ chunks>> [ write-chunk ] each ] bi ;

: midi> ( midi -- byte-array )
    binary [ write-midi ] with-byte-writer ;

: midi>file ( midi path -- )
    binary [ write-midi ] with-file-writer ;

This is available now in the midi vocabulary.

Friday, April 24, 2015

Reading MIDI Files

MIDI is a specification for music, describing how electronic musical instruments and computers can communicate with each other.

Unlike digital audio formats such as MP3, the Standard MIDI File does not contain sounds, but rather a stream of instructions for playing notes, volume, tempo, and sound effects, as well as track names and other descriptive information. Because of this, MIDI files tend to be much smaller and typically allow the music to be easily rearranged or edited.

Using Factor, we will be creating a parser for reading MIDI files in under 180 lines of code.

Variable-Length Quantity

Some integers will be encoded as variable length, using 7 bits per byte with one bit reserved for the stop bit (indicating you have finished reading the number). This means the numbers 0 through 127 can be encoded in a single byte, but larger numbers will require additional bytes.

: read-number ( -- number )
    0 [ 7 bit? ] [
        7 shift read1 [ 0x7f bitand + ] keep
    ] do while ;

MIDI Events

There are three types of events: MIDI events, system-exclusive events, and meta events. The majority of events will usually be MIDI events, so we will parse those first.

Some MIDI events will include the channel in 4 bits of the status byte, so we handle those separately from the system common and realtime messages.

TUPLE: midi-event delta name value ;

C: <midi-event> midi-event
: read-message ( delta status -- message )
    dup 0xf0 < [
        [
            ! channel messages
            [ 0x0f bitand "channel" ,, ] [ 0xf0 bitand ] bi {
                { 0x80 [ "note-off"
                    read1 "note" ,, read1 "velocity" ,, ] }
                { 0x90 [ "note-on"
                    read1 "note" ,, read1 "velocity" ,, ] }
                { 0xa0 [ "polytouch"
                    read1 "note" ,, read1 "value" ,, ] }
                { 0xb0 [ "control-change"
                    read1 "control" ,, read1 "value" ,, ] }
                { 0xc0 [ "program-change"
                    read1 "program" ,, ] }
                { 0xd0 [ "aftertouch"
                    read1 "value" ,, ] }
                { 0xe0 [ "pitchwheel"
                    read1 read1 7 shift + "pitch" ,, ] }
            } case
        ] H{ } make
    ] [
        {
            ! system common messages
            { 0xf0 [ "sysex" { 0xf7 } read-until drop ] }
            { 0xf1 [ "quarter-made" [
                    read1
                    [ -4 shift "frame-type" ,, ]
                    [ 0x0f bitand "frame-value" ,, ] bi
                ] H{ } make ] }
            { 0xf2 [ "songpos" read1 read1 7 shift + ] }
            { 0xf3 [ "song-select" read1 ] }
            { 0xf6 [ "tune-request" f ] }

            ! real-time messages
            { 0xf8 [ "clock" f ] }
            { 0xfa [ "start" f ] }
            { 0xfb [ "continue" f ] }
            { 0xfc [ "stop" f ] }
            { 0xfe [ "active-sensing" f ] }
            { 0xff [ "reset" f ] }
        } case
    ] if <midi-event> ;

Meta Events

Meta events contain descriptive information such as track name, tempo and time signatures. They are also used to indicate the end of the track has been reached.

TUPLE: meta-event delta name value ;

C: <meta-event> meta-event
: parse-meta ( status bytes -- name value )
    swap {
        { 0x00 [ 2 head be> "sequence-number" ] }
        { 0x01 [ utf8 decode "text" ] }
        { 0x02 [ utf8 decode "copyright" ] }
        { 0x03 [ utf8 decode "track-name" ] }
        { 0x04 [ utf8 decode "instrument-name" ] }
        { 0x05 [ utf8 decode "lyrics" ] }
        { 0x06 [ utf8 decode "marker" ] }
        { 0x07 [ utf8 decode "cue-point" ] }
        { 0x09 [ utf8 decode "device-name" ] }
        { 0x20 [ first "channel-prefix" ] }
        { 0x21 [ first "midi-port" ] }
        { 0x2f [ drop t "end-of-track" ] }
        { 0x51 [ 3 head be> "set-tempo" ] }
        { 0x54 [
            [
                5 firstn {
                    [
                        [ -6 shift "frame-rate" ,, ]
                        [ 0x3f bitand "hours" ,, ] bi
                    ]
                    [ "minutes" ,, ]
                    [ "seconds" ,, ]
                    [ "frames" ,, ]
                    [ "subframes" ,, ]
                } spread
            ] H{ } make "smpte-offset" ] }
        { 0x58 [
            [
                first4 {
                    [ "numerator" ,, ]
                    [ 2 * "denominator" ,, ]
                    [ "clocks-per-tick" ,, ]
                    [ "notated-32nd-notes-per-beat" ,, ]
                } spread
            ] H{ } make "time-signature" ] }
        { 0x59 [ "key-signature" ] }
        { 0x7f [ "sequencer-specific" ] }
    } case swap ;

: read-meta ( delta -- event )
    read1 read-number read parse-meta <meta-event> ;

Sysex Events

For system-exclusive events, which are typically a sequence of bytes that are proprietary to particularly MIDI devices, we just preserve the type (0xf0 or 0xf7) and raw bytes.

TUPLE: sysex-event delta status bytes ;

C: <sysex-event> sysex-event
: read-sysex ( delta status -- event )
    read-number read <sysex-event> ;

Reading Events

We can now read all types of events, dispatching on the status byte.

: read-event ( delta status -- event )
    {
        { 0xf0 [ 0xf0 read-sysex ] }
        { 0xf7 [ 0xf7 read-sysex ] }
        { 0xff [ read-meta ] }
        [ read-message ]
    } case ;

Status bytes can be "running", which means that for channel events they can be dropped from the stream if they are identical to the previous MIDI channel event. Meta events (0xff) do not set the running status.

: read-status ( prev-status -- prev-status' status )
    peek1 dup 0x80 < [
        drop dup
    ] [
        drop read1 dup 0xff = [
            nip dup
        ] unless
    ] if ;

Each event has a header that is the delta-time (encoded as a variable length integer) and the status (which may not be present if it is "running").

: read-event-header ( prev-status -- prev-status' delta status )
    [ read-number ] dip read-status swapd ;

There are a few ways to parse all events from a byte-array, but I thought it was a good opportunity to try out peekable streams, checking if the next event is present.

: parse-events ( data -- events )
    binary <byte-reader> <peek-stream> [
        f [
            peek1 [ read-event-header ] [ f f ] if dup
        ] [ read-event ] produce 2nip nip
    ] with-input-stream ;

Reading MIDI

MIDI files are grouped into a series of chunks. The first chunk is a MIDI header indicating the format (single or multiple simultaneous tracks), number of tracks in the file, and division (indicating how to interpret the delta-times in the file).

TUPLE: midi-header format #chunks division ;

: <midi-header> ( bytes -- header )
    2 cut 2 cut [ be> ] tri@ midi-header boa ;

Typically, that is followed by MIDI tracks, each containing a series of events.

TUPLE: midi-track events ;

: <midi-track> ( bytes -- track )
    parse-events midi-track boa ;

Reading the chunks in the file dispatch off the "chunk type":

: read-chunk ( -- chunk )
    4 read 4 read be> read swap {
        { $[ "MThd" >byte-array ] [ <midi-header> ] }
        { $[ "MTrk" >byte-array ] [ <midi-track> ] }
    } case ;

To read a MIDI stream, we read the header and then all the chunks in the file, storing them in a midi tuple.

TUPLE: midi header chunks ;

C: <midi> midi

: read-header ( -- header )
    read-chunk dup midi-header? t assert= ;

: read-chunks ( header -- chunks )
    #chunks>> [ read-chunk ] replicate ;

: read-midi ( -- midi )
    read-header dup read-chunks <midi> ;

Parsing a MIDI from raw bytes or a file:

: >midi ( byte-array -- midi )
    binary [ read-midi ] with-byte-reader ;

: file>midi ( path -- midi )
    binary [ read-midi ] with-file-reader ;

This is available now in the midi vocabulary.

Monday, April 20, 2015

Long URLs

In a world of 140 characters, space is at a premium (even for the longest tweet ever). It has become very common to shorten URLs when embedding links.

There are a lot of URL shortening services available, including branded ones such as t.co (Twitter), goo.gl (Google), nyti.ms (New York Times), and youtu.be (YouTube). You might not know it, but Factor even includes one in the wee-url web application.

You could use something like the LongURL service to resolve short URLs back to the long URL they point to, but I thought it would be more fun to show how to use Factor to do it!

By default, our http.client automatically follows redirects until exceeding a configurable maximum. We will need to make requests that do not redirect, using HEAD to retrieve only the HTTP headers and not the full contents:

: http-head-no-redirects ( url -- response data )
    <head-request> 0 >>redirects http-request* ;

We use symbols to configure a maximum number of redirects (defaulting to 5) and to store the current number of redirects.

SYMBOL: max-redirects
5 max-redirects set-global

SYMBOL: redirects

We want a word that takes a URL and retrieves the next URL, if redirected. If we exceed our maximum number of redirects, it should throw an error.

: next-url ( url -- next-url redirected? )
    redirects inc
    redirects get max-redirects get <= [
        dup http-head-no-redirects drop
        dup redirect? [
            nip "location" header t
        ] [ drop f ] if
    ] [ too-many-redirects ] if ;

To find the "long URL", just loop until we are no longer redirected:

: long-url ( short-url -- long-url )
    [ [ next-url ] loop ] with-scope ;

To see it work, we can try it out with a short URL that I just made:

IN: scratchpad "http://bit.ly/1J0vm1x" long-url .
"http://factorcode.org/"

Neat!

This code is available on my GitHub.

Saturday, April 18, 2015

Interpolate

Today, I made some minor improvements to the interpolate vocabulary, which provides simple string interpolation and formatting.

We have had the ability to use named variables:

IN: scratchpad "World" "name" set
               "Hello, ${name}" interpolate
Hello, World

But now we can just as easily use stack arguments (numbered from the top of the stack):

IN: scratchpad "Mr." "Anderson"
               "Hello, ${1} ${0}" interpolate
Hello, Mr. Anderson

In any order, even repeated:

IN: scratchpad "James" "Bond"
               "${0}, ${1} ${0}" interpolate
Bond, James Bond

As well as anonymously, by order of arguments:

IN: scratchpad "Roses" "red"
               "${} are ${}" interpolate
Roses are red

And even mix named variables and stack arguments:

IN: scratchpad "Factor" "lang" set
               "cool" "${lang} is ${0}!" interpolate
Factor is cool!

Right now we simply convert objects to human-readable strings using the present vocabulary. In the future, it would be nice to support something like Python's string format specifications, which are similar but slightly different than our printf support.