Thursday, November 10, 2011

wc -l

The wc program is a utility that can count the number of lines in a file. It has a number of options that are described on its man page that can change its function to count characters or word or produce the length of the longest line.

For the last month, Joe Groff has been improving the performance of Factor's I/O libraries. Yesterday, we were investigating slow performance when doing lots of small reads from a file. Joe was able to make a number of nice speedups, which will be in the next release of Factor.

After suggesting that we use wc -l as a benchmark to aspire to, we came up with several approaches with varying performance. I want to demonstrate these, using timing information from running it on my computer. Although the Factor image file is binary data, we are going to count the number of newline characters in it.

Note: some of these require the latest development version of Factor to run.

Our "gold standard" will be wc -l, which takes just over 0.1 seconds:

$ time wc -l factor.image
 6149212 factor.image

real 0m0.111s
user 0m0.090s
sys  0m0.020s

Our first attempt in Factor is the shortest amount of code but takes 5.8 seconds (you can time this by running "USE: system [ image wc-file-lines ] time"):

: wc-file-lines ( path -- n )
    binary file-lines length ;

We don't really need the lines, just their count, so perhaps just increment each-line in a loop. This is an improvement at just over 3 seconds:

: wc-each-line ( path -- n )
    binary [ 0 [ drop 1 + ] each-line ] with-file-reader ;

Trying to use read-until to look for the next newline, is a bit slower at 3.4 seconds:

: wc-read-until ( path -- n )
    binary [
        0 [ "\n" read-until [ drop 1 + ] dip ] loop
    ] with-file-reader ;

Instead of reading each line at a time, we can just read 65,536 byte blocks and count the number of newlines. This takes about 1.5 seconds:

: wc-each-block ( path -- n )
    binary [
        0 [ [ CHAR: \n = ] count + ] each-block
    ] with-file-reader ;

Since we are only counting characters in each block, we don't need to allocate and copy the bytes out of the I/O buffer. Instead, we can look at a slice. This takes about 1 second:

: wc-each-block-slice ( path -- n )
    binary [
        0 [ [ CHAR: \n = ] count + ] each-block-slice
    ] with-file-reader ;

The stream functions (such as read) operate on an input-stream dynamic variable, which introduces some overhead. If we remove that, the compiler can eliminate some of the dynamic dispatches. taking 0.320 seconds:

: wc-fast ( path -- n )
    binary <file-reader> [
        0 swap [
            [ CHAR: \n = ] count +
        ] each-stream-block-slice
    ] with-disposal ;

If we make an assumption that the number of lines in a file will fit into a fixnum, then we can get a bit faster at 0.240 seconds:

: wc-faster ( path -- n )
    binary <file-reader> [
        0 swap [
            [ CHAR: \n = ] count + >fixnum
        ] each-stream-block-slice
    ] with-disposal ;

And, if we cheat and just run the wc -l process directly, we can get to 0.210 seconds:

: wc-system ( path -- n )
    "wc -l " prepend utf8 [
        readln " " split harvest first string>number
    ] with-process-reader ;

Overall, not too bad! Factor is getting within shouting distance of programs written in "faster" languages. Probably with a few more hours, we could close the gap, but thats enough for today.

Update: using SIMD, Joe was able to get the time down to 0.120 seconds!
: aligned-slices ( seq -- head tail )
   dup length 15 bitnot bitand cut-slice ; inline

: wc-simd ( path -- n )
       0 swap binary <file-reader> &dispose [
           aligned-slices [
               uchar-16 cast-array swap
               [ 10 uchar-16-with v= vcount + >fixnum ] reduce
           ] [ [ 10 = ] count + >fixnum ] bi*
       ] each-stream-block-slice
   ] with-destructors ;


asdf said...

I think you meant "with-disposal" rather than "with-destructors" in the wc-fast and wc-faster examples.

dzsekijo said...

This is an interesting post and raises a lot of questions for a factor newb. I got lost from the half of it, I'd appreciate some elaboration...

- wc-each-block-slice: what's the trick here? How does it exactly improves on the previous one? I think I understand what a slice is but I don't see how it is made use of in I/O context.
- wc-fast: regarding the need with-destructors (or with-disposal?) -- I see that each-block-slice is defined in terms of each-stream-block-slice, but in that definition there is no with-destructors. Why do we need it at the direct invocation of each-block-slice when it's not there when the call to each-block-slice is done in the lib?
- wc-faster: AFAICS >fixnum converts an existing value to a fixnum. So naively it seems that it's actually an overhead -- + gives back whatever value it is to give back, then post festam it's casted to fixnum. (I would understand if you used explicit fixnum arithmetics like fixnum+, but you use generic arithmetics and a subsequent fixnum cast!) I suppose the compiler is doing some tricks here. What pattern is being adhered to here which ensures that the compiler can do the optimization and use fixnums all over?
- wc-system: this is actually the weirdest. As you say, by a command line invocation of "wc -l" you terminate in 0.111 secs, while, if you do it through factor, you get 0.21, almost the double of that! How come that process spawning has such a huge overhead compared to shell? Factor is as advanced that it can utilize SIMD instructions but cannot fire up a process efficiently?

Doug Coleman said...

wc-each-block-slice works by returning a slice of the buffer that holds the stream data instead of copying the data from that buffer and working on it, as each-block does.

wc-fast should use with-disposal, as right now the file-reader object is leaking. The disposal is for the file-reader, which holds the file descriptor. The garbage collector takes care of cleaning up any memory used by each-block-slice.

wc-fast uses + the generic word, which for integers, could add two bignums, or even overflow from fixnums to a bignum result. By calling >fixnum after the fact in wc-faster, the + can be replaced with fixnum+ and the overflow check eliminated. Just calling fixnum+ itself is not enough, as the result could overflow.

For wc-system, it seems to be overhead in waiting for the result of the program. We open a process-reader and read the output from the stdout of the wc program. There is probably room to optimize our thread scheduler here, however the time penalty doesn't scale with the amount of work the system program is doing. We're adding about .04s of overhead (.176s wc -l from my terminal, .216s from Factor.)

Doug Coleman said...

Side note: the Factor SIMD version is faster than the system wc on my machine. (0.149s vs 0.165s)

Also, to run it, you need to do these at the top:

USE: math.vectors.simd
SIMD-128: uchar-16