Tuesday, January 3, 2012

Duplicate Files

A few months ago, Jon Cooper wrote a duplicate file checker in Go and Ruby.

Below, I contribute a simple version in Factor that runs faster than both the Go and Ruby solutions. In the spirit of the original article, I have separated the logic into steps.

Argument Parsing

The command-line vocabulary gives us the arguments passed to the script. We check for the verbose flag and the root directory to traverse:

: arg? ( name args -- args' ? )
    2dup member? [ remove t ] [ nip f ] if ;

: parse-args ( -- verbose? root )
    "--verbose" command-line get arg? swap first ;
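For readers less familiar with Factor's stack shuffling, the same argument handling can be sketched in Python (the function name `parse_args` is my own; this is an illustrative analogue, not the article's code):

```python
def parse_args(argv):
    """Strip --verbose from the argument list if present; the first
    remaining argument is the root directory to traverse."""
    args = list(argv)
    verbose = "--verbose" in args
    if verbose:
        args.remove("--verbose")
    return verbose, args[0]
```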

Filesystem Traversal

We can traverse the filesystem with the each-file word (choosing breadth-first instead of depth-first). In our case, we want to collect these files into a map of all paths that share a common filename:

: collect-files ( path -- assoc )
    t H{ } clone [
        '[ dup file-name _ push-at ] each-file
    ] keep ;

Our duplicate files are those files that share a common filename:

: duplicate-files ( path -- dupes )
    collect-files [ nip length 1 > ] assoc-filter! ;
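The collect-then-filter approach above can be sketched in Python for comparison (names like `find_duplicates` are mine; this mirrors the logic rather than the Factor code exactly):

```python
import os
from collections import defaultdict

def find_duplicates(root):
    """Walk the tree under root, group full paths by filename, and
    keep only those filenames that occur more than once."""
    by_name = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            by_name[name].append(os.path.join(dirpath, name))
    return {name: paths
            for name, paths in by_name.items() if len(paths) > 1}
```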

MD5 Hashing Files

Using the checksums.md5 vocabulary, it is quite simple:

: md5-file ( path -- string )
    md5 checksum-file hex-string ;
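The Python standard-library equivalent, for comparison (chunked reading is my own choice here, so large files need not fit in memory):

```python
import hashlib

def md5_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```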

Printing Results

If verbose is selected, then we print each filename and the MD5 checksum for each full path:

: print-md5 ( name paths -- )
    [ "%s:\n" printf ] [
        [ dup md5-file "  %s\n    %s\n" printf ] each
    ] bi* ;

We put this all together by calculating the possible duplicate files, optionally printing verbose MD5 checksums, and then printing the total number of duplicates detected:

: run-dupe ( -- )
    parse-args duplicate-files swap
    [ dup [ print-md5 ] assoc-each ] when
    assoc-size "Total duped files found: %d\n" printf ;

Performance

I tested performance using two directory trees, one with over 500 files and another with almost 36,000 files. While the original article focuses more on syntax than speed, it is nice to see that the Factor solution is faster than the Go and Ruby versions.

Duplicates    Factor    Go        Ruby
583           1.453     2.298     3.861
35,953        19.084    24.452    30.597

The times above are in seconds on my laptop.

The code for this is on my GitHub.

6 comments:

smartmic said...

Although the basic dupe check is very fast in Factor, applying the --verbose option slows it down beyond the pale. Is the md5 checksum the culprit? Can this be repaired?

I have 77 dupes.
Running time without --verbose option:
Ruby:
real 0m0.089s
user 0m0.072s
sys 0m0.012s

Factor:
real 0m0.033s
user 0m0.012s
sys 0m0.016s

Running time with --verbose option:
Ruby:
real 0m1.547s
user 0m0.980s
sys 0m0.552s

Factor:
real 10m44.037s
user 10m40.688s
sys 0m0.404s

mrjbq7 said...

It does appear that the Ruby --verbose is a little faster, but I'm not seeing your slowdown.

For a tree with 994 duplicate files (making sure to "USE: dupe save" so that it doesn't have to load the vocabulary each time I run it from the command line):

Factor 0.97:

$ time ./factor -run=dupe . > /dev/null

real 0m0.363s
user 0m0.183s
sys 0m0.174s

$ time ./factor -run=dupe --verbose . > /dev/null

real 0m8.435s
user 0m8.078s
sys 0m0.305s

Factor git master:

$ time ./factor -run=dupe . > /dev/null

real 0m0.263s
user 0m0.152s
sys 0m0.105s

$ time ./factor -run=dupe --verbose . > /dev/null

real 0m6.218s
user 0m5.994s
sys 0m0.209s

The MD5 checksum was improved a bit in git master versus 0.97 which explains some of the speedup, but I'm not seeing the huge slowdown you are seeing. What OS, architecture, and version of Factor are you using?

Also, if you profile it in the listener, you can maybe see where the performance loss is:

{ "--verbose" "/path/to/scan" } command-line [
    [ run-dupe ] profile
] with-variable

Then to look at the profile, you can do stuff like:

flat profile.

top-down profile.

etc.

smartmic said...

I use Factor 0.97. The latest development version does not run with the GUI listener. My system is Debian Wheezy x86:
$ uname -a
Linux crunchbang 3.2.0-4-686-pae #1 SMP Debian 3.2.73-2+deb7u1 i686 GNU/Linux

I set up a tree containing 2 directories with duplicate images. To keep the profiler's processing time reasonable, I use just one duplicate:
~/tmp/images $ tree
.
├── sub1
│   └── 20131225_181110.JPG
└── sub2
    ├── 20131215_105753.JPG
    ├── 20131215_154850.JPG
    └── 20131225_181110.JPG

2 directories, 14 files

Here is my listener record:


IN: scratchpad { "--verbose" "." } command-line [ [ run-dupe ] profile ] with-variable
20131225_181110.JPG:
/home/martin/tmp/images/sub1/20131225_181110.JPG
fe434a3cf2c5a66b3d26742f3ce003d0
/home/martin/tmp/images/sub2/20131225_181110.JPG
fe434a3cf2c5a66b3d26742f3ce003d0
Total duped files found: 1

IN: scratchpad flat profile.
depth time ms GC % JIT % FFI % FT %
0 10373.0 24.49 0.00 85.89 0.00 T{ thread f "Listener" ~curry~ ~quotation~ 40 ~box~ f t f H{ } f...
0 10371.0 24.49 0.00 85.88 0.00 add-checksum-bytes
0 10026.0 24.67 0.00 87.43 0.00 M\ md5-state checksum-block
0 3089.0 25.51 0.00 91.97 0.00 bignum-bitand
0 2640.0 24.58 0.00 87.46 0.00 (process-md5-block-G)
0 2504.0 24.52 0.00 87.50 0.00 (process-md5-block-F)
0 2437.0 22.65 0.00 88.31 0.00 (process-md5-block-I)
0 2255.0 27.54 0.00 87.85 0.00 (process-md5-block-H)
0 2115.0 22.74 0.00 91.44 0.00 bignum+
0 1903.0 35.58 0.00 87.91 0.00 alien-unsigned-cell
0 903.0 20.93 0.00 92.36 0.00 bignum-shift
0 723.0 22.13 0.00 85.34 0.00 bignum-bitor
0 439.0 21.87 0.00 86.56 0.00 bignum-bitxor
0 209.0 28.23 0.00 93.78 0.00 bignum-bitnot
0 166.0 0.00 0.00 0.00 0.00 M\ bignum >bignum
0 163.0 0.00 0.00 67.48 0.00 M\ bignum >integer
0 149.0 0.00 0.00 79.87 0.00 set-alien-unsigned-cell
0 139.0 0.00 0.00 79.14 0.00 bignum>=


The output is shortened here. Using more than one duplicate seems to increase the time significantly and non-linearly.
I wanted to try it on my x86_64 server as well, but Factor refused to run without X11 installed.

mrjbq7 said...

Ahh, I see the problem. Thanks for the report and the detail!

You are on 32-bit, and since our fixnums are tagged values (meaning they are smaller than 32 bits, because a few bits are used for the type tag), a lot of the math in md5 ends up using bignums.

That does kind of suck; since most of us now work on 64-bit platforms, the performance of math on 32-bit doesn't get as much attention. I'd like to improve it, and we've discussed ways to do it, but it just hasn't been a focus yet. Two specific ideas, besides speeding up bignum math itself, are 1) demoting values from bignum back to fixnum when they get smaller, and 2) making fixnums native-sized integers. Both of those are a bit of work.
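To make the overflow concrete, here is a back-of-the-envelope sketch in Python (the 30-bit payload width is an assumption for illustration; the exact tag scheme may differ):

```python
# On a 32-bit build, Factor stores type tags in the low bits of a
# machine word, so a fixnum payload is narrower than 32 bits
# (roughly 28-30 usable bits; 30 is assumed here for illustration).
FIXNUM_BITS = 30
FIXNUM_MAX = 2 ** (FIXNUM_BITS - 1) - 1  # largest signed fixnum

# MD5 works on unsigned 32-bit words, so intermediate values like
# this round constant routinely exceed the fixnum range and get
# promoted to heap-allocated bignums:
K1 = 0xD76AA478
assert K1 > FIXNUM_MAX      # overflows a 32-bit-build fixnum
assert K1 <= 2 ** 61 - 1    # fits easily on a 64-bit build
```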

As an aside, I made an improvement to 64-bit which reduces the "dupe" time in my previous example from 6 seconds to just over 3 seconds.

Also, Factor shouldn't require X11 on your server. If you're having that problem, you can try launching the text listener directly:

./factor -run=listener

I'm not sure why it doesn't detect that you are in a tty rather than an X session, so perhaps open a bug for that on our GitHub:

https://github.com/slavapestov/factor/issues

smartmic said...

Thank you for this competent answer; the reason is clearer to me now. From my point of view, don't spend too much effort improving the 32-bit version, though it depends on your platform deployment strategy. I installed a slim 32-bit Linux box on my very old Mac mini and it has worked well so far for most purposes. Can Factor also be deployed on embedded / ARM architectures? Maybe it makes sense to add a hint on the Factor homepage about using the 32-bit version?
I will continue to experiment with Factor and will open a bug for the X11 issue.
Thanks again for the quick support.

mrjbq7 said...

We had a port to ARM that was mostly working, but it got stale and we moved the code to unmaintained/. It would be great to revive it, and given how the compiler is architected, it shouldn't be that difficult to get working.