Sometimes it is useful to be able to tell whether a file should be treated as text or as binary data. Rather than rely on the file extension (which might be missing or wrong), Subversion uses a simple heuristic based on the file contents:
Currently, Subversion just looks at the first 1024 bytes of the file; if any of the bytes are zero, or if more than 15% are not ASCII printing characters, then Subversion calls the file binary.
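That heuristic is short enough to sketch in Python for comparison (the function name is mine, and "printable" here follows the same byte > 31 definition used in the Factor code below):

```python
def looks_binary(data: bytes) -> bool:
    """A sketch of the Subversion heuristic described above."""
    chunk = data[:1024]       # only the first 1024 bytes are examined
    if not chunk:
        return False          # treat an empty file as text
    if 0 in chunk:            # any zero byte means binary
        return True
    # Count control bytes (0-31) and compare against the 15% threshold.
    control = sum(1 for b in chunk if b <= 31)
    return control / len(chunk) > 0.15
```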
Someone implemented this in a library written in Clojure. Here's my take, but in Factor.
Some vocabularies we will use, and a namespace:
USING: io io.encodings.binary io.files kernel math sequences ;
IN: text-or-binary
Checking if any of the bytes are zero:
: includes-zeros? ( seq -- ? ) 0 swap member? ;
The first 32 characters of ASCII (i.e., 0-31) are reserved for non-printing control characters. Checking that over 85% of the characters are printable (and treating an empty sequence as printable):
: majority-printable? ( seq -- ? ) [ t ] [ [ [ 31 > ] count ] [ length ] bi / 0.85 > ] if-empty ;
Then, determining whether a sequence of bytes is text:
: text? ( seq -- ? ) [ includes-zeros? not ] [ majority-printable? ] bi and ;
And implementing the operation to check if a file is text or binary:
: text-file? ( path -- ? ) binary [ 1024 read text? ] with-file-reader ;
Using it is pretty easy:
( scratchpad ) "/usr/share/dict/words" text-file? .
t
( scratchpad ) "/bin/sh" text-file? .
f
The code for this (and some tests) is available on my GitHub.
3 comments:
There's an io.encodings.detection library already that does binary file detection using a trick similar to svn. You should incorporate your code into that and send it upstream. Nice work.
What if the text is in UTF-16 but in a script that's mostly ASCII compatible? You'll have plenty of null bytes then.
UTF-16 is a pretty common encoding for text files on Windows. Or maybe not common, but it's what you usually get if you save something as "Unicode".
@Joakim: That's exactly right, this was designed with ASCII (and UTF-8) in mind. The io.encodings.detect vocabulary handles some Unicode cases; to do this right, it's probably best to have a less simple algorithm!
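A quick Python illustration of Joakim's point: ASCII text encoded as UTF-16 interleaves a zero byte with each character, so any zero-byte check will flag it as binary.

```python
# "hello" in little-endian UTF-16 (no BOM) pairs each character with a NUL:
data = "hello".encode("utf-16-le")
print(data)       # b'h\x00e\x00l\x00l\x00o\x00'
print(0 in data)  # True: the heuristic above would call this binary
```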