Friday, July 15, 2011

Detecting Plagiarism

About a month ago, Tom Moertel wrote a simple plagiarism detector in Haskell. I wanted to replicate his functionality using Factor, to contrast the two solutions.

The strategy in this plagiarism detector is fairly simple: calculate "long enough" n-grams that should be fairly unique and see if they are present in a particular piece of "suspect text", converting the common pieces into UPPERCASE so that we can visually see what might be plagiarized.
USING: command-line grouping io io.encodings.utf8 io.files
kernel math math.parser math.ranges namespaces regexp sequences
sets splitting unicode.categories ;

IN: plagiarism
We can split text (using the grouping vocabulary) into consecutive groups of "n" words:
: n-grams ( str n -- seq )
    [ [ blank? ] split-when harvest ] [ <clumps> ] bi* ;
Given a piece of text suspected to be plagiarized and some sources to compare against, we can compute the common n-grams between the two pieces of text:
: common-n-grams ( suspect sources n -- n-grams )
    [ n-grams ] curry dup [ map concat ] curry bi* intersect ;
For each common n-gram found, we use a regular expression to find the matching part of the suspect text. The regular expression we will use, looks something like a space or start of line followed by our n-gram and ending with a space or end of line, allowing varying whitespace between words, being case insensitive, and ignoring non-letters within a word:
: n-gram>regexp ( seq -- regexp )
    [ [ Letter? not ] split-when "[\\W\\S]" join ] map
    "\\s+" join "(\\s|^)" "(\\s|$)" surround
    "i" <optioned-regexp> ;
The sequences vocabulary contains a change-nth word to modify a particular element in a sequence. We can create a word to modify several elements easily:
: change-nths ( indices seq quot: ( elt -- elt' ) -- )
    [ change-nth ] 2curry each ; inline
Using change-nths and the "n-gram regexp", we can ch>upper each matching portion of text:
: upper-matches ( str regexp -- )
    [ [ [a,b) ] dip [ ch>upper ] change-nths ] each-match ;
Using these building blocks, we can build a simple plagiarism detector:
  1. Compute the common n-grams between suspect and source texts
  2. Create a regular expression for each common n-gram
  3. For each match, change the matching characters to uppercase
: detect-plagiarism ( suspect sources n -- suspect' )
    [ dupd ] dip common-n-grams [
        dupd n-gram>regexp upper-matches
    ] each ;
A "main method" enables this program to run from the command line:
: run-plagiarism ( -- )
    command-line get dup length 3 < [
        drop "USAGE: plagiarism N suspect.txt source.txt..." print
    ] [
        [ rest [ utf8 file-contents ] map unclip swap ]
        [ first string>number ] bi detect-plagiarism print
    ] if ;

MAIN: run-plagiarism
You can see it work by trying to find common 4-grams on some simple text (in this case, I added the word "really"):
( scratchpad ) "this is a really long piece of text"
               { "this is a long piece of text" }
               4 detect-plagiarism print
this is a really LONG PIECE OF TEXT
It's fast enough for small examples, but not that fast for complete novels, particularly when run on texts with many common n-grams. One idea for improving the speed might be to examine the common-n-grams algorithm to return sequences of "n or more". This way, if the text contains a common 7-gram, and you are looking at common 4-grams, then it would have one entry instead of three.

The code for this is on my Github.

No comments: