Friday, July 15, 2011

Detecting Plagiarism

About a month ago, Tom Moertel wrote a simple plagiarism detector in Haskell. I wanted to replicate his functionality using Factor, to contrast the two solutions.

The strategy in this plagiarism detector is fairly simple: calculate "long enough" n-grams that should be fairly unique and see if they are present in a particular piece of "suspect text", converting the common pieces into UPPERCASE so that we can visually see what might be plagiarized.
```USING: command-line grouping io io.encodings.utf8 io.files
kernel math math.parser math.ranges namespaces regexp sequences
sets splitting unicode.case unicode.categories unicode.data ;

IN: plagiarism```
We can split text (using the grouping vocabulary) into consecutive groups of "n" words:
```: n-grams ( str n -- seq )
[ [ blank? ] split-when harvest ] [ <clumps> ] bi* ;```
Given a piece of text suspected to be plagiarized and some sources to compare against, we can compute the common n-grams between the two pieces of text:
```: common-n-grams ( suspect sources n -- n-grams )
[ n-grams ] curry dup [ map concat ] curry bi* intersect ;```
For each common n-gram found, we use a regular expression to find the matching part of the suspect text. The regular expression we will use, looks something like a space or start of line followed by our n-gram and ending with a space or end of line, allowing varying whitespace between words, being case insensitive, and ignoring non-letters within a word:
```: n-gram>regexp ( seq -- regexp )
[ [ Letter? not ] split-when "[\\W\\S]" join ] map
"\\s+" join "(\\s|^)" "(\\s|\$)" surround
"i" <optioned-regexp> ;```
The sequences vocabulary contains a change-nth word to modify a particular element in a sequence. We can create a word to modify several elements easily:
```: change-nths ( indices seq quot: ( elt -- elt' ) -- )
[ change-nth ] 2curry each ; inline```
Using `change-nths` and the "n-gram regexp", we can ch>upper each matching portion of text:
```: upper-matches ( str regexp -- )
[ [ [a,b) ] dip [ ch>upper ] change-nths ] each-match ;```
Using these building blocks, we can build a simple plagiarism detector:
1. Compute the common n-grams between suspect and source texts
2. Create a regular expression for each common n-gram
3. For each match, change the matching characters to uppercase
```: detect-plagiarism ( suspect sources n -- suspect' )
[ dupd ] dip common-n-grams [
dupd n-gram>regexp upper-matches
] each ;```
A "main method" enables this program to run from the command line:
```: run-plagiarism ( -- )
command-line get dup length 3 < [
drop "USAGE: plagiarism N suspect.txt source.txt..." print
] [
[ rest [ utf8 file-contents ] map unclip swap ]
[ first string>number ] bi detect-plagiarism print
] if ;

MAIN: run-plagiarism```
You can see it work by trying to find common 4-grams on some simple text (in this case, I added the word "really"):
```( scratchpad ) "this is a really long piece of text"
{ "this is a long piece of text" }
4 detect-plagiarism print
this is a really LONG PIECE OF TEXT```
It's fast enough for small examples, but not that fast for complete novels, particularly when run on texts with many common n-grams. One idea for improving the speed might be to examine the `common-n-grams` algorithm to return sequences of "n or more". This way, if the text contains a common 7-gram, and you are looking at common 4-grams, then it would have one entry instead of three.

The code for this is on my Github.