About a month ago, Tom Moertel wrote a
simple plagiarism detector in Haskell. I wanted to replicate his functionality using
Factor, to contrast the two solutions.
The strategy in this plagiarism detector is fairly simple: calculate "long enough"
n-grams that should be fairly unique and see if they are present in a particular piece of "suspect text", converting the common pieces into UPPERCASE so that we can visually see what might be plagiarized.
USING: command-line grouping io io.encodings.utf8 io.files
kernel math math.parser math.ranges namespaces regexp sequences
sets splitting unicode.case unicode.categories unicode.data ;
IN: plagiarism
We can split text (using the
grouping vocabulary) into consecutive groups of "n" words:
: n-grams ( str n -- seq )
[ [ blank? ] split-when harvest ] [ <clumps> ] bi* ;
Given a piece of text suspected to be plagiarized and some sources to compare against, we can compute the common n-grams between the two pieces of text:
: common-n-grams ( suspect sources n -- n-grams )
[ n-grams ] curry dup [ map concat ] curry bi* intersect ;
For each common n-gram found, we use a
regular expression to find the matching part of the suspect text. The regular expression we will use, looks something like
a space or start of line followed by our n-gram and ending with a space or end of line, allowing varying whitespace between words, being case insensitive, and ignoring non-letters within a word:
: n-gram>regexp ( seq -- regexp )
[ [ Letter? not ] split-when "[\\W\\S]" join ] map
"\\s+" join "(\\s|^)" "(\\s|$)" surround
"i" <optioned-regexp> ;
The
sequences vocabulary contains a
change-nth word to modify a particular element in a sequence. We can create a word to modify several elements easily:
: change-nths ( indices seq quot: ( elt -- elt' ) -- )
[ change-nth ] 2curry each ; inline
Using
change-nths
and the "n-gram regexp", we can
ch>upper each matching portion of text:
: upper-matches ( str regexp -- )
[ [ [a,b) ] dip [ ch>upper ] change-nths ] each-match ;
Using these building blocks, we can build a simple plagiarism detector:
- Compute the common n-grams between suspect and source texts
- Create a regular expression for each common n-gram
- For each match, change the matching characters to uppercase
: detect-plagiarism ( suspect sources n -- suspect' )
[ dupd ] dip common-n-grams [
dupd n-gram>regexp upper-matches
] each ;
A "main method" enables this program to run from the command line:
: run-plagiarism ( -- )
command-line get dup length 3 < [
drop "USAGE: plagiarism N suspect.txt source.txt..." print
] [
[ rest [ utf8 file-contents ] map unclip swap ]
[ first string>number ] bi detect-plagiarism print
] if ;
MAIN: run-plagiarism
You can see it work by trying to find common 4-grams on some simple text (in this case, I added the word "really"):
( scratchpad ) "this is a really long piece of text"
{ "this is a long piece of text" }
4 detect-plagiarism print
this is a really LONG PIECE OF TEXT
It's fast enough for small examples, but not that fast for complete novels, particularly when run on texts with many common n-grams. One idea for improving the speed might be to examine the
common-n-grams
algorithm to return sequences of "n or more". This way, if the text contains a common 7-gram, and you are looking at common 4-grams, then it would have one entry instead of three.
The code for this is on my
Github.