The strategy in this plagiarism detector is fairly simple: calculate "long enough" n-grams that should be fairly unique and see if they are present in a particular piece of "suspect text", converting the common pieces into UPPERCASE so that we can visually see what might be plagiarized.
    USING: command-line grouping io io.encodings.utf8 io.files
    kernel math math.parser math.ranges namespaces regexp
    sequences sets splitting unicode.case unicode.categories
    unicode.data ;

    IN: plagiarism

We can split text (using the grouping vocabulary) into consecutive groups of "n" words:

    : n-grams ( str n -- seq )
        [ [ blank? ] split-when harvest ] [ <clumps> ] bi* ;

Given a piece of text suspected to be plagiarized and some sources to compare against, we can compute the common n-grams between the two pieces of text:

    : common-n-grams ( suspect sources n -- n-grams )
        [ n-grams ] curry dup [ map concat ] curry bi* intersect ;

For each common n-gram found, we use a regular expression to find the matching part of the suspect text. The regular expression we will use looks something like a space or start of line followed by our n-gram and ending with a space or end of line, allowing varying whitespace between words, being case insensitive, and ignoring non-letters within a word:

    : n-gram>regexp ( seq -- regexp )
        [ [ Letter? not ] split-when "[\\W\\S]" join ] map
        "\\s+" join "(\\s|^)" "(\\s|$)" surround
        "i" <optioned-regexp> ;

The sequences vocabulary contains a change-nth word to modify a particular element in a sequence. We can create a word to modify several elements easily:

    : change-nths ( indices seq quot: ( elt -- elt' ) -- )
        [ change-nth ] 2curry each ; inline

Using change-nths and the "n-gram regexp", we can ch>upper each matching portion of text:

    : upper-matches ( str regexp -- )
        [ [ [a,b) ] dip [ ch>upper ] change-nths ] each-match ;

Using these building blocks, we can build a simple plagiarism detector:

- Compute the common n-grams between suspect and source texts
- Create a regular expression for each common n-gram
- For each match, change the matching characters to uppercase

    : detect-plagiarism ( suspect sources n -- suspect' )
        [ dupd ] dip common-n-grams
        [ dupd n-gram>regexp upper-matches ] each ;

A "main method" enables this program to run from the command line:

    : run-plagiarism ( -- )
        command-line get dup length 3 < [
            drop "USAGE: plagiarism N suspect.txt source.txt..." print
        ] [
            [ rest [ utf8 file-contents ] map unclip swap ]
            [ first string>number ] bi
            detect-plagiarism print
        ] if ;

    MAIN: run-plagiarism

You can see it work by trying to find common 4-grams on some simple text (in this case, I added the word "really"):

    ( scratchpad ) "this is a really long piece of text"
                   { "this is a long piece of text" }
                   4 detect-plagiarism print
    this is a really LONG PIECE OF TEXT

It's fast enough for small examples, but not that fast for complete novels, particularly when run on texts with many common n-grams. One idea for improving the speed might be to examine the common-n-grams algorithm to return sequences of "n or more". This way, if the text contains a common 7-gram, and you are looking at common 4-grams, then it would have one entry instead of three.

The code for this is on my GitHub.
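As a rough sketch of that "n or more" idea (not the code in the repository): walk the suspect's n-grams in order, split wherever an n-gram is not in the common set, and collapse each run of consecutive common n-grams into one longer word sequence. The names merge-run and common-runs are hypothetical, and the sketch assumes the n-grams and common-n-grams words from this post are already defined in the plagiarism vocabulary.

```factor
USING: arrays kernel sequences sets splitting ;
IN: plagiarism

! Collapse a run of overlapping n-grams into one word sequence:
! keep the first n-gram whole, then add the last word of each
! following n-gram.
: merge-run ( clumps -- words )
    unclip-slice swap [ last ] map append ;

! Hypothetical variant of common-n-grams that returns maximal
! runs of "n or more" common words, in suspect-text order.
: common-runs ( suspect sources n -- runs )
    [ common-n-grams ] [ nip n-grams >array ] 3bi
    swap [ in? not ] curry split-when harvest
    [ merge-run ] map ;
```

With "this is a really long piece of text" as the suspect, { "this is a long piece of text" } as the sources and n = 2, this should produce two runs ("this is a" and "long piece of text") rather than five separate 2-grams. Since n-gram>regexp accepts a word sequence of any length, the runs could in principle be passed to it in place of the individual n-grams, though that is untested here.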