Sherlock works on text files such as essays, computer source code files, and other assignments in digital form. It will even work with Tar files, but not compressed archives such as Gzipped or Zipped files (you'll have to unzip those files first).
    sherlock *.txt
That will compare all the text files in the current directory and produce a listing of the most similar files, together with a percentage similarity index.
To compare source files, you might use it like this:
    sherlock *.java
Actually, it's a good idea to redirect the output into a file, so you can examine it in detail. Otherwise it'll just flash past very quickly. To redirect the output into a file, you use the > symbol:
    sherlock *.java > results.txt
This creates a file called "results.txt" which contains the results.
    sherlock -t 80% -z 3 -n 2 -o results.txt *.java
    sherlock -t 50% -o results.txt *.txt
    sherlock -t 0% *.java    # reports all similarity indexes
The output lists the similarity index for each pair of files. This index is a percentage, where 0% means no detected similarity and 100% means the files are very likely to be highly similar. 100% does not mean that the files are exactly the same, since the Sherlock program randomly throws away some data in order to perform a faster match.
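To see why 100% need not mean identical files, here is a minimal sketch of a sampled-signature comparison. It is not Sherlock's actual code: the chunking by words, the CRC32 hash, the rule for discarding hashes, and the names `signature`, `similarity`, `words_per_chunk` and `zero_bits` are all assumptions made for illustration.

```python
import zlib


def signature(text, words_per_chunk=3, zero_bits=2):
    """Return the set of retained chunk hashes for `text` (illustrative only)."""
    words = text.split()
    kept = set()
    for i in range(len(words) - words_per_chunk + 1):
        chunk = " ".join(words[i:i + words_per_chunk])
        h = zlib.crc32(chunk.encode("utf-8"))
        # Assumed discard rule: keep a hash only if its lowest `zero_bits`
        # bits are all zero, so roughly 1 in 2**zero_bits chunks survives.
        if h & ((1 << zero_bits) - 1) == 0:
            kept.add(h)
    return kept


def similarity(sig_a, sig_b):
    """Percentage of retained hashes that appear in both signatures."""
    union = sig_a | sig_b
    if not union:
        return 0  # nothing was retained, so nothing can be compared
    return 100 * len(sig_a & sig_b) // len(union)
```

Because only the retained hashes are compared, two files that differ only in discarded chunks would still score 100% in this sketch, which mirrors the caveat above.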
The output of the program might look like this:
    README and index.html: 5%
    README and makefile: 1%
    README and sherlock: 0%
    README and sherlock.c: 2%
    index.html and makefile: 8%
    index.html and sherlock: 0%
    index.html and sherlock.c: 10%
    makefile and sherlock: 0%
    makefile and sherlock.c: 6%
    sherlock and sherlock.c: 0%
(The threshold is normally 20%, so the above output would not ordinarily be shown. The example used a threshold lowered to 0% in order to see all similarities. The numbers are fake, however, and just for illustrational purposes.)
In the example, the most similar files are index.html and sherlock.c with a rating of 10%. This means that approximately 10% of the material in those two files might be overlapping (i.e. appear in both files).
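When there are many files, it can help to sort the results so the most similar pairs come first. The following is a small sketch that assumes each output line has the form "A and B: N%" as in the example listing; the names `parse_results` and `SAMPLE` are hypothetical, and file names that themselves contain " and " would need more careful handling.

```python
import re

# One result line, e.g. "index.html and sherlock.c: 10%"
LINE_RE = re.compile(r"^(.+) and (.+): (\d+)%$")

# A few lines copied from the (fake) example listing
SAMPLE = """\
README and index.html: 5%
index.html and makefile: 8%
index.html and sherlock.c: 10%
makefile and sherlock.c: 6%
"""


def parse_results(text):
    """Return (file_a, file_b, percent) tuples, most similar pair first."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # silently skip anything that is not a result line
            pairs.append((match.group(1), match.group(2), int(match.group(3))))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Calling `parse_results(open("results.txt").read())` on a redirected results file would then give the pairs in descending order of similarity.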
Loki combined the two programs into a single program, called Sherlock. This has some advantages and disadvantages.
The main advantage is that no intermediate files need be created. Intermediate files require disk space and a management strategy. For example, you need to decide what suffix the signature files will use (e.g. ".sig") and where they will be stored (with the data, in a parallel directory structure, or somewhere else).
Sherlock avoids these issues but requires all files to be compared every time you want a comparison. For example, if you are looking for similarities in essays handed in by students, Sherlock may work well because you should have all the essays handed in before you begin the plagiarism detection.
On the other hand, Sherlock is not so well suited to detecting duplicate email messages, since email arrives continually, and you can never have a finished set of email to work on. Intermediate ".sig" files might reduce the time needed to compare a new email item to older items, since the signatures for old items will already be computed and stored. Sherlock would have to read and compute the signatures for all of those older files, every time you need a comparison, because it does not use ".sig" files.
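The trade-off can be made concrete with a rough sketch of how a ".sig" cache might work. This is not something Sherlock does. The JSON storage format, the stand-in signature function, and the names `compute_signature` and `cached_signature` are all assumptions for illustration, with the signature file stored next to the data file.

```python
import json
import os
import zlib


def compute_signature(path):
    """Hash every 3-word chunk of the file (a stand-in for a real signature)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        words = f.read().split()
    hashes = {zlib.crc32(" ".join(words[i:i + 3]).encode("utf-8"))
              for i in range(len(words) - 2)}
    return sorted(hashes)


def cached_signature(path, suffix=".sig"):
    """Reuse `path + suffix` if it is at least as new as `path`, else rebuild it."""
    sig_path = path + suffix
    if (os.path.exists(sig_path)
            and os.path.getmtime(sig_path) >= os.path.getmtime(path)):
        with open(sig_path, encoding="utf-8") as f:
            return json.load(f)
    sig = compute_signature(path)
    with open(sig_path, "w", encoding="utf-8") as f:
        json.dump(sig, f)  # store next to the data file for later runs
    return sig
```

With a cache like this, only a newly arrived message needs its signature computed; older messages are read back from their ".sig" files, at the cost of the extra disk space and file management described above.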