Novel Way to Compare Documents With High Precision

by scienstein in Circuits > Tools

Intro.png

Document comparison is commonly used in document editing and review. Through redlining, a process that identifies the changes between two versions of the same document, the user can determine how the documents being compared differ. It is a common task in the legal and financial industries, and in academia for plagiarism detection. Thanks to the convenience of modern information technology, software is now one of the most common ways to perform document comparison. However, each program is designed differently and places its own requirements on the user, especially regarding the operating environment and familiarity with the software. In this instructable, we propose a novel way of comparing documents using a simple, reliable and economical technique.


Requirement:
1) Any personal computer (with internet access for first-time use)

Get Ready

step_0.png
step_1.png

ClustalW is a bioinformatics program commonly used for the sequence analysis of nucleic acids and proteins. It can easily be found with any internet search engine using the keyword "clustalw".

There are many ClustalW servers worldwide, but they are more or less the same. For the purpose of demonstration, the European server (EMBL-EBI) is used here. Offline versions are available for download from this server.

Limitations:
1) The program treats all terms/letters as capital letters, i.e. it cannot distinguish between capital and small letters.
2) The program cannot compare any characters other than A-Z, i.e. all symbols other than A-Z are omitted.
3) The results display the contents of each paragraph as a continuous string of terms/letters, i.e. all spaces between words are omitted.
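
To preview how a paragraph will look under the limitations above, the following minimal Python sketch may help. It is not part of ClustalW; the function name preview_as_clustalw is hypothetical, and the code simply assumes the behaviour described in the list (capitalising everything and dropping every character outside A-Z).

```python
import re


def preview_as_clustalw(text: str) -> str:
    """Approximate a paragraph after the three limitations listed above.

    Assumption: the program upper-cases the text and silently drops
    everything that is not a letter A-Z (digits, symbols, spaces).
    """
    upper = text.upper()                 # limitation 1: no upper/lower case distinction
    return re.sub(r"[^A-Z]", "", upper)  # limitations 2 and 3: only A-Z survives


if __name__ == "__main__":
    print(preview_as_clustalw("Hello, World! 2 times."))  # prints HELLOWORLDTIMES
```

Running this on both versions of a paragraph beforehand shows which differences (case, punctuation, digits, spacing) the comparison will not be able to see.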

Preparation of the Dossiers

step_2.png
step_3.png
step_4.png

Before proceeding with the document analysis, the user is strongly advised to prepare a dossier containing the text to be analysed. This can be done with a text editor or word processor, e.g. Notepad or MS Word.

The format is as follows:
1) Each paragraph to be compared is identified by a header, which starts with ">>" followed by a "NAME".
2) The "NAME" MUST BE continuous, without spaces.
3) The user may copy the contents to be compared under the header.
4) Spaces between the header and the contents are allowed.
5) Spaces and any characters within the contents are allowed.
6) Each counterpart for comparison may be placed below the contents of the previous paragraph, starting with a distinct new header.
7) The dossier must be saved in the TXT file format/extension.
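
The format rules above can also be applied by a short script instead of by hand. The following is a minimal Python sketch under that assumption; the function names build_dossier and save_dossier, the paragraph names and the file name are all hypothetical and chosen only for illustration.

```python
from pathlib import Path
from typing import Dict


def build_dossier(paragraphs: Dict[str, str]) -> str:
    """Return dossier text: one '>>NAME' header followed by its contents."""
    blocks = []
    for name, content in paragraphs.items():
        # Rule 2: the NAME must be continuous, without spaces.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"header name must be non-empty, without spaces: {name!r}")
        # Rules 1 and 3: header line first, then the contents under it.
        blocks.append(f">>{name}\n{content.strip()}\n")
    # Rule 6: each counterpart goes below the previous paragraph.
    return "\n".join(blocks)


def save_dossier(paragraphs: Dict[str, str], path: str) -> None:
    """Write the dossier as a plain-text file (rule 7: TXT extension)."""
    if not path.lower().endswith(".txt"):
        raise ValueError("the dossier must be saved with a .txt extension")
    Path(path).write_text(build_dossier(paragraphs), encoding="utf-8")


if __name__ == "__main__":
    sample = {
        "Version1": "The quick brown fox jumps over the lazy dog.",
        "Version2": "The quick brown fox jumped over the lazy dogs.",
    }
    print(build_dossier(sample))
    # save_dossier(sample, "my_dossier.txt")  # hypothetical file name
```

Python dictionaries keep insertion order, so the paragraphs appear in the dossier in the order they were added.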

Assign the Dossier to the Program

step_5a.png
step_5b.png
step_5c.png

Users who do not want to prepare a dossier may copy and paste the contents to be compared into the input box provided by the program. The pasted contents must still follow the format specified above.

Otherwise, users may point the system to their prepared dossier with the "Browse" button.

Before starting, make sure the sequence type is set to "Protein". The program starts analysing the contents once the "Submit" button is pressed.

Analysis of the Dossiers

step_6.png
step_7.png
Result.png

After the analysis, the results are displayed in a format in which all matching terms/letters are aligned against one another, whilst the non-matching terms/letters are set apart from the rest.

Legend:
Asterisk (*) means an identical term/letter
Dash (-) means a gap inserted to keep the remaining terms/letters aligned; it usually appears where one of the paragraphs contains extra terms/letters
Colon (:) means a strongly similar term/letter in its original biological meaning (can be ignored for the purpose of document comparison)
Period (.) means a weakly similar term/letter in its original biological meaning (can be ignored for the purpose of document comparison)

For a better visual experience, users may switch on colors for the terms/letters.
After the analysis, the similarity scores (%) of the paragraphs can be found in the "Result Summary" section of the ClustalW Results page.
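
For long paragraphs, the aligned rows can also be checked by script rather than by eye. The following minimal Python sketch assumes the two aligned rows have already been copied out of the results page and joined into two strings of equal length, with dashes marking the gaps. The function names find_differences and identity_percent are hypothetical, and the percentage is a simple share of identical positions, which is an assumption and not necessarily the formula behind the score ClustalW reports.

```python
from typing import List, Tuple


def find_differences(row_a: str, row_b: str) -> List[Tuple[int, str, str]]:
    """List (position, letter_a, letter_b) wherever the aligned rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # A dash in either row shows up here as a difference (extra letters).
    return [(i, a, b) for i, (a, b) in enumerate(zip(row_a, row_b)) if a != b]


def identity_percent(row_a: str, row_b: str) -> float:
    """Share of positions holding the same letter in both rows (assumed formula)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / len(row_a)


if __name__ == "__main__":
    first = "THEQUICKBROWNFOX"
    second = "THEQUICKBR-WNFOX"   # one letter missing, shown as a dash
    print(find_differences(first, second))   # [(10, 'O', '-')]
    print(identity_percent(first, second))   # 93.75
```

Positions are counted from zero along the aligned row, so they refer to the string without spaces and symbols, not to the original paragraph.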

Results

Result_1.png
Result_2.png

We ran the program on several paragraphs with similar contents, e.g. some with different characters in a particular phrase and others with extra symbols and spaces. The comparison worked well, and ClustalW recognised all differences in terms/letters correctly and precisely. However, differences in symbols and in capitalisation were not recognised. Additionally, some ClustalW servers (worldwide) may not be compatible with the dossier method.


Attachment:
*) test.txt, a sample test dossier, is enclosed

Summary


This tutorial offers a new way to compare the contents of documents, complementing common file comparison software. Furthermore, because the method is inexpensive, simple, flexible and easily accessible, it frees the user from a fixed PC operating environment: it can be used at any time and in any place, e.g. in a public library or a computer lab.


Acknowledgements
The Hong Kong Polytechnic University


References
  1. Document comparison (http://en.wikipedia.org/wiki/Document_comparison)
  2. EMBL-EBI: ClustalW (http://www.ebi.ac.uk/Tools/msa/clustalw2/)
  3. GenomeNet (http://www.genome.jp/tools-bin/clustalw)