Dikhtiarenko O. Information technology of matching approximate duplicates within the content of electronic documents

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0415U003874

Applicant for

Specialization

  • 05.13.06 - Information technologies

30-06-2015

Specialized Academic Board

Д 26.056.01

Kyiv National University of Construction and Architecture

Abstract

The thesis addresses the problem of plagiarism in scientific and other works, which is growing in scale as computer literacy rises and the Internet penetrates all spheres of life. The research develops new models and methods for finding fuzzy matches in the text, images, and tables of a document, even when the structure of the content has been modified. A conceptual model of the fuzzy-match detection technology, a document index model, and a fuzzy-match model were developed. For text preparation, the thesis proposes an approach to correcting errors in words and discarding stop-words and stop-phrases, together with canonization methods and replacement techniques for synonyms and antonyms. A method for building document indexes with locality-sensitive hashing and a method for filtering spurious matches were developed. For images, methods for fragmentation and for determining the reference rotation angle of an image are proposed. For tables, a method for identifying the table header and a way of indexing a table by columns and rows were developed. A document clustering method that uses word frequencies in the document as clustering features was also improved. Techniques for building extended document index models are proposed to speed up the search process. The architecture of a fuzzy-match detection system was developed. Implemented as software, the system can identify matches in all types of documents. The system can be used by higher education institutions, which should improve the level of specialist training, and by scientific periodicals, where it can help prevent fraud and the misappropriation of other people's work, as well as by other organizations.
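
The text pipeline summarized in the abstract (canonization, stop-word removal, locality-sensitive hashing, filtering of spurious matches) can be illustrated with a minimal sketch. The abstract does not say which hashing scheme the thesis uses, so the sketch assumes MinHash signatures with banding, one common form of locality-sensitive hashing. The stop-word list, the function names, and all parameters (shingle size, number of hashes, number of bands) are hypothetical and are not taken from the thesis.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical stop-word list; the thesis's own list is not given in the abstract.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}


def canonize(text):
    """Lowercase, strip punctuation, and drop stop-words (ASCII-only simplification)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def shingles(words, k=3):
    """Return the set of overlapping k-word shingles."""
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value per seed (assumed scheme)."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=16):
    """Group documents whose signatures agree on at least one whole band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used here to filter spurious candidates."""
    return len(a & b) / len(a | b)


def find_matches(docs, threshold=0.5):
    """Return candidate pairs whose exact shingle similarity reaches the threshold."""
    sets = {doc_id: shingles(canonize(text)) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s) for doc_id, s in sets.items()}
    return {
        pair for pair in lsh_candidates(sigs)
        if jaccard(sets[pair[0]], sets[pair[1]]) >= threshold
    }
```

Calling `find_matches` on a dictionary that maps document identifiers to text returns the pairs that survive both stages. The banding step only nominates candidates cheaply, and the exact Jaccard check stands in for the thesis's filtering of spurious matches, whose actual criterion the abstract does not describe.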
