Kuropiatnyk O. Constructive-synthesizing models of natural language texts for text borrowings detection in structured documents


Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0420U100992

Applicant for

Specialization

  • 01.05.02 - Mathematical modeling and computational methods

22-06-2020

Specialized Academic Board

Д 08.084.01

National Metallurgical Academy of Ukraine

Abstract

The dissertation addresses the relevant applied scientific problem of developing models of natural language texts for detecting borrowings in structured documents. It reviews and analyzes existing models of natural languages, methods for processing language constructions (texts), and methods for comparing natural language texts. Most models do not take into account the model of the performer and its features, and ignore human thinking processes. Most text preprocessing methods do not allow the input data to be restored after processing. Methods based on simple comparisons (fingerprints, greedy comparisons, etc.) are sensitive to mechanisms for disguising borrowings. Existing software, including software built on these methods and algorithms, does not take into account the structure of the document to which the text belongs when checking for borrowings. The analysis of the current state and development trends of methods for formalizing and comparing linguistic constructions shows the need to develop effective models and methods of natural language texts for detecting borrowings in structured documents. Constructive-synthesizing modeling (CSM), based on the apparatus of formal languages and grammars, together with graph theory, set theory, and regression analysis, was used to develop the models of natural language texts and the methods for processing them. Within the CSM framework, constructors and methods of their transformation (specialization, concretization, interpretation, and realization) are used. The processes by which humans form images were formalized by means of object-oriented modeling. This made it possible to construct a hierarchy of images based on the commonality of attributes in order to represent the meaning of a word and to reflect its connection with the objects of reality within the concept of word semantics.
This hierarchy was used in constructing the constructive-synthesizing model of language. Constructive-synthesizing and object-oriented models of natural language and text, a structured document model, and a model of the process of disguising borrowed text were developed. Based on the developed models of language, language constructions (texts), and their graph representation, a method and algorithms are proposed for comparing text fragments and structured documents to detect borrowings. Computer implementations of the graph representation model of text and of the disguise processes were created. A compression method for the weighted text graph was developed to improve the performance of the computer implementation of the graph representation model, which made it possible to use the object serialization mechanism to form a database of structured documents. These implementations constitute software for detecting borrowings in text fragments and structured documents and for automated generation of tests that check the ability of anti-plagiarism systems to unmask borrowings. The developed models and tools form a single complex, which covers: language and speech (the creation of language constructions); the lexical, syntactic, and semantic components of language constructions; and the processes of disguising and detecting borrowed text. The time efficiency of the implementation of the graph representation model of text was investigated. The check time for one document ranged from 11 to 65 s for a database of 0.6 to 3.8 million characters. Restoring the graphs took about 94% of the time. The effect of masking borrowings on the increase in document originality was about 0.007%. The functional effectiveness metrics of the developed software for detecting borrowings in unstructured text documents were compared with those of a counterpart (WCopyfind). The difference does not exceed 5%. The factors that cause the difference in the performance of the programs have been identified.
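The idea of comparing texts through a weighted graph representation can be illustrated with a minimal sketch. The abstract does not specify the structure of the dissertation's graph, so everything here is an assumption made for illustration: nodes are lowercased words, a directed edge joins each pair of consecutive words, and the edge weight is the number of times that pair occurs. The function names (tokenize, build_word_graph, borrowed_share) are hypothetical and are not taken from the developed software.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (assumed preprocessing)."""
    return re.findall(r"\w+", text.lower())


def build_word_graph(text):
    """Build a weighted directed graph as a Counter of edges.

    Each key is a (word, next_word) pair; each value is the number of
    times that pair occurs in the text. This is an illustrative
    structure, not the graph model defined in the dissertation.
    """
    words = tokenize(text)
    return Counter(zip(words, words[1:]))


def borrowed_share(candidate, source):
    """Return the fraction of the candidate's edge weight also present in the source."""
    cand = build_word_graph(candidate)
    src = build_word_graph(source)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # A Counter returns 0 for a missing edge, so unmatched edges add nothing.
    shared = sum(min(weight, src[edge]) for edge, weight in cand.items())
    return shared / total


if __name__ == "__main__":
    source = "the cat sat on the mat"
    candidate = "the cat sat on the rug"
    print(borrowed_share(candidate, source))
```

In this sketch, four of the candidate's five edges also occur in the source, so the shared fraction is 0.8. A check that respects document structure could apply the same comparison section by section, but that extension is likewise an assumption and is not described in the abstract.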
The proposed model of the borrowing disguise process makes it possible to formalize masking scenarios, to create a platform for modeling new text changes, and to automate the construction of tests for anti-plagiarism systems. The integrated use of the obtained results makes it possible to perform automated borrowing checks of text fragments and structured documents, to test anti-plagiarism systems, and to continually expand the test base by building new disguise scenarios. The developed software makes it possible to fulfill the requirement of the Law of Ukraine "On Higher Education" regarding the detection of academic plagiarism in the diploma works of students in the specialty 121 "Software Engineering" at Dnipro National University of Railway Transport named after Acad. V. Lazaryan.
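One way to picture a formalized disguise scenario is as an ordered list of text transformations applied to a source fragment, with each scenario yielding one test case for an anti-plagiarism system. The sketch below rests on assumptions: the two transformations (synonym replacement and substitution of Latin letters with Cyrillic look-alikes), the tiny dictionaries, and all names (replace_synonyms, swap_homoglyphs, apply_scenario, generate_tests) are hypothetical examples, not the scenarios or software described in the dissertation.

```python
from typing import Callable, Dict, List

Transform = Callable[[str], str]

# Assumed example data: Latin letters mapped to Cyrillic look-alikes.
HOMOGLYPHS = str.maketrans({"a": "\u0430", "e": "\u0435", "o": "\u043e"})
# Assumed example data: a toy synonym dictionary.
SYNONYMS = {"big": "large", "quick": "fast"}


def replace_synonyms(text: str) -> str:
    """Replace each space-separated word that has a known synonym."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split(" "))


def swap_homoglyphs(text: str) -> str:
    """Replace selected Latin letters with visually similar Cyrillic ones."""
    return text.translate(HOMOGLYPHS)


def apply_scenario(text: str, scenario: List[Transform]) -> str:
    """Apply the transformations of a scenario in order."""
    for step in scenario:
        text = step(text)
    return text


def generate_tests(source: str, scenarios: Dict[str, List[Transform]]) -> Dict[str, str]:
    """Produce one disguised text per named scenario."""
    return {name: apply_scenario(source, steps) for name, steps in scenarios.items()}


if __name__ == "__main__":
    scenarios = {
        "synonyms": [replace_synonyms],
        "synonyms+homoglyphs": [replace_synonyms, swap_homoglyphs],
    }
    for name, disguised in generate_tests("a big dog", scenarios).items():
        print(name, repr(disguised))
```

Keeping each transformation as a separate function means a new scenario is just a new list, which is one simple way a test base could grow over time.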
