Todoriko O. Models and methods of cleaning and integration of text data in information systems

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0416U003873

Applicant for

Specialization

  • 05.13.06 - Information technologies

30-06-2016

Specialized Academic Board

Д 64.050.07

National Technical University "Kharkiv Polytechnic Institute"

Abstract

The object of study is the processes of data cleaning and integration in information reference systems and information retrieval systems.

The purpose of the research is to improve the technology of cleaning and integrating text data in information reference systems and information retrieval systems by using a model of the inflectional paradigm and methods of building a lexeme index to organize tolerant retrieval.

Research methods: mathematical modeling; object-oriented software analysis and synthesis using the Unified Modeling Language; construction of linear and neural classifiers; probability theory and statistical analysis of experimental data.

Theoretical and practical results: software in the form of a Java class library for lexical similarity search and integration of data sets.

Scientific novelty of the results.

Developed for the first time:

  • a model of the inflectional paradigm, which differs from existing models in how it represents words and calculates an approximate measure of similarity between the representations; the model accounts for word forms and for the relationships between characters within words in a special way, which provides the basis for constructing a lexeme index and for implementing methods of similarity search, cleaning, and integration of data sets;
  • a method of constructing a lexeme index which, unlike known analogues, has a reduced number of entries because all forms of one word are mapped to a single index entry; this allows pre-filtering to reduce the amount of expensive similarity computation between the sample and all word forms.

Improved: the method of tolerant retrieval of text in reference and search information systems, through the use of the inflectional paradigm model and the lexeme index, which improves the precision and recall of the pre-filtering.

Further developed: the information technology of cleaning and integrating data sets, in which the improved method of tolerant retrieval simplifies the calculation of the similarity measure.

Degree of implementation: the results of the thesis are applied in practice by the admissions committee for data cleaning in the "System of registration of applicants" of the State institution of higher education «ZNU» and to link the records of that system with the system of online application for admission to the "United state electronic database on education" of the MES of Ukraine; they are also used in the educational process at the department of information technologies of the State institution of higher education «ZNU».

Field of application: cleaning and integration of data in information systems.
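
The abstract does not specify the inflectional paradigm model or the similarity measure, so the following is only a minimal sketch of the general idea: one index entry per lexeme, a cheap pre-filter, then a costlier similarity check on the surviving candidates. Every name in it is hypothetical. `naive_stem` is a crude suffix stripper standing in for the paradigm model, the character-bigram overlap stands in for the pre-filter, and `difflib.SequenceMatcher` stands in for the similarity measure. The thesis library is written in Java; Python is used here only for brevity.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Assumed stand-in for the inflectional paradigm: strip one common suffix."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigrams(text):
    """Character bigrams of the text, padded with '$' at both ends."""
    padded = f"${text}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


class LexemeIndex:
    """Toy index in which all forms of a word share a single entry."""

    def __init__(self):
        self.forms = defaultdict(set)      # lexeme key -> observed word forms
        self.postings = defaultdict(set)   # bigram -> lexeme keys containing it

    def add(self, word):
        key = naive_stem(word)
        self.forms[key].add(word.lower())
        for gram in bigrams(key):
            self.postings[gram].add(key)

    def search(self, query, min_shared=2, threshold=0.7):
        key = naive_stem(query)
        # Pre-filter: count bigrams shared between the query and each entry.
        shared = defaultdict(int)
        for gram in bigrams(key):
            for candidate in self.postings.get(gram, ()):
                shared[candidate] += 1
        # Costlier similarity is computed only for entries that pass the filter.
        results = []
        for candidate, count in shared.items():
            if count < min_shared:
                continue
            score = SequenceMatcher(None, key, candidate).ratio()
            if score >= threshold:
                results.append((candidate, score, sorted(self.forms[candidate])))
        return sorted(results, key=lambda item: -item[1])


index = LexemeIndex()
for form in ["clean", "cleaned", "cleaning", "cleans", "integration", "index"]:
    index.add(form)
matches = index.search("claening")  # misspelled query
```

In this sketch the similarity measure runs once per lexeme that passes the filter rather than once per word form, which is the kind of saving the abstract attributes to the lexeme index. The thresholds `min_shared` and `threshold` are arbitrary illustration values.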
