Todoriko O. Models and methods of cleaning and integration of text data in information systems


Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0416U003873

Thesis Registration Form

0416U003873.pdf

Applicant

Todoriko Olha Oleksiivna

Specialization

05.13.06 - Information technologies

Date of defense

30-06-2016

Specialized Academic Board

Д 64.050.07

National Technical University "Kharkiv Polytechnic Institute"

Abstract

Object of study: the processes of cleaning and integrating data in information reference systems and information retrieval systems.
Purpose of the research: to improve the technology for cleaning and integrating text data in information reference systems and information retrieval systems by using a model of the inflectional paradigm and methods for building a lexeme index when organizing tolerant retrieval.
Research methods: mathematical modeling; object-oriented software analysis and synthesis using the Unified Modeling Language; construction of linear and neural classifiers; probability theory and statistical analysis of experimental data.
Theoretical and practical results: software, in the form of a Java class library, for lexical similarity search and for the integration of data sets.
Scientific novelty of the results. For the first time:
- a model of the inflectional paradigm was developed that differs from existing ones in how it represents words and calculates an approximate measure of similarity between the representations; the model accounts in a special way for word forms and for the relations between characters within words, which provides the basis for constructing a lexeme index and for implementing methods of similarity search, cleaning, and integration of data sets;
- a method of constructing a lexeme index was developed that, unlike known analogues, has a reduced number of entries because all forms of one word are mapped to a single index entry; this allows pre-filtering to reduce the number of expensive similarity computations between the sample and all word forms.
Improved: the method of tolerant retrieval of text in information reference systems and information retrieval systems, through the use of the model of the inflectional paradigm and the lexeme index, which improves the accuracy and completeness of the pre-filtering.
Further developed: the information technology of cleaning and integrating data sets, which, by relying on the improved method of tolerant retrieval, simplifies the calculation of the similarity measure.
Degree of implementation: the results of the thesis are used in practice by the entrance committee of the State institution of higher education «ZNU» for data cleaning in the "System of registration of applicants" and for linking records of that system with the online admission application system of the "United state electronic database on education" of the MES of Ukraine; they are also used in the educational process at the department of information technologies of the State institution of higher education «ZNU».
Field of application: cleaning and integration of data in information systems.
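
The idea of a lexeme index with pre-filtering can be illustrated with a minimal Java sketch. It is not the thesis's model or library: the class name LexemeIndexSketch, the bigram Dice coefficient used as the cheap pre-filter, and the Levenshtein distance used as the expensive similarity measure are all assumptions chosen for illustration. The sketch only shows the general pattern the abstract describes: all forms of a word share one index entry, a cheap comparison against the entry key discards most entries, and the expensive measure is computed only for the forms of the entries that remain.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names and measures are assumptions, not the thesis's model. */
class LexemeIndexSketch {
    // One entry per lexeme; every inflected form of the word is stored under that entry.
    private final Map<String, Set<String>> formsByLexeme = new LinkedHashMap<>();

    void add(String lexeme, String... forms) {
        String key = lexeme.toLowerCase();
        Set<String> entry = formsByLexeme.computeIfAbsent(key, k -> new LinkedHashSet<>());
        entry.add(key);
        for (String form : forms) {
            entry.add(form.toLowerCase());
        }
    }

    int size() {
        return formsByLexeme.size();
    }

    // Cheap pre-filter (assumed): Dice coefficient over sets of character bigrams.
    static double bigramDice(String a, String b) {
        Set<String> x = bigrams(a);
        Set<String> y = bigrams(b);
        if (x.isEmpty() || y.isEmpty()) {
            return a.equals(b) ? 1.0 : 0.0;
        }
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        return 2.0 * common.size() / (x.size() + y.size());
    }

    private static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // Expensive measure (assumed): Levenshtein edit distance, two-row dynamic programming.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Returns the lexemes that have at least one form within maxDistance of the query.
    List<String> search(String query, double minDice, int maxDistance) {
        String q = query.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : formsByLexeme.entrySet()) {
            if (bigramDice(q, entry.getKey()) < minDice) {
                continue; // pre-filter: one cheap comparison per entry, not per form
            }
            for (String form : entry.getValue()) {
                if (levenshtein(q, form) <= maxDistance) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        LexemeIndexSketch index = new LexemeIndexSketch();
        index.add("student", "students", "student's");
        index.add("teacher", "teachers");
        // The misspelled query matches the form "students", so the lexeme "student" is returned.
        System.out.println(index.search("studnets", 0.3, 2)); // prints [student]
    }
}
```

In this sketch the index holds two entries for five word forms, and the query "studnets" is compared with the expensive measure only against the forms of "student"; the "teacher" entry is rejected by the pre-filter because it shares no bigrams with the query. The thresholds 0.3 and 2 are arbitrary illustrative values.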

Thesis supervisor

Gomenyuk, Sergiy Ivanovich

Official opponents

Shostak Ihor Volodymyrovych
Khairova Nina Feliksivna
Sharonova Nataliia Valeriivna

Files

aref.doc

dis.pdf

Similar theses

0424U000107

Oleksandra V. Kovyrova

Models and instrumental tools of express diagnostics for using in the biology and medicine

0524U000111

Tetiana A. Honcharenko

Methodological foundations of the formation of a unified information environment for the automation of object-spatial systems in construction projects

0524U000108

Ihor M. Liakh

Methodological foundations of information technology for gene expression data processing and its application in the field of bioinformatics

0524U000074

Andrii Shyshatskyi

Intelligent methods of managing interference protection of radio communication systems under conditions of destabilizing influences

0524U000069

Gnatchuk Elizaveta Gennadievna

Theoretical and applied principles of information technology for supporting medical decision-making considering the civil law grounds