Serheiev D. Natural language texts processing technology based on the integrational approach

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0419U004383

Thesis Registration Form

0419U004383.pdf

Applicant for

Serheiev Danylo Serhiyovych

Specialization

05.13.06 - Information technologies

Date of defense

04-10-2019

Specialized Academic Board

Д 26.002.29

National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

Abstract

In recent years, research in natural language processing (NLP) has achieved significant practical results, including natural-language voice user interfaces for mobile devices and substantial progress in machine translation, handwriting recognition and voice recognition. At the same time, improving the performance of these systems remains a relevant task. This study focuses on developing an information technology for processing natural language texts based on the integrational approach, with the aim of increasing the efficiency of natural language processing technologies. The subject of the research is the models, methods, algorithms and information technologies of natural language processing. An analysis of current problems in the field shows that applied natural language processing technologies succeed at the specific tasks they are designed for, but that there is room for improvement in solving complex problems, in particular machine translation and natural language search. Knowledge bases are identified as a necessary component of natural language processing technologies, enabling the interaction of different systems. Existing approaches to developing natural-language knowledge bases are characterized and analyzed. It is concluded that the existing technologies behind natural-language knowledge bases individually achieve high levels of completeness, consistency and flexibility for practical purposes, but that no single technology combines high scores on all of these qualities. A formal model of knowledge representation for a natural-language knowledge base is created, including models of its main elements: the quantum of knowledge, which is the smallest element of knowledge, and the relation objects that describe connections between quanta of knowledge.
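The abstract does not give the formal definitions, so the following is only a minimal sketch of how a quantum of knowledge and relation objects might be held in memory. All class names, fields and the `is_a` relation label are assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Quantum:
    """Smallest element of knowledge (fields are hypothetical)."""
    id: int
    text: str


@dataclass(frozen=True)
class Relation:
    """Directed, labelled connection between two quanta (hypothetical)."""
    source: int
    target: int
    kind: str


@dataclass
class KnowledgeBase:
    """Toy container: quanta indexed by id plus a flat list of relations."""
    quanta: Dict[int, Quantum] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_quantum(self, text: str) -> Quantum:
        # Ids are assigned sequentially; quanta are never deleted in this sketch.
        quantum = Quantum(id=len(self.quanta), text=text)
        self.quanta[quantum.id] = quantum
        return quantum

    def relate(self, source: Quantum, target: Quantum, kind: str) -> Relation:
        relation = Relation(source.id, target.id, kind)
        self.relations.append(relation)
        return relation

    def neighbours(self, quantum: Quantum, kind: Optional[str] = None) -> List[Quantum]:
        # Follow outgoing relations, optionally filtered by relation kind.
        return [
            self.quanta[r.target]
            for r in self.relations
            if r.source == quantum.id and (kind is None or r.kind == kind)
        ]


kb = KnowledgeBase()
cat = kb.add_quantum("cat")
animal = kb.add_quantum("animal")
kb.relate(cat, animal, "is_a")
print([q.text for q in kb.neighbours(cat, kind="is_a")])
```

A flat relation list keeps the sketch short; a real knowledge base would presumably index relations for faster lookup.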
A method for processing natural language texts based on this knowledge model is developed, including procedures for using the information technology in applied natural language processing problems. On the basis of the created models and the method, procedures for writing and searching natural language knowledge are developed; they make it possible to establish links at the structural level between the syntactic structure of a text and an arbitrary metadata structure. It is theoretically shown that the complexity of natural language search using the developed procedures does not exceed that of analogous approaches and is on average lower for complex search queries. Examples are provided of using the developed information technology in practical problems, namely natural language search and machine translation. Writing and searching methods are created based on the knowledge representation model, making it possible to establish links at the structural level between the syntactic structure of a text and an arbitrary metadata structure in natural language processing technologies. An information technology for processing natural language texts based on the integrational approach is developed, for which it is theoretically proven that the search complexity does not exceed that of the existing alternatives and is on average 5-12% lower for complex search queries. The subsystems and operations of such a system are defined, and a database schema is developed. The computational complexity of natural language knowledge search in the information system is analyzed and compared with the existing alternatives. Experimental testing of the information system is conducted and the acquired data are analyzed, demonstrating increased relevance of natural language search results.
As part of this work, an information technology for processing natural language texts has been developed on the basis of the integrational approach. Experimental measurements of the relevance of natural language search results show that the developed information technology can increase the relevance of search results. Specifically, relevance increased by 14% on average across the whole set of experimental queries and search results; no significant increase was detected for the top quartile of results sorted by original relevance, and a major increase was detected for the lower quartile of the original results. The information technology for processing natural language texts can be used to improve the performance of various natural language processing technologies, in particular natural language search systems, machine translation systems and natural language user interfaces.
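The abstract does not state how the average and per-quartile relevance changes were computed. One plausible reading, sketched below under that assumption, is to sort results by their original relevance, split them into quartiles, and average the relative change in each. The function name and the sample scores are invented for illustration and are not the thesis data.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def relative_gain_by_quartile(
    pairs: Sequence[Tuple[float, float]],
) -> Tuple[float, List[float]]:
    """Return (overall mean relative gain, gains per quartile).

    `pairs` holds (original, new) relevance scores with original > 0.
    Quartiles are ordered from lowest to highest original relevance.
    """
    if len(pairs) < 4:
        raise ValueError("need at least four results to form quartiles")
    ordered = sorted(pairs, key=lambda p: p[0])
    gains = [(new - old) / old for old, new in ordered]
    n = len(gains)
    # Integer boundaries keep every quartile non-empty when n >= 4.
    bounds = [i * n // 4 for i in range(5)]
    per_quartile = [mean(gains[bounds[i]:bounds[i + 1]]) for i in range(4)]
    return mean(gains), per_quartile


# Made-up scores: low-relevance results improve, high-relevance ones do not.
sample = [(1.0, 2.0), (2.0, 3.0), (4.0, 4.0), (5.0, 5.0)]
overall, by_quartile = relative_gain_by_quartile(sample)
print(overall, by_quartile)
```

Relative change is used here because the abstract reports a percentage; an absolute difference in scores would be an equally reasonable choice.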

Thesis supervisor

Kyslenko Yuri I.

Official opponents

Barmak Oleksandr V.
Lande Dmytro V.

Files

Sergeiev_disser_2019.09.03.pdf

autoreferat--Sergeiev_aref_2019.09.02.pdf

Similar theses

0524U000111

Tetiana A. Honcharenko

Methodological foundations of the formation of a unified information environment for the automation of object-spatial systems in construction projects

0524U000108

Ihor M. Liakh

Methodological foundations of information technology for gene expression data processing and its application in the field of bioinformatics

0524U000074

Andrii Shyshatskyi

Intelligent methods of managing interference protection of radio communication systems under conditions of destabilizing influences

0524U000069

Gnatchuk Elizaveta Gennadievna

Theoretical and applied principles of information technology for supporting medical decision-making considering the civil law grounds

0424U000041

Andrii V. Shokarev

Information and hardware support for eliminating tilting of multistory buildings