Serheiev D. Natural language texts processing technology based on the integrational approach

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0419U004383

Applicant for

Specialization

  • 05.13.06 - Information technologies

04-10-2019

Specialized Academic Board

Д 26.002.29

National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

Abstract

In recent years, research in natural language processing (NLP) has achieved significant practical results, including natural-language voice user interfaces for mobile devices and substantial progress in machine translation, handwriting recognition and voice recognition. At the same time, improving the performance of these systems remains a relevant task. This study develops an information technology for processing natural language texts based on the integrational approach, aimed at increasing the efficiency of natural language processing technologies. The subject of the research is models, methods, algorithms and information technologies for natural language processing. An analysis of current problems in the field shows that applied natural language processing technologies succeed at their specific intended tasks, but that there is room for improvement in solving complex problems, in particular machine translation and natural language search. Knowledge bases are identified as a necessary component of natural language processing technologies, because they enable different systems to interact. Existing approaches to developing natural-language knowledge bases are characterized and analyzed. It is concluded that existing technologies for natural-language knowledge bases individually achieve high levels of completeness, consistency or flexibility for practical purposes, but that no single technology scores highly on all three qualities. A formal model of knowledge representation for a natural-language knowledge base is created, including models of its main elements: the quantum of knowledge, which is the smallest element of knowledge, and the relation objects that describe connections between quanta of knowledge.
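The knowledge model named above (quanta of knowledge connected by relation objects) can be pictured with a minimal sketch. This is only an illustration under assumptions: the abstract does not give the formal model, so the class names, fields and the `neighbours` lookup below are hypothetical and are not the thesis's actual definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Quantum:
    """Smallest element of knowledge (fields are hypothetical)."""
    qid: int
    text: str


@dataclass(frozen=True)
class Relation:
    """Directed, labelled connection between two quanta (hypothetical)."""
    source: int
    target: int
    label: str


@dataclass
class KnowledgeBase:
    """Toy store: quanta indexed by id, relations kept in insertion order."""
    quanta: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_quantum(self, qid, text):
        quantum = Quantum(qid, text)
        self.quanta[qid] = quantum
        return quantum

    def relate(self, source, target, label):
        # Both endpoints must exist before they can be connected.
        if source not in self.quanta or target not in self.quanta:
            raise KeyError("both quanta must be added before relating them")
        relation = Relation(source, target, label)
        self.relations.append(relation)
        return relation

    def neighbours(self, qid, label=None):
        """Quanta reachable from qid in one step, optionally filtered by label."""
        return [
            self.quanta[r.target]
            for r in self.relations
            if r.source == qid and (label is None or r.label == label)
        ]


# Illustrative content only; not taken from the thesis.
kb = KnowledgeBase()
kb.add_quantum(1, "cat")
kb.add_quantum(2, "animal")
kb.add_quantum(3, "milk")
kb.relate(1, 2, "is_a")
kb.relate(1, 3, "drinks")
```

Keeping relations as separate objects, instead of embedding them in the quanta, is one way to let arbitrary metadata be attached to a connection without changing the quanta themselves.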
A method for processing natural language texts based on this knowledge model is developed, including procedures for using the information technology in applied natural language processing problems. On the basis of the created models and the method, procedures for writing and searching natural language knowledge are developed; they make it possible to establish structural-level links between the syntactic structure of the text and an arbitrary metadata structure. It is shown theoretically that the complexity of natural-language search using the developed procedures does not exceed that of the existing alternatives and is on average 5-12% lower for complex search queries. Examples are provided of using the developed information technology in practical problems, namely natural language search and machine translation. The subsystems and operations of the resulting information system are defined, and a database schema is developed. The computational complexity of natural language knowledge search in the information system is analyzed and compared with the existing alternatives. The information system is tested experimentally, and analysis of the acquired data demonstrates increased relevance of natural language search results.
In this work, an information technology for processing natural language texts has been developed on the basis of the integrational approach. Experimental measurements of the relevance of natural language search results show that the developed technology can increase the relevance of search results. Specifically, relevance increased by 14% on average over the whole set of experimental queries and search results; no significant increase was detected for the top quartile of results ranked by original relevance, while a major increase was detected for the bottom quartile. The technology can be used to improve the performance of various natural language processing technologies, in particular natural language search systems, machine translation systems and natural language user interfaces.
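The per-quartile comparison of relevance described above could be computed along the following lines. This is a sketch under assumptions: the abstract does not state how relevance was scored or aggregated, so the function name, the use of mean relative change, and the sample scores are all hypothetical and are not the thesis's data.

```python
from statistics import mean


def relative_gain_by_quartile(baseline, improved):
    """Mean relative change in relevance per quartile of baseline score.

    Pairs are ranked by baseline relevance, lowest first, so index 0 is the
    bottom quartile and index 3 is the top quartile. Baseline scores must be
    non-zero because the change is expressed relative to them.
    """
    if len(baseline) != len(improved):
        raise ValueError("score lists must have the same length")
    pairs = sorted(zip(baseline, improved), key=lambda pair: pair[0])
    n = len(pairs)
    if n < 4:
        raise ValueError("need at least four results to form quartiles")
    gains = []
    for k in range(4):
        chunk = pairs[k * n // 4:(k + 1) * n // 4]
        gains.append(mean([(new - old) / old for old, new in chunk]))
    return gains


# Made-up scores for illustration only.
baseline_scores = [0.2, 0.4, 0.5, 0.8]
improved_scores = [0.3, 0.5, 0.55, 0.8]
gains = relative_gain_by_quartile(baseline_scores, improved_scores)
```

With these made-up scores the bottom quartile shows the largest relative gain and the top quartile shows none, which is the qualitative pattern the abstract reports.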
