Lupei M. Information technology of analysis and determination of author's and stylistic affiliation of Ukrainian-language texts

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0421U101633

Applicant for

Specialization

  • 05.13.06 - Information Technologies

26-04-2021

Specialized Academic Board

Д 35.101.01

Ukrainian Academy of Printing

Abstract

The dissertation addresses the scientific and applied problem of creating an information technology for analyzing and determining the authorship and stylistic affiliation of Ukrainian-language texts. The work proceeds through several stages of data processing. First, the characteristics of Ukrainian-language texts of different styles are analyzed, together with the features of Ukrainian grammar that govern the formation of word endings, which is necessary for the preprocessing stage. Preprocessing uses a stemming method specially adapted to the Ukrainian language. At the next stage, existing methods of vectorizing Ukrainian-language text are analyzed; three are singled out: a method based on hash functions, a method based on inverse document frequency, and a method based on document frequency. Classification was carried out using the different vectorization methods in combination with different types of machine learning, and the best combination was identified for each type of task studied. The choice of machine learning methods for analyzing and determining the authorship and stylistic affiliation of Ukrainian-language texts is justified; these include multilayer perceptron (MLP) neural networks in various architectures and the support vector methods SVR and SVC. Based on this comparison of vectorization and classification methods, an information technology has been developed for analyzing and determining the authorship and stylistic affiliation of Ukrainian-language texts. As a result, the advantages of machine learning methods and of their use in building such an information technology are identified.
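The preprocessing and vectorization stages described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the suffix list, the function names (`stem`, `tokenize`, `count_vector`, `tfidf_vectors`, `hash_vector`), the smoothed IDF formula, and the use of MD5 as the hash are all assumptions chosen for illustration, and raw term counts stand in for the frequency-based method.

```python
import hashlib
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list for illustration only;
# a usable Ukrainian stemmer needs a much larger rule set.
SUFFIXES = ("ами", "ями", "ого", "ому", "ий", "ій", "ою",
            "а", "я", "у", "ю", "и", "і", "е", "о")


def stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into runs of letters, and stem each word."""
    return [stem(w) for w in re.findall(r"[^\W\d_]+", text.lower())]


def count_vector(tokens, vocab):
    """Frequency-based vector: how often each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_vectors(docs, vocab):
    """Weight term counts by a smoothed inverse document frequency."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    return [[c * idf[t] for c, t in zip(count_vector(doc, vocab), vocab)]
            for doc in docs]


def hash_vector(tokens, n_features=16):
    """Hashing-based vector: no vocabulary, each token maps to a bucket."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec


docs = [tokenize("Книга лежить"), tokenize("Книги нові")]
vocab = sorted({t for d in docs for t in d})
print(count_vector(docs[0], vocab))
print(tfidf_vectors(docs, vocab)[0])
print(hash_vector(docs[0], n_features=8))
```

In this sketch the stemmer maps "книга" and "книги" to the same stem, so both documents share one vocabulary entry. The hashing variant trades that explicit vocabulary for a fixed vector length, at the cost of possible bucket collisions.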
Text vectorization methods were selected and tested within the information technology for analyzing Ukrainian texts, in combination with different types of machine learning; the best results were obtained with vectorization based on inverse document frequency and hash functions. The method of classifying textual information with a multilayer perceptron has been improved through specialized training and regularization procedures, which reduces decision-making time without loss of accuracy.
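The abstract does not say which training or regularization procedures were applied to the multilayer perceptron. As one possible reading, the sketch below trains a one-hidden-layer perceptron with L2 weight decay on toy bag-of-words vectors. The class name `TinyMLP`, the hyperparameters, and the toy data are hypothetical and are not taken from the thesis.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden tanh layer, one sigmoid output, SGD with L2 weight decay."""

    def __init__(self, n_in, n_hidden, l2=1e-3, lr=0.1, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0
        self.l2, self.lr = l2, lr

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, p

    def predict(self, x):
        return self.forward(x)[1]

    def loss(self, X, y):
        """Mean cross-entropy on the data (the L2 penalty is not included)."""
        eps = 1e-12
        total = 0.0
        for x, t in zip(X, y):
            p = self.predict(x)
            total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        return total / len(X)

    def train_step(self, x, t):
        h, p = self.forward(x)
        d_out = p - t  # gradient of cross-entropy w.r.t. the output pre-activation
        # Hidden-layer gradients are computed before w2 is updated.
        d_hid = [d_out * w * (1 - hi * hi) for w, hi in zip(self.w2, h)]
        for j, hi in enumerate(h):
            self.w2[j] -= self.lr * (d_out * hi + self.l2 * self.w2[j])
        self.b2 -= self.lr * d_out
        for j, row in enumerate(self.w1):
            for i, xi in enumerate(x):
                row[i] -= self.lr * (d_hid[j] * xi + self.l2 * row[i])
            self.b1[j] -= self.lr * d_hid[j]

    def fit(self, X, y, epochs=300):
        for _ in range(epochs):
            for x, t in zip(X, y):
                self.train_step(x, t)


# Toy term-count vectors for two hypothetical classes (e.g. two authors).
X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
y = [0, 0, 1, 1]
model = TinyMLP(n_in=3, n_hidden=4)
loss_before = model.loss(X, y)
model.fit(X, y)
loss_after = model.loss(X, y)
print(loss_before, loss_after)
```

The `l2` term shrinks every weight slightly at each update, which is one common way to regularize a perceptron. Other procedures, such as early stopping or dropout, would fit the abstract's wording equally well.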
