Vysotska V. Analysis and synthesis of computational linguistic systems for processing Ukrainian textual content

Thesis for the degree of Doctor of Science (DSc)

State registration number

0523U100152

Applicant for

Specialization

  • 10.02.21 - Structural, applied and mathematical linguistics

14-09-2023

Specialized Academic Board

Д 35.052.05

Lviv Polytechnic National University

Abstract

The dissertation solves an important scientific and applied problem: the analysis and synthesis of computational linguistic systems (CLS) for processing Ukrainian-language textual content. The solution rests on developing new models, methods and tools for natural language processing (NLP) and improving existing ones. The analysis and synthesis of CLS draws on linguistic analysis of Ukrainian-language textual content, intelligent processing of the content text flow, machine learning on reliable data, and statistical analysis to find patterns in the occurrence of linguistic events. Unlike existing approaches, the developed information technology (IT) for processing Ukrainian-language textual content supports the modularity principle of a typical CLS architecture for a specific NLP task and analyses a set of system effectiveness parameters and metrics in line with the behaviour of the target audience. The general structure of a CLS for processing Ukrainian-language text content was developed, together with a conceptual model of how a typical CLS functions, based on modelling the interaction of the system's main processes and components. This made it possible to improve the IT for intelligent analysis of the text flow based on the processing of information resources. Examples are given of CLS developed for processing Ukrainian-language textual content in relevant NLP tasks; they operate on the developed and improved models, methods and algorithms. An improved model of linguistic processing of textual content is introduced, based on graphemic, morphological, lexical, syntactic, semantic, structural, ontological and pragmatic analysis for a specific NLP problem. It enabled the formulation of general requirements for processing Ukrainian content.
Improved methodologies for information resource processes, such as the integration, management and support of Ukrainian-language content, make it possible to adapt intelligent analysis of the text stream to various NLP tasks and to develop effective CLS and metrics for them. NLP methods based on regular-expression pattern matching are improved, which allowed grapheme and morphological analysis algorithms to be adapted to Ukrainian text processing. A method of text tokenisation and normalisation by cascades of simple regular-expression substitutions and finite state machines is improved, which allowed the lexical and syntactic analysis algorithm to be adapted to Ukrainian text processing. The morphological analysis method is improved; it is based on word segmentation and normalisation, sentence segmentation, and a modified Porter stemming algorithm as an effective tool for identifying the affixes of lemmas in order to tag the analysed word. This resulted in a 9% increase in keyword search accuracy. A method of identifying keywords in Ukrainian texts is elaborated, based on grapheme and morphological analysis of the word stem using regular expressions and N-grams. It increased the accuracy of searching for keywords and stable word combinations and of categorising content by 6–9%. A method for determining stable word combinations is developed, based on identifying keywords in a Ukrainian text and analysing the author's lexical coefficients against a reference text. The accuracy of the statistical-linguistics method for determining the author's style has been improved by 6–7%.
A method for determining the author's style of thematic Ukrainian textual content is developed, based on the analysis of keywords, stable phrases, N-grams, linguometry and stylometry. It enabled recognition of each author's stylistic contribution and increased the accuracy of attributing scientific and technical publications by 6–12%. A method is developed to verify the authorship of a Ukrainian text among a set of possible authors, based on a comparative stylistic analysis of the candidate authors. It improved the classification accuracy of style similarity to [9;34]% of the total number of project participants. Methods for the analysis and synthesis of CLS are developed, based on creating an organisational structure for the Ukrainian text processing system that supports modularity and on modelling the interaction of the main processes and components. This increased the number of typical NLP problems that can be solved by implementing typical software systems. The CLS is implemented on the platform http://victana.lviv.ua using the Joomla! CMS (site framework), PHP (text content processing methods), HTML (page mark-up), CSS (page styles) and MySQL (storage of data and dictionaries).
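The tokenisation, N-gram and lexical-coefficient steps named in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation (which, according to the abstract, is written in PHP). The token pattern, the function names and the choice of the type-token ratio as a lexical diversity coefficient are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of Unicode letters, optionally joined
# by an apostrophe or hyphen (e.g. "комп'ютерна", "науково-технічних").
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(text):
    """Normalise text with a small cascade of substitutions, then extract tokens."""
    text = re.sub(r"[’ʼ`]", "'", text)  # step 1: unify apostrophe variants
    text = text.lower()                 # step 2: case folding
    return TOKEN_RE.findall(text)       # step 3: pull out word tokens


def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams with their counts."""
    return Counter(ngrams(tokens, n)).most_common(k)


def type_token_ratio(tokens):
    """Share of distinct tokens; an assumed stand-in for a lexical coefficient."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    sample = "Комп’ютерна лінгвістика, комп'ютерна лінгвістика!"
    words = tokenize(sample)
    print(words)
    print(top_ngrams(words, 2, 1))
    print(type_token_ratio(words))
```

Frequent N-grams serve here as candidate stable word combinations, and a per-author ratio could be compared against a reference text. A real system would add stemming and stop-word handling before counting.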

Research papers

1. Lytvyn V., Pukach P., Vysotska V., Vovk M., Kholodna N. Identification and correction of grammatical errors in Ukrainian texts based on machine learning technology. Mathematics. 2023. Vol. 11. 904.

2. Bisikalo O., Danylchuk O., Kovtun V., Kovtun O., Nikitenko O., Vysotska V. Modeling of operation of information system for critical use in the conditions of influence of a complex certain negative factor. International Journal of Control, Automation and Systems. 2022. Vol. 20. P. 1904–1913.

3. Bublyk M., Kowalska-Styczeń A., Lytvyn V., Vysotska V. The Ukrainian economy transformation into the circular based on fuzzy-logic cluster analysis. Energies. 2021. Vol. 14(18). Art. 5951.

4. Lytvyn V., Vysotska V., Peleshchak I., Rishnyak I., Peleshchak R. Time dependence of the output signal morphology for nonlinear oscillator neuron based on Van der Pol model. International Journal of Intelligent Systems and Applications. 2018. Vol. 10(4). P. 8–17.

5. Vysotska V. Method of authorship attribution of scientific and technical publication texts based on linguistic analysis of language diversity coefficients (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2020. № 1 (52). P. 108–124.

6. Vysotska V. Information technology for promoting Internet resources in search engines based on content analysis of web page keywords (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2021. № 3 (58). P. 133–151.

7. Alieksieieva K. A., Berko A. Yu., Vysotska V. A. Technology for managing a commercial web resource based on fuzzy logic (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2015. № 3 (34). P. 71–79.

8. Bisikalo O. V., Vysotska V. A. Keyword detection based on the content monitoring method for Ukrainian-language texts (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2016. № 1 (36). P. 74–83.

9. Bisikalo O. V., Vysotska V. A. Application of the sentence parsing method to determine keywords in Ukrainian-language text (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2016. № 3 (38). P. 54–65.

10. Lytvyn V., Pukach P., Bobyk I., Vysotska V. The method of formation of the status of personality understanding based on the content analysis. Eastern-European Journal of Enterprise Technologies. 2016. Vol. 5. P. 4–12.

11. Lytvyn V. V., Bobyk I. O., Vysotska V. A. Application of the system of algorithmic algebras to grammatical analysis of symbolic computation of propositional logic expressions (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2016. № 4 (39). P. 77–89.

12. Lytvyn V., Vysotska V., Pukach P., Bobyk I., Pakholok B. A method for constructing recruitment rules based on the analysis of a specialist’s competences. Eastern-European Journal of Enterprise Technologies. 2016. Vol. 6/2 (84). P. 4–14.

13. Lytvyn V., Vysotska V., Pukach P., Brodyak O., Ugryn D. Development of a method for determining the keywords in the Slavic language texts based on the technology of web mining. Eastern-European Journal of Enterprise Technologies. 2017. Vol. 2/2 (86). P. 14–23.

14. Lytvyn V., Vysotska V., Pukach P., Vovk M., Ugryn D. Method of functioning of intelligent agents, designed to solve action planning problems based on ontological approach. Eastern-European Journal of Enterprise Technologies. 2017. Vol. 3/2 (87). P. 11–17.

15. Lytvyn V., Vysotska V., Pukach P., Bobyk I., Uhryn D. Development of a method for the recognition of author’s style in the Ukrainian language texts based on linguometry, stylemetry. Eastern-European Journal of Enterprise Technologies. 2017. Vol. 4/2 (88). P. 10–18.

16. Korobchynskyi M. V., Chyrun L. B., Vysotska V. A., Nych M. O. Features of predicting match results in esports (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2017. № 3 (42). P. 95–105.

17. Korobchynskyi M. V., Chyrun L. B., Vysotska V. A., Kondratiev Ye. O. Features of forming and analysing the content of an online music news newspaper (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2017. № 4. P. 139–150.

18. Lytvyn V., Vysotska V., Uhryn D., Hrendus M., Naum O. Analysis of statistical methods for stable combinations determination of keywords identification. Eastern-European Journal of Enterprise Technologies. 2018. Vol. 2/2 (92). P. 23–37.

19. Lytvyn V., Vysotska V., Maria H. Method of data expression from the Ukrainian content based on the ontological approach. Радіоелектроніка. Інформатика. Управління. 2018. № 3 (46). P. 144–157.

20. Lytvyn V., Vysotska V., Pukach P., Nytrebych Z., Demkiv I., Kovalchuk R., Huzyk N. Development of the linguometric method for automatic identification of the author of text content based on statistical analysis of language diversity coefficients. Eastern-European Journal of Enterprise Technologies. 2018. Vol. 5/2 (95). P. 16–28.

21. Pelekh I., Lytvyn V., Vysotska V., Kuchkovskiy V., Bobyk I., Malanchuk O., Ryshkovets Y., Brodyak O., Bobrivetc V., Panasyuk V. Development of the system to integrate and generate content considering the cryptocurrent needs of users. Eastern-European Journal of Enterprise Technologies. 2019. Vol. 1/2(97). P. 18–39.

22. Lytvyn V., Vysotska V., Pukach P., Nytrebych Z., Demkiv I., Senyk A., Malanchuk O., Sachenko S., Kovalchuk R., Huzyk N. Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian. Eastern-European Journal of Enterprise Technologies. 2018. Vol. 6/2 (96). P. 19–31.

23. Berko A., Vysotska V., Lytvyn V., Naum O. Planning the activities of intellectual agents in the electronic commerce systems. Радіоелектроніка. Інформатика. Управління. 2018. № 4. P. 143–158.

24. Lytvyn V., Vysotska V., Demchuk A., Demkiv I., Ukhans’ka O., Hladun V., Kovalchuk R., Petruchenko O., Dzyubyk L., Sokulska N. Design of the architecture of an intelligent system for distributing commercial content in the internet space based on SEO-technologies, neural networks, and machine learning. Eastern-European Journal of Enterprise Technologies. 2019. Vol. 2/2(98). P. 15–34.

25. Lytvyn V., Vysotska V., Shatskykh V., Kohut I., Petruchenko O., Dzyubyk L., Bobrivetc V., Panasyuk V., Sachenko S., Komar M. Design of a recommendation system based on collaborative filtering and machine learning considering personal needs of the user. Eastern-European Journal of Enterprise Technologies. 2019. Vol. 4/2 (100). P. 6–28.

26. Vysotska V., Demchuk A., Lytvyn V. Features of the architecture for Internet commercial content management system based on methods of Machine Learning, Web mining and SEO technologies. Радіоелектроніка. Інформатика. Управління. 2019. № 4. P. 121–135.

27. Lytvyn V., Vysotska V., Budz I., Pelekh Y., Sokulska N., Kovalchuk R., Dzyubyk L., Tereshchuk O., Komar M. Development of the quantitative method for automated text content authorship attribution based on the statistical analysis of N-grams distribution. Eastern-European Journal of Enterprise Technologies. 2019. Vol. 6/2 (102). P. 28–51.

28. Kravets P., Lytvyn V., Vysotska V. Game model of ontological project support (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2021. № 1 (56). P. 172–183.

29. Lytvyn V. V., Bublyk M. I., Vysotska V. A., Matseliukh Yu. R. Technology for visual simulation of passenger flows in smart city public transport (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2021. № 4 (59). P. 106–121.

30. Kravets P. O., Lytvyn V. V., Vysotska V. A. Modelling the game problem of assigning personnel to IT projects based on ontologies (in Ukrainian). Радіоелектроніка. Інформатика. Управління. 2022. № 1 (60). P. 130–145.

31. Lytvyn V., Vysotska V., Veres O., Rishnyak I., Rishnyak H. Classification methods of text documents using ontology based approach. Advances in Intelligent Systems and Computing. 2017. Vol. 512. P. 229–240.

32. Lytvyn V., Vysotska V., Burov Y., Veres O., Rishnyak I. The contextual search method based on domain thesaurus. Advances in Intelligent Systems and Computing. 2018. Vol. 689. P. 310–319.

33. Kanishcheva O., Vysotska V., Chyrun L., Gozhyj A. Method of integration and content management of the information resources network. Advances in Intelligent Systems and Computing. 2018. Vol. 689. P. 204–216.

34. Vysotska V., Fernandes B. V., Emmerich M. Web content support method in electronic business systems. CEUR Workshop Proceedings. 2018. Vol. 2136. P. 20–41.

35. Lytvyn V., Vysotska V., Dosyn D., Burov Y. Method for ontology content and structure optimization, provided by a weighted conceptual graph. Webology. 2018. Vol. 15(2). P. 66–85.

36. Lytvyn V., Vysotska V., Osypov M., Slyusarchuk O., Slyusarchuk Y. Development of intellectual system for data de-duplication and distribution in cloud storage. Webology. 2019. Vol. 16. P. 1–42.

37. Vysotska V., Lytvyn V., Burov Y., Gozhyj A., Makara S. The consolidated information web-resource about pharmacy networks in city. CEUR Workshop Proceedings. 2018. Vol. 2255. P. 239–255.

38. Rusyn B., Lytvyn V., Vysotska V., Emmerich M., Pohreliuk L. The virtual library system design and development. Advances in Intelligent Systems and Computing. 2019. Vol. 871. P. 328–349.

39. Vysotska V., Fernandes B. V., Lytvyn V., Emmerich M., Hirnyak M. Method for determining linguometric coefficient dynamics of Ukrainian text content authorship. Advances in Intelligent Systems and Computing. 2019. Vol. 871. P. 132–151.

40. Gozhyj A., Vysotska V., Yevseyeva I., Kalinina I., Gozhyj V. Web resources management method based on intelligent technologies. Advances in Intelligent Systems and Computing. 2019. Vol. 871. P. 206–221.
