Zarichkovyi O. Algorithmic software for video data annotation for computer vision tasks


Thesis for the degree of Doctor of Philosophy (PhD)

State registration number

0825U000390

Specialization

  • 121 – Software Engineering

Specialized Academic Board

ДФ 26.002.197; PhD 7646

National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute»

Abstract

PhD thesis in the field of knowledge 12 Information Technologies, specialty 121 Software Engineering. – National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, 2025.

Artificial intelligence is one of the most prominent fields of software development in the modern world of information technologies. Significant progress has been achieved in computer vision tasks, particularly object detection, over the past decade, owing to advances in deep learning methods and the growth of computational capabilities. The successful development and deployment of AI-based software tools requires collecting and annotating large volumes of data, which demands considerable human resources and time. Low-quality data annotation can produce inaccurate and erroneous AI methods and, consequently, errors in software computations. Current data annotation tools do not always meet the needs of software developers working with AI, especially for large-scale video data annotation, which increases the labor intensity of creating high-quality datasets. These problems define a pressing scientific task, addressed in this dissertation: improving the efficiency of video data annotation processes for computer vision tasks.

The aim of the dissertation is to increase the speed of video data annotation in object detection tasks by improving the methods and software tools designed for video data annotation. To achieve this aim, the study investigates neural network training methods that improve object detection accuracy without modifying models or increasing the number of parameters, as well as approaches that reduce the number of frames processed by computer vision methods. A study of visual-language models was also conducted to improve accuracy. A dual-architecture system for automated data annotation and its supporting software have been developed.
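One simple way to reduce the number of frames a vision model must process is to keep only frames whose content changes noticeably. The following is an illustrative sketch only, not the dissertation's method; the function name and the use of scalar per-frame signatures (e.g. mean intensity) are assumptions made for the example:

```python
def reduce_frames(signatures, threshold=0.1):
    """Keep a frame only when its scalar signature (e.g. mean intensity)
    differs from the last kept frame by more than `threshold`, so a
    detector processes far fewer near-duplicate frames."""
    kept = [0]  # always keep the first frame
    for i in range(1, len(signatures)):
        if abs(signatures[i] - signatures[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

For a sequence of near-duplicate frames punctuated by two scene changes, only the first frame of each stable run survives, cutting the downstream detection workload roughly in proportion to the redundancy in the video.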
Experimental research demonstrates the effectiveness of the proposed solutions. One of the main challenges in achieving the aim is ensuring high-quality automated data annotation that minimizes the percentage of errors requiring manual correction. To address this, several techniques and methods were employed, including the novel dual-architecture approach, a data prioritization method, an iterative keyframe selection method, and multimodal neural networks. The main result of this work is the creation of a dual-architecture software system for data annotation automation and the implementation of automatic video annotation methods. These methods ensure high annotation accuracy and reduce the time annotators need for post-annotation refinement. The developed methods were tested on real-world tasks to demonstrate their efficiency and advantages.

The dissertation presents a number of new scientific results:

For the first time, a dual software architecture for automated data annotation has been proposed. Built on an adaptively-aggregated neural network training method, this architecture accelerates the annotation process and, unlike existing counterparts, enables the effective application of zero-shot and active neural network learning for data annotation, as well as more flexible use of the software across various computer vision tasks.

For the first time, a novel method for prioritizing difficult samples for neural network training is introduced, improving dataset quality without prior video annotation and enhancing object detection accuracy. Unlike existing approaches, this method relies solely on automatically generated data representations.

For the first time, an iterative method for selecting keyframes in long videos is proposed. It accurately identifies keyframes and segments while accounting for dynamic content changes, improving segmentation accuracy and reducing the volume of video data to be processed.
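The general idea of iterative keyframe selection can be illustrated with a coarse-to-fine search: scan frame signatures at a large stride, then "zoom in" with binary search wherever two consecutive samples differ, locating each segment boundary exactly. This is a minimal sketch under those assumptions, not the dissertation's actual algorithm; the function name, scalar signatures, and default parameters are illustrative:

```python
def keyframes_zoom_in(frames, coarse_step=8, threshold=0.5):
    """Coarse-to-fine keyframe search over scalar frame signatures:
    sample every `coarse_step`-th frame, and where two consecutive
    samples differ by more than `threshold`, binary-search between them
    for the first frame of the new segment."""
    diff = lambda a, b: abs(a - b)
    keys = [0]  # the first frame opens the first segment
    idx = list(range(0, len(frames), coarse_step))
    for a, b in zip(idx, idx[1:]):
        if diff(frames[a], frames[b]) > threshold:
            lo, hi = a, b
            while hi - lo > 1:  # narrow down to the exact boundary
                mid = (lo + hi) // 2
                if diff(frames[a], frames[mid]) > threshold:
                    hi = mid
                else:
                    lo = mid
            keys.append(hi)
    return keys
```

The coarse pass touches only len(frames) / coarse_step signatures, and each detected change costs O(log coarse_step) extra comparisons, so long stable segments are skipped almost entirely.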
For the first time, a method for aggregating knowledge between the textual and visual components of a visual-language model (VLM) has been proposed to model complex multimodal interactions, providing higher accuracy than existing counterparts in recognizing complex scenes in videos and describing them.

The main results of the dissertation were published in 6 scientific papers: 4 articles indexed in the Scopus database, 1 article in a scientific journal included in the list of scientific professional editions of Ukraine (category «B»), and 1 publication in the proceedings of scientific and technical conferences.
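As a rough illustration of combining textual and visual knowledge in a VLM (a generic late-fusion sketch, not the aggregation method proposed in the dissertation; the function name and weighting scheme are assumptions):

```python
import math

def fuse_modalities(vis_emb, txt_emb, alpha=0.5):
    """Late fusion of visual and textual embeddings: L2-normalize each
    modality, take a weighted sum, and re-normalize so the joint vector
    can be scored with cosine similarity downstream."""
    def l2norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    v, t = l2norm(vis_emb), l2norm(txt_emb)
    return l2norm([alpha * a + (1 - alpha) * b for a, b in zip(v, t)])
```

Normalizing before fusion keeps either modality from dominating purely through embedding magnitude; `alpha` then controls the visual-versus-textual balance explicitly.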

Research papers

Zarichkovyi, O.A. (2024) ‘Dual software architecture for automating data annotation for computer vision tasks’ [in Ukrainian], Adaptive Systems of Automatic Control («Адаптивні системи автоматичного управління»), 45, pp. 109-118. DOI 10.20535/1560-8956.45.2024.313096

Zarichkovyi, A., Stetsenko, I.V. (2024) ‘Attr4Vis: Revisiting importance of attribute classification in Vision-Language Models for Video Recognition’, International Journal of Computing, 23 (1), pp. 94-100. DOI 10.47839/ijc.23.1.3440

Zarichkovyi, A., Stetsenko, I.V. (2023) ‘Boundary Refinement via Zoom-In Algorithm for Keyshot Video Summarization of Long Sequences’, Lecture Notes on Data Engineering and Communications Technologies, 180, pp. 344-359. DOI 10.1007/978-3-031-36115-9_32

Zarichkovyi, A., Stetsenko, I.V. (2023) ‘Hard Samples Make Difference: An Improved Training Procedure for Video Action Recognition Tasks’, Lecture Notes in Networks and Systems, 544, pp. 508-519. Springer, Cham. DOI 10.1007/978-3-031-16075-2_36

Zarichkovyi, O., Mukha, I. (2021) ‘Approximate Training of Object Detection on Large-Scale Datasets’, Lecture Notes on Data Engineering and Communications Technologies, 83, pp. 389-400. DOI 10.1007/978-3-030-80472-5_32

Zarichkovyi, A., Stetsenko, I.V. (2024) ‘Improving cross-modal knowledge exploration of vision language models’, Software Engineering and Advanced Information Technologies (SoftTech-2024): proceedings of the VI International Scientific and Practical Conference of Young Scientists and Students, 21-23 May 2024, Kyiv, National Technical University of Ukraine «Igor Sikorsky Kyiv Polytechnic Institute», FICT, pp. 58-61.
