The relevance of the research topic stems from the contradiction between the complexity of deep learning models in existing software solutions, on the one hand, and the growing difficulty of interpreting these models in applications as the speed of real-time object recognition and tracking increases, on the other. The field of application of unmanned aerial vehicles (UAVs) is constantly expanding, which creates a need to adapt machine learning algorithms and deep learning models for use on board UAVs, taking into account limited computing resources and the specifics of the incoming video data stream. The paper emphasizes the importance of existing architectural solutions for implementing convolutional neural networks together with machine learning methods and algorithms. The identified contradiction is resolved by introducing additional conditions into the task of processing a large, continuously updated set of video-stream images that may contain objects of different sizes and shapes to be recognized and tracked. The need for an adapted loss function in the proposed software solution to support decision-making based on the observed data is emphasized. Fusing convolutional neural network features from the spatial and temporal streams is a relevant task in this area.
The purpose of the study is to improve the accuracy of real-time object detection, recognition and tracking and to implement the corresponding technology as a software tool for this task within intelligent systems.
The object of research is the data-processing processes involved in real-time object detection, recognition and tracking.
The subject of the study is the models, algorithms and technologies for applying convolutional neural networks to real-time object recognition and tracking.
The scientific novelty of the research results is as follows:
– for the first time, an architectural solution for building a convolutional neural network for real-time object detection, recognition and tracking has been developed; it differs from the existing solution in that it uses a larger number of recognition units for objects of different sizes and is optimized for the tasks of a specific subject area;
– for the first time, the use of the PFNB block, based on the FasterNet architecture, in the developed technology has been substantiated; the block employs a multi-scale feature fusion network and demonstrates improved recognition accuracy compared to the baseline technology (a sketch of such a fusion block is given after this list);
– for the first time, an original dataset covering objects of different scales from a specific subject area was assembled to test the developed technology, starting from the stage of object recognition in the video stream, which confirms the effectiveness of the developed model;
– for the first time, the architecture of a cross-platform library implementing the object detection, recognition and tracking technology is proposed; it has a five-scale structure and incorporates the computationally lightweight BiFormer attention mechanism, which improves the accuracy of small-object detection and strengthens attention to key information in the feature map (see the simplified attention sketch after this list);
– for the first time, comparative experiments were carried out on YOLO v9 that differ only in the type of loss function used, with all other training conditions kept unchanged; they showed the WIoU v3 regression loss function (sketched after this list) to be the most effective for the constructed model;
– for the first time, experiments were carried out in which PFNB detection units, which fuse fine-grained features across the layers of the convolutional neural network, were added to the basic model; this increases the mean mAP value while simultaneously reducing the model size and the number of parameters;
– for the first time, experiments were carried out on the improved YOLO v9 P model, which differs from the basic YOLO v9 model in its loss function, fusion method and modified recognition-unit architecture; this improved the mean mAP value by about 7.7% and AP by about 2.5% to 14.1%.
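The abstract does not describe the internals of the PFNB block, so the following is only a minimal sketch of one common way to fuse feature maps of neighbouring scales in a YOLO-style neck, combined with a FasterNet-style partial convolution (PConv); the module names, channel ratios and layer choices are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: multi-scale feature fusion with a FasterNet-style partial
# convolution. PFNB internals are not given in the abstract; everything here
# (PConv, FusionBlock, the 0.25 channel ratio) is a hypothetical illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv applied to a fraction of the channels,
    the remaining channels passed through unchanged."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * ratio))
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1)

    def forward(self, x):
        head, tail = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat((self.conv(head), tail), dim=1)

class FusionBlock(nn.Module):
    """Upsample the deeper (coarser) feature map, concatenate it with the
    shallower (finer) one, and mix the result with PConv + 1x1 projection."""
    def __init__(self, fine_ch, coarse_ch, out_ch):
        super().__init__()
        self.pconv = PConv(fine_ch + coarse_ch)
        self.proj = nn.Conv2d(fine_ch + coarse_ch, out_ch, 1)

    def forward(self, fine, coarse):
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode="nearest")
        fused = torch.cat((fine, coarse_up), dim=1)
        return self.proj(self.pconv(fused))

# Example: fuse an 80x80 fine map with a 40x40 coarse map.
fine = torch.randn(1, 128, 80, 80)
coarse = torch.randn(1, 256, 40, 40)
out = FusionBlock(128, 256, 128)(fine, coarse)  # -> (1, 128, 80, 80)
```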
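For the attention mechanism mentioned above, the sketch below shows a heavily simplified form of bi-level routing attention in the spirit of BiFormer: a coarse region-to-region affinity selects the top-k most relevant regions, and dense attention is computed only over tokens gathered from those regions. It omits per-head projections and other details of the reference BiFormer implementation; the function name and the region/top-k settings are assumptions for illustration only.

```python
# Heavily simplified sketch of bi-level routing attention (BiFormer-like idea).
import torch

def routing_attention(q, k, v, num_regions, topk):
    # q, k, v: (B, N, C) token sequences; N must be divisible by num_regions.
    B, N, C = q.shape
    S = N // num_regions
    qr = q.view(B, num_regions, S, C)
    kr = k.view(B, num_regions, S, C)
    vr = v.view(B, num_regions, S, C)

    # Coarse level: mean-pooled region descriptors and region-to-region affinity.
    q_region = qr.mean(dim=2)                        # (B, R, C)
    k_region = kr.mean(dim=2)                        # (B, R, C)
    affinity = q_region @ k_region.transpose(1, 2)   # (B, R, R)
    idx = affinity.topk(topk, dim=-1).indices        # (B, R, topk)

    # Fine level: gather keys/values of the top-k regions for each query region
    # and run ordinary scaled dot-product attention over the gathered tokens.
    idx_exp = idx[..., None, None].expand(B, num_regions, topk, S, C)
    k_all = kr[:, None].expand(B, num_regions, num_regions, S, C)
    v_all = vr[:, None].expand(B, num_regions, num_regions, S, C)
    k_gather = torch.gather(k_all, 2, idx_exp).reshape(B, num_regions, topk * S, C)
    v_gather = torch.gather(v_all, 2, idx_exp).reshape(B, num_regions, topk * S, C)

    attn = torch.softmax(qr @ k_gather.transpose(-1, -2) / C ** 0.5, dim=-1)
    return (attn @ v_gather).reshape(B, N, C)

# Example: a 16x16 token map with 64 channels split into a 4x4 grid of regions.
tokens = torch.randn(2, 256, 64)
out = routing_attention(tokens, tokens, tokens, num_regions=16, topk=4)
```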
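Finally, the loss-function comparison above refers to WIoU v3. A minimal sketch of this regression loss is given below, assuming the standard Wise-IoU v3 formulation from the literature (distance-weighted IoU loss scaled by a non-monotonic focusing coefficient based on the outlier degree); the class name, hyperparameter values and running-mean update are illustrative assumptions, not the settings used in the paper.

```python
# Hedged sketch of a Wise-IoU v3 (WIoU v3) bounding-box regression loss in
# PyTorch. Boxes are assumed to be (x1, y1, x2, y2); alpha, delta and the
# running-mean momentum are illustrative values.
import torch

class WIoUv3Loss:
    def __init__(self, alpha=1.9, delta=3.0, momentum=0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of L_IoU used for the outlier degree

    def __call__(self, pred, target):
        # IoU loss L_IoU = 1 - IoU
        x1 = torch.max(pred[:, 0], target[:, 0])
        y1 = torch.max(pred[:, 1], target[:, 1])
        x2 = torch.min(pred[:, 2], target[:, 2])
        y2 = torch.min(pred[:, 3], target[:, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        iou = inter / (area_p + area_t - inter + 1e-7)
        l_iou = 1.0 - iou

        # Distance penalty R_WIoU: squared centre distance normalised by the
        # smallest enclosing box; the denominator is detached from the graph.
        cxp = (pred[:, 0] + pred[:, 2]) / 2
        cyp = (pred[:, 1] + pred[:, 3]) / 2
        cxt = (target[:, 0] + target[:, 2]) / 2
        cyt = (target[:, 1] + target[:, 3]) / 2
        wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
        hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
        r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) /
                           (wg ** 2 + hg ** 2 + 1e-7).detach())
        l_wiou_v1 = r_wiou * l_iou

        # WIoU v3: non-monotonic focusing coefficient from the outlier degree
        # beta = L_IoU* / mean(L_IoU), mapped to a gradient gain r.
        beta = l_iou.detach() / self.iou_loss_mean
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean +
                              self.momentum * l_iou.mean().item())
        gain = beta / (self.delta * self.alpha ** (beta - self.delta))
        return (gain * l_wiou_v1).mean()

# Example usage with a batch of predicted and ground-truth boxes.
pred_boxes = torch.tensor([[10., 10., 50., 60.]], requires_grad=True)
gt_boxes = torch.tensor([[12., 8., 48., 62.]])
loss = WIoUv3Loss()(pred_boxes, gt_boxes)
```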