The dissertation addresses a pressing scientific problem: increasing the efficiency of
secure recognition and parameterization of voice information processing results. It does
so by combining natural language and voice information recognition approaches in order
to build voice authentication systems, detect intentions, and determine the emotional
state of subjects in information and communication systems, as well as to implement
cybersecurity management measures at state-owned and private enterprises.
Voice information processing methodology is a powerful tool that significantly affects
state security and the work of commercial organizations. It automates the monitoring of
electronic communications and audio archives through real-time recognition of speech,
emotions, and intentions. Several factors draw attention to this methodology and make
its improvement relevant:
1. The changing landscape of cyber threats. With the advent of generative models
and increased computing power, traditional security models that rely on highly structured
data no longer adequately detect and respond to fake audio data. Detecting, registering,
and responding to these new challenges is therefore becoming urgent, as is the rapid
development of this field.
2. Transition of voice information from telephone conversations to teleconferences.
With traditional telephone conversations, the telecom operator and government agencies
potentially had access to the content. Conversations were therefore shorter, and their
content was subject to self-censorship. With the transition to teleconferencing, the cost
of calls decreased and the proliferation of end-to-end encryption methods created a
perception of security. As a result, subscribers began to have more open and longer
conversations, which became especially relevant in the era of remote work. In addition,
because the volume of voice information has grown, the state must process it faster to
detect, for example, terrorist threats, and private enterprises must do so to detect leaks
of confidential data.
3. Data breaches and external threats. Deepfakes and distortions introduced into a
subscriber's original audio data threaten to oversaturate the information system with
requests. Fraud in intent analysis, including the generation of a large number of fake
intentions, overloads externally connected systems and exhausts response resources, so
legitimate actors risk not receiving attention. Detecting and counteracting such fraud is
therefore necessary.
4. Expanding the role of cloud services. As businesses and organizations
increasingly use cloud services to store confidential audio data, there is a need for
additional processing, including depersonalization and removal of sensitive data from the
audio stream.
5. Compliance requirements. The personal data of subscribers is subject to
confidentiality requirements under governmental regulations (GDPR, HIPAA), commercial
standards (PCI DSS), and/or ethical restrictions. Audio data, in turn, is difficult to
search and analyze in a structured way while meeting these requirements and
restrictions.
6. Continuous monitoring and adaptive security. Voice data can be processed both
from archives and in real time, but the bottleneck of information and communication
systems is streaming data processing. Incident response can therefore take two forms,
immediate action and incident investigation, and each approach has its own set of
unresolved issues.
7. Incident response and threat detection. Voice recognition systems do not have
incident response mechanisms of their own, so they must signal other systems in real
time. Integration with external information and communication systems for security
purposes is constrained by performance limits and request-processing delays, but it still
reduces potential damage. The relevance of a response also decreases sharply over time.