Pryshchepa S. Information Technology of New Events Extraction Based on Linguistic Network Analysis in Global Networks


Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0418U003920

Applicant for

Specialization

  • 05.13.06 - Information Technologies

29-11-2018

Specialized Academic Board

Д 26.861.05

State University of Telecommunications

Abstract

The dissertation solves a current scientific task: developing an information technology for extracting new events based on linguistic network analysis of social network information flows. The technology is capable of working with large volumes of poorly structured text data and the substantial noise inherent in modern social networks. Its aim is to increase the efficiency of automatic event extraction from these flows under conditions of large information volumes and considerable informational noise. The current state of information technologies for event extraction and novelty detection, and the existing scientific methods for this task, were analyzed. It was established that existing approaches to information extraction fall short of current requirements for the extracted information. There is thus a contradiction between the need of state and business structures to identify new events promptly in the dynamically growing volumes of information in global networks and the limited scientific and methodological apparatus for detecting them effectively. One way to resolve this contradiction is to develop an effective technology for extracting new events from selected information flows. To this end, the research developed a method for identifying new events in the texts of social network messages.
The essence of the method is as follows. From distributed arrays of documents of certain categories that contain key words from the dictionary of event triggers, a naive Bayes classifier decides whether each document describes an event. Text mining techniques for identifying concepts and entities then make it possible to analyze the document automatically and compare it with the events already in the database. The comparison is made component by component: components represented as event vectors (event triggers, headlines, key words, source rating) are compared by cosine similarity, and the remaining components (entity, location) by degree of inclusion. The purpose of the method is to detect new events in text messages from poorly structured information flows through intelligent text processing. To fill the event trigger dictionary used by this method automatically, a new method for detecting descriptors in text arrays was developed. Its purpose is to identify descriptors, use them to generate queries and search for documents relevant to a specific news topic, and use them as candidate triggers for event classification. To improve the effectiveness of monitoring and of detecting documents about an event from primary sources, a source rating methodology was developed. It performs a comprehensive ranking of sources based on the assessment of the event and on links adapted for special social network tags, in order to determine the credibility of message authors and the active mediators of information dissemination. A horizontal visibility graph is used to detect the most probable source of the event.
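The comparison step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the event representation (a list of terms plus a set of entities), the function names, the overlap measure and both thresholds are assumptions chosen only for illustration. A candidate is treated as new when no stored event is close to it on both the cosine measure and the entity overlap.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def overlap(a: set, b: set) -> float:
    """Share of the smaller set that also occurs in the other set.

    This is one possible reading of "degree of inclusion" (an assumption).
    """
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def is_new_event(candidate, known_events,
                 cos_threshold=0.6, overlap_threshold=0.5):
    """Return True if no known event matches the candidate.

    Both thresholds are hypothetical values, not taken from the thesis.
    """
    cand_vec = Counter(candidate["terms"])
    cand_entities = set(candidate["entities"])
    for event in known_events:
        sim = cosine(cand_vec, Counter(event["terms"]))
        ent = overlap(cand_entities, set(event["entities"]))
        if sim >= cos_threshold and ent >= overlap_threshold:
            return False  # close enough to an event already stored
    return True


# Hypothetical data, for illustration only.
known = [{"terms": ["crash", "highway", "truck", "kyiv"],
          "entities": {"Kyiv"}}]
repeat = {"terms": ["truck", "crash", "highway", "kyiv"],
          "entities": {"Kyiv"}}
fresh = {"terms": ["fire", "warehouse", "odesa"],
         "entities": {"Odesa"}}
```

With this data, `is_new_event(repeat, known)` rejects the repeated report and `is_new_event(fresh, known)` accepts the unrelated one. A fuller system would weight terms (for example by TF-IDF) and compare each event component separately, as the abstract describes.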
The research methods draw on mathematical analysis, probability theory and mathematical statistics, graph theory, the theory of complex networks, natural language processing, machine learning and computational linguistics. Instrumental and software tools implementing the developed technology were built into the content monitoring systems InfoStream and X-SKIF, as confirmed by implementation acts. The effectiveness of the technology was assessed on a test sample of 1000 documents in Ukrainian and Russian from the social network Twitter on the topic of road traffic accidents. The assessment showed results 3-5% better on the balanced F-measure than other approaches to extracting new events from the information flows of global networks.
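For reference, the balanced F-measure is the harmonic mean of precision and recall. The sketch below shows how it is computed from raw counts. The function name and the counts in the usage note are hypothetical and do not reproduce the thesis results.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives and false negatives.

    beta = 1.0 gives the balanced F-measure (F1).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with 80 correctly detected events, 20 false detections and 20 missed events (made-up numbers), precision and recall are both 0.8, so the balanced F-measure is also 0.8.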
