Lang C. Methodology and software for classifying natural language text documents

Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0412U002876

Applicant for

Specialization

  • 05.13.05 - Computer systems and components

11-06-2012

Specialized Academic Board

Д 26.002.02

Publishing and Printing Institute of Igor Sikorsky Kyiv Polytechnic Institute

Abstract

The thesis addresses the problem of automatic language identification and classification of natural language text documents. It develops a method for automatic language identification based on statistical N-grams; presents a comparative analysis of text classification methods in order to select the one with the best precision and recall; proposes classifying natural language text documents with the developed statistical N-gram method; develops a method for classifying text documents automatically in real time; and describes a software module created for identifying and classifying natural language text documents. The proposed classification method improved both the accuracy and the speed of classification and made it possible to develop software for automatic text processing in multilingual information systems. Keywords: automatic language identification, classification of text documents, natural language, N-grams, multi-label classification.
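This record does not describe the thesis's algorithm in detail. As a generic illustration of language identification with statistical N-grams, the sketch below uses the well-known rank-order ("out-of-place") comparison of character trigram profiles. It is not the author's method. The function names, the toy training sentences, and the parameters `n`, `top_k`, and `missing_penalty` are all assumptions made for this example.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    # Normalize case and whitespace; pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile, missing_penalty=300):
    """Sum of rank differences between two profiles.

    N-grams absent from lang_profile cost a fixed missing_penalty
    (an assumed value, chosen to equal the default top_k).
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else missing_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify_language(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (hypothetical); real profiles need far larger corpora.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sits on the mat",
    "ukrainian": "швидка руда лисиця стрибає через ледачого пса а кіт сидить на килимку",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(identify_language("the dog sits on the mat", PROFILES))
    print(identify_language("кіт сидить на килимку", PROFILES))
```

Profiles built from a single sentence only separate languages reliably when their character inventories barely overlap, as in this toy pair. Closely related languages that share a script would need much larger training samples.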
