Petrov O. Models, methods and tools for distributed data warehouses optimization


Thesis for the degree of Candidate of Sciences (CSc)

State registration number

0409U005036

Applicant for

Specialization

  • 05.13.06 - Information technologies

16-10-2009

Specialized Academic Board

К11.051.08

Abstract

Thesis for the degree of Candidate of Technical Sciences, specialty 05.13.06 - Information technologies. Donetsk National Technical University, 2009. The dissertation addresses a new scientific and practical problem: optimizing the set of materialized views and the allocation of data in a distributed data warehouse (DDW). The thesis is devoted to the development of new models and methods for DDW optimization. A structural analysis of the distributed data warehouse was conducted, and on its basis the main components of its physical and logical architecture were identified. Object models of the typical DDW components were then developed using an object-oriented approach and described in UML. A new approach to DDW optimization is proposed, based on the combined use of this object-oriented model of the distributed data warehouse and a modified genetic algorithm. The modified genetic algorithm generates new data allocation schemas by applying the genetic operators of recombination, crossover and mutation to schemas encoded as doubled multichromosomes. The object model receives each data allocation schema and is used to calculate the value of the DDW efficiency criterion; these values are returned to the genetic algorithm, which uses them as fitness function values. The proposed approach increases the efficiency of a distributed data warehouse by reducing the average processing time of user select queries. A new modification of the genetic algorithm for the DDW optimization problem was developed.
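The interaction described above, in which a genetic algorithm proposes allocation schemas and a model returns the efficiency criterion used as fitness, can be sketched roughly as follows. This is only an illustration under stated assumptions: the query list, fragment names, cost constants and the function `average_query_time` are hypothetical stand-ins for the object model, and the operators are a plain textbook one-point crossover and point mutation, not the modified operators of the thesis.

```python
import random

# Hypothetical toy workload: (issuing node, fragment needed, fragment size).
QUERIES = [
    (0, "sales_2008", 40),
    (1, "sales_2008", 40),
    (1, "sales_2009", 60),
    (2, "customers", 10),
]
FRAGMENTS = ["sales_2008", "sales_2009", "customers"]
NODES = 3
LOCAL_COST, REMOTE_COST = 1.0, 5.0  # assumed time per unit of data


def average_query_time(allocation):
    """Stand-in for the object model: allocation[i] is the node storing FRAGMENTS[i]."""
    total = 0.0
    for node, fragment, size in QUERIES:
        stored_on = allocation[FRAGMENTS.index(fragment)]
        total += size * (LOCAL_COST if stored_on == node else REMOTE_COST)
    return total / len(QUERIES)


def optimize(generations=30, population_size=20, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randrange(NODES) for _ in FRAGMENTS)
                  for _ in range(population_size)]
    for _ in range(generations):
        # The model scores every schema; a lower average time means a fitter schema.
        ranked = sorted(population, key=average_query_time)
        parents = ranked[: population_size // 2]  # the better half survives unchanged
        children = []
        while len(parents) + len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(FRAGMENTS))  # one-point crossover
            child = list(a[:cut] + b[cut:])
            if rng.random() < mutation_rate:  # move one fragment to a random node
                child[rng.randrange(len(child))] = rng.randrange(NODES)
            children.append(tuple(child))
        population = parents + children
    return min(population, key=average_query_time)
```

Calling `optimize()` returns one allocation tuple. Because the better half of each generation is carried over unchanged, the best criterion value found never gets worse from one generation to the next.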
Doubled multichromosomes were used for the first time to encode data distribution schemas in a distributed data warehouse. New versions of the genetic operators of recombination, crossover and mutation were developed. This modification of the genetic algorithm yields suboptimal solutions to the DDW optimization problem, and it optimizes not only the allocation of data among the nodes of the warehouse but also the set of materialized views. On the basis of the developed object model and the modified genetic algorithm, a new tool for DDW modeling and optimization was developed. The tool consists of three subsystems: modeling, analysis and optimization of the distributed data warehouse. It can be used both at the design stage of a distributed data warehouse and when the warehouse is already implemented and in operation. The distributed data warehouse of the Donetsk regional office of Prominvestbank was selected as the object of the experiments. Real operational data from this DDW were obtained and used to test the adequacy of the model. A series of computational experiments on DDW modeling and optimization was then carried out. First, the influence of data allocation and of the technical characteristics of the DDW on its efficiency was studied. These experiments showed that doubling the capacity of the channels leads to a slight increase in DDW efficiency. They also showed that creating some new materialized views and placing them on the nodes where they are used most increases DDW efficiency as well. However, this increase is not enough to solve the DDW optimization problem, since the solutions obtained remain far from optimal.
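The abstract does not define the structure of a doubled multichromosome, so the following is only one possible reading, given for illustration: a schema made of two linked chromosomes, one for fragment allocation and one for materialized view placement. The class `Schema`, the constant `NODES` and both operators are hypothetical; the thesis's own versions of recombination, crossover and mutation are not reproduced here.

```python
import random
from dataclasses import dataclass

NODES = 3  # assumed number of warehouse nodes


@dataclass(frozen=True)
class Schema:
    """Hypothetical encoding (not the thesis's definition): two linked chromosomes."""
    allocation: tuple  # allocation[i] = node that stores data fragment i
    views: tuple       # views[j] = node holding materialized view j, or None if not created


def crossover(a, b, rng):
    """One-point crossover applied to each chromosome separately."""
    cut_a = rng.randrange(1, len(a.allocation))
    cut_v = rng.randrange(1, len(a.views))
    return Schema(a.allocation[:cut_a] + b.allocation[cut_a:],
                  a.views[:cut_v] + b.views[cut_v:])


def mutate(s, rng):
    """Reassign one fragment, or create, drop or move one materialized view."""
    allocation, views = list(s.allocation), list(s.views)
    if rng.random() < 0.5:
        allocation[rng.randrange(len(allocation))] = rng.randrange(NODES)
    else:
        views[rng.randrange(len(views))] = rng.choice([None, *range(NODES)])
    return Schema(tuple(allocation), tuple(views))
```

Keeping the two chromosomes in one individual lets a single fitness evaluation account for both decisions at once, which matches the statement that data allocation and the set of materialized views are optimized together.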
On the basis of the experimental results, rational parameters of the modified genetic algorithm were determined (the probabilities of the genetic operators, the number of generations and the number of individuals in the population). These parameters were then used in the genetic algorithm applied to data warehouse optimization. The suboptimal solution obtained with the genetic algorithm was compared with the optimal solution obtained by brute force, and the difference between them was just 4.37%. This means that the developed genetic algorithm yields suboptimal solutions that are very close to the global optimum. Recommendations for increasing the efficiency of distributed data warehouses were also developed. Keywords: distributed data warehouse, fact table, dimension table, materialized view, fragmentation, replication, object-oriented modeling, genetic algorithm, crossover, mutation, recombination.
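The comparison with brute force reported in the abstract can be illustrated with a short sketch. The abstract does not state how the 4.37% difference was computed; the code below assumes it is the excess of the found criterion value relative to the optimum in a minimization problem. The function names and the toy cost function in the usage note are hypothetical.

```python
from itertools import product


def brute_force(cost, n_items, n_nodes):
    """Score all n_nodes ** n_items allocations; feasible only for tiny instances."""
    return min(product(range(n_nodes), repeat=n_items), key=cost)


def relative_gap(found, optimum):
    """Percentage by which a found criterion value exceeds the optimum (minimization)."""
    return 100.0 * (found - optimum) / optimum
```

For example, with a toy cost such as `lambda alloc: 10 + sum(abs(node - i) for i, node in enumerate(alloc))`, `brute_force(cost, 3, 3)` enumerates 27 allocations. The search space grows exponentially with the number of fragments, which is why exhaustive search is practical only as a reference for small instances.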
