This thesis examines the use of L1-regularization in building "structure-activity" chemometric models and in quantum chemical calculations. For this purpose, an original set of programs was developed that implements various statistical (chemometric) approaches to constructing regression models and analyzing their predictive properties. A set of quantum chemical programs was also created in which L1-regularization is used to construct the wave functions of methods that account for electron correlation.
In particular, the thesis considers the application of L1-regularization to obtain linear empirical models that describe various physicochemical parameters of molecules. For the studied samples of molecules, it was shown that L1-regularization always makes it possible to form a sequential (ordered) set of descriptors. By systematically adding descriptors from this set to linear regression models or artificial neural networks, regression models with successively increasing values of the validation criteria can be obtained. Because the predictors selected after ranking the descriptor set can be used in different approaches to constructing linear regression models, the quality of these alternative models was studied as well. It was shown that different methods can give the best predictive ability according to external or internal validation criteria. It was also shown that artificial neural networks built on a descriptor set pre-ordered by L1-regularization can give high-quality predictions of the properties of substances. The linear regression equations obtained were further compared with alternative approaches that work with unreduced (non-optimized) descriptor sets. In the studied examples, L1-regularization was used to formulate compact one-, two-, or three-parameter models that describe the data set satisfactorily. In these examples, the models obtained with LARS-LASSO pre-selection performed better than PLS and PCR calculations.
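The descriptor ranking described above can be illustrated with a minimal sketch. It is not the thesis software: instead of LARS-LASSO it uses plain cyclic coordinate descent over a decreasing grid of penalty values, and it records the order in which descriptors first receive a nonzero coefficient. The function names, the penalty grid, and the synthetic data (in which only descriptors 0 and 2 carry signal) are assumptions made for illustration.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def standardize(cols):
    """Center each descriptor column and scale it to unit mean square."""
    out = []
    for col in cols:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        scale = (sum(v * v for v in centered) / len(col)) ** 0.5
        out.append([v / scale for v in centered])
    return out


def lasso_cd(cols, y, lam, beta, n_sweeps=50):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    `beta` serves as a warm start and is updated in place.
    """
    n = len(y)
    resid = [y[i] - sum(b * col[i] for b, col in zip(beta, cols)) for i in range(n)]
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            # Correlation of column j with the partial residual
            # (columns have unit mean square, so no extra scaling is needed).
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return beta


def rank_descriptors(cols, y, n_lambdas=50, lam_ratio=0.01):
    """Return descriptor indices in the order they first enter the lasso path."""
    cols = standardize(cols)
    n = len(y)
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]
    # Smallest penalty at which all coefficients are still zero.
    lam_max = max(abs(sum(c * v for c, v in zip(col, yc))) / n for col in cols)
    beta = [0.0] * len(cols)
    order = []
    for k in range(1, n_lambdas + 1):
        lam = lam_max * lam_ratio ** (k / n_lambdas)
        lasso_cd(cols, yc, lam, beta)
        order += [j for j, b in enumerate(beta) if b != 0.0 and j not in order]
    return order


# Synthetic illustration (assumed data): only descriptors 0 and 2 carry signal.
rng = random.Random(0)
n_mol, n_desc = 80, 5
X_cols = [[rng.gauss(0, 1) for _ in range(n_mol)] for _ in range(n_desc)]
y = [3 * X_cols[0][i] + X_cols[2][i] + rng.gauss(0, 0.1) for i in range(n_mol)]
order = rank_descriptors(X_cols, y)
print("entry order:", order)
```

Descriptors that never enter within the penalty grid are left unranked. Models of increasing size can then be built by taking the first one, two, or three indices of `order` and fitting an unpenalized regression on them.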
The thesis also addresses validation methods and the estimation of the quality of regression equations. For this purpose, a model problem was used in which errors were introduced into both the dependent and the independent variables. The simplest case, regression with one independent variable, was considered. It was shown that a single random split into training and test sets is not informative. Therefore, to estimate the quality of a regression equation adequately, and to assess the quality of the input data in general, it is necessary to generate and study as many splits into training and test sets as possible. The validation criteria proposed to date were also investigated. It was established that, for data with substantial scatter, external and internal validation criteria typically show an inverse (essentially nonlinear) dependence.
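The effect of the choice of split can be sketched for a one-variable model problem of this kind. The following is an illustrative example only: the noise levels, the sample size, the number of splits, and the particular external criterion (an R² on the test set referenced to the training-set mean) are assumptions, not the settings or criteria used in the thesis.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2_external(x_tr, y_tr, x_te, y_te):
    """External R^2 on the test set, referenced to the training-set mean."""
    slope, intercept = fit_line(x_tr, y_tr)
    mean_tr = sum(y_tr) / len(y_tr)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x_te, y_te))
    ss_tot = sum((b - mean_tr) ** 2 for b in y_te)
    return 1.0 - ss_res / ss_tot


def repeated_splits(x, y, n_splits=500, test_fraction=0.25, seed=0):
    """Score many random training/test splits instead of a single one."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    n_test = max(2, round(test_fraction * len(x)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        scores.append(r2_external([x[i] for i in train], [y[i] for i in train],
                                  [x[i] for i in test], [y[i] for i in test]))
    return scores


# Model problem (assumed parameters): noise in both x and y.
rng = random.Random(1)
x_true = [rng.uniform(0, 10) for _ in range(40)]
x_obs = [v + rng.gauss(0, 1.0) for v in x_true]
y_obs = [2.0 * v + 1.0 + rng.gauss(0, 3.0) for v in x_true]

scores = repeated_splits(x_obs, y_obs)
mean_score = sum(scores) / len(scores)
print(f"external R^2 over {len(scores)} splits: "
      f"min={min(scores):.2f}, mean={mean_score:.2f}, max={max(scores):.2f}")
```

The spread between the minimum and maximum values shows how much a conclusion drawn from any single split depends on which molecules happened to land in the test set.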
Another problem closely related to the construction of statistical models is the construction of a classification function. For this purpose, L1-regularized logistic regression was used in this work. For the studied classification tasks, it was shown that L1-regularized logistic regression achieves results competitive with those obtained by other, computationally more complex methods. The use of a special L1-regularized algorithm made it possible to obtain fairly simple, interpretable classification equations. The logistic regression equations obtained are also unambiguous and reproducible.
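How an L1 penalty yields a sparse, interpretable classification equation can be sketched as follows. This is a generic proximal-gradient (ISTA) solver written for illustration, not the special algorithm used in the thesis; the step size, penalty value, iteration count, and synthetic data are assumptions.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink toward zero, clip small values to 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.5, n_iter=500):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient descent.

    The intercept b is not penalized. The default step size assumes
    features of roughly unit variance.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Synthetic illustration (assumed data): only features 0 and 1 are informative.
rng = random.Random(2)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(200)]
y = [1 if rng.random() < sigmoid(2.5 * x[0] - 2.0 * x[1]) else 0 for x in X]

w, b = fit_l1_logistic(X, y, lam=0.05)
pred = [1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0 for x in X]
accuracy = sum(p_i == t_i for p_i, t_i in zip(pred, y)) / len(y)  # training accuracy
print("weights:", [round(v, 3) for v in w])
print("nonzero weights:", sum(v != 0.0 for v in w), "training accuracy:", accuracy)
```

Because the objective is convex, the fitted equation for a given penalty value does not depend on random initialization, which is consistent with the reproducibility noted above.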
It was shown that L1-regularization can also be used in quantum chemistry. With the L1-regularization procedure, an ordered (ranked) set of electronically excited configurations relative to the Hartree-Fock state can be created. By including different numbers of configurations from this set, a systematically improving sequence of approximations to the full calculation of the respective method can be obtained. The approach was implemented within second-order Møller-Plesset perturbation theory (MP2) and at different levels of coupled cluster theory. It was shown that such approximate solutions give fairly accurate values of the energy characteristics of molecules, while the number of configurations can be much smaller than in calculations that use the complete configuration set of the exact method. A number of computational algorithms based on first-order multi-step methods were implemented to solve the corresponding coupled cluster equations efficiently.