This dissertation presents an in-depth exploration of the design and development of self-supervised learning algorithms, a subset of unsupervised learning techniques that operate without labeled datasets. These algorithms are particularly well suited to pre-training models in an unsupervised manner, and the resulting models perform on par with their supervised counterparts across a range of downstream applications. This approach is especially advantageous because it mitigates the heavy dependence on extensive data labeling that is typical of deep learning, thereby improving efficiency and practical utility in diverse real-world scenarios. The pertinence of self-supervised learning is especially pronounced in medical image analysis. In this specialized field, data annotation is not only laborious but also demands a high degree of precision owing to the critical nature of the data involved, and the difficulty of obtaining accurate annotations is compounded by the scarcity of specialists qualified to provide them. These constraints underscore the transformative potential of self-supervised learning approaches within this domain.
In this dissertation, a novel self-supervised learning methodology is delineated that employs the Mixup Feature as the reconstruction target of its pretext task. The pretext task is designed to capture visual representations by predicting Mixup features from masked images, using these feature maps to extract high-level semantic information. The dissertation validates the Mixup Feature's role as a predictive target in self-supervised learning frameworks. This investigation involved careful calibration of the hyperparameter λ that governs the Mixup Feature operation; adjusting λ generates blended feature maps that combine Sobel edge-detection maps, Histogram of Oriented Gradients (HOG) maps, and Local Binary Pattern (LBP) maps, providing a rich, multifaceted representation of visual data.
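The λ-weighted blending described above can be illustrated with a minimal sketch. This is not the dissertation's actual implementation: the function names are hypothetical, and a hand-rolled Sobel gradient-magnitude map stands in for the full set of hand-crafted features (HOG and LBP maps would be computed analogously and blended the same way).

```python
import numpy as np

def sobel_magnitude(img: np.ndarray) -> np.ndarray:
    """Sobel gradient-magnitude map of a 2-D grayscale image (one hand-crafted feature)."""
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    ky = kx.T
    padded = np.pad(img.astype(float), 1, mode="edge")
    gx = np.zeros(img.shape, dtype=float)
    gy = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            patch = padded[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def mixup_feature(feat_a: np.ndarray, feat_b: np.ndarray, lam: float) -> np.ndarray:
    """Convex combination of two same-shape feature maps, controlled by lambda."""
    return lam * feat_a + (1.0 - lam) * feat_b
```

In this sketch, the blended map `mixup_feature(sobel_map, hog_map, lam)` would serve as the regression target that the model predicts from the masked image; λ = 1 or λ = 0 recovers either constituent map unchanged.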
A denoising self-distillation Masked Autoencoder model for self-supervised learning was developed. This model synthesizes elements of Siamese Networks and Masked Autoencoders in a tripartite architecture comprising a student network in the form of a masked autoencoder, an intermediary regressor, and a teacher network. The underlying proxy task is the restoration of input images that have been artificially corrupted with random Gaussian noise patches. To ensure comprehensive learning, the model employs a dual loss function. One loss reinforces the global contextual understanding of the image, enabling the model to grasp the overall structure and scene configuration; the second refines the perception of intricate local details, ensuring that fine visual nuances are not lost during denoising and reconstruction. Through this approach, the model strikes a balance between macroscopic comprehension of visual scenes and meticulous reconstruction of localized details, a balance that is pivotal for sophisticated image analysis tasks in self-supervised learning frameworks.
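A minimal sketch of the corruption step and the dual-loss idea, under stated assumptions: the global term is taken here as a cosine distance between mean-pooled student and teacher features, the local term as an MSE restricted to the noised patches, and the weighting `alpha` and all function names are illustrative rather than the dissertation's exact formulation.

```python
import numpy as np

def gaussian_noise_patches(img, patch=4, frac=0.5, sigma=0.5, rng=None):
    """Corrupt a random fraction of patches with Gaussian noise; return noisy image and mask."""
    rng = np.random.default_rng(rng)
    noisy = img.astype(float).copy()
    mask = np.zeros_like(noisy)
    h, w = noisy.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            if rng.random() < frac:
                block = noisy[i:i + patch, j:j + patch]  # view: += edits noisy in place
                block += rng.normal(0.0, sigma, block.shape)
                mask[i:i + patch, j:j + patch] = 1.0
    return noisy, mask

def global_loss(student_feats, teacher_feats):
    """Cosine distance between mean-pooled token features (scene-level term)."""
    s = student_feats.mean(axis=0)
    t = teacher_feats.mean(axis=0)
    return 1.0 - float(s @ t) / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-8)

def local_loss(recon, target, mask):
    """MSE restricted to the corrupted patches (fine-detail term)."""
    return float(((recon - target) ** 2 * mask).sum() / (mask.sum() + 1e-8))

def total_loss(recon, target, mask, student_feats, teacher_feats, alpha=0.5):
    """Weighted combination of the global and local objectives."""
    return (alpha * global_loss(student_feats, teacher_feats)
            + (1.0 - alpha) * local_loss(recon, target, mask))
```

In a full training loop, `student_feats` would come from the masked-autoencoder student (via the regressor) and `teacher_feats` from the teacher network, with the teacher typically updated by an exponential moving average of the student in self-distillation setups.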
This study conducted a comprehensive analysis of the two novel self-supervised learning algorithms on the benchmark datasets CIFAR-10, CIFAR-100, and STL-10, benchmarking them against advanced Masked Image Modeling techniques. The mixed HOG-Sobel feature maps produced by the Mixup operation outperformed other state-of-the-art methods on CIFAR-10 and STL-10, with a 0.4% average improvement after full fine-tuning. Additionally, the denoising Masked Autoencoder (DMAE) surpassed the conventional Masked Autoencoder (MAE) by 0.1% on STL-10, highlighting DMAE's potential for enhancing model accuracy. The study also found that the Mixup Feature method was more efficient than traditional contrastive-learning-based strategies, offering shorter training times and eliminating the need for standard data augmentation, thus simplifying the learning process. These findings underscore the potential of these self-supervised learning algorithms for broader application to complex datasets.
The application of these self-supervised learning algorithms has been extended through pre-training on specially curated medical image datasets, leading to their effective use in downstream tasks. This study demonstrates that self-supervised pre-training outperforms direct training, showing an increase of over 5% in accuracy after full fine-tuning on two datasets. It also addresses the challenge of data imbalance in medical imaging by investigating the robustness of self-supervised pre-trained models against imbalanced datasets, highlighting their significance for both model training and feature extraction.