Abstract
In the medical domain, acquiring large datasets poses significant challenges due to privacy concerns. Nonetheless, developing a robust deep-learning model for retinal disease diagnosis requires a substantial training dataset, and generalizing effectively from smaller datasets remains a persistent challenge. This data scarcity is a significant barrier to the practical implementation of scalable medical AI solutions. To address this issue, we combine a wide range of data sources and develop a self-supervised framework based on SwinV2, a large vision transformer, that learns a deeper understanding of multi-modal dataset representations, enhancing the model's ability to extrapolate to new data for the detection of eye diseases from optical coherence tomography (OCT) images. We adopt a two-phase training methodology: self-supervised pre-training, followed by fine-tuning on a downstream supervised classifier. An ablation study conducted across three datasets, covering various encoder backbones, no data fusion, a low-data-availability setting, and no self-supervised pre-training, highlights the robustness of our method. Our findings demonstrate consistent performance across these diverse conditions and superior generalization compared to the baseline model, ResNet-50.