Abstract
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face challenges in real-world applications and business scenarios, such as limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and to master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers a 7-day LT gain of +0.158% and a 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.