Abstract
Embedding-Based Retrieval (EBR) is an important technique in modern search engines, enabling semantic matching between search queries and relevant results. However, search logging data on platforms like Facebook Marketplace lacks the diversity and detail needed for effective EBR model training, limiting the models' ability to capture nuanced search patterns. To address this challenge, we propose Aug2Search, an EBR-based framework that leverages synthetic data generated by Generative AI (GenAI) models in a multimodal and multitask approach to optimize query-product relevance. This paper investigates the capabilities of GenAI, particularly Large Language Models (LLMs), in generating high-quality synthetic data, and analyzes its impact on enhancing EBR models. We conducted experiments using eight Llama models and 100 million data points from Facebook Marketplace logs. Our synthetic data generation follows three strategies: (1) generate queries, (2) enhance product listings, and (3) generate queries from enhanced listings. We train EBR models on three different datasets: sampled engagement data (the original data, e.g., "Click" and "Listing Interactions"), synthetic data, and a mixture of both engagement and synthetic data, to assess their performance across various training sets. Our findings underscore the robustness of Llama models in producing synthetic queries and listings with high coherence, relevance, and diversity, while maintaining low levels of hallucination. Aug2Search achieves an improvement of up to 4% in ROC_AUC with 100 million synthetic data samples, demonstrating the effectiveness of our approach. Moreover, our experiments reveal that with the same volume of training data, models trained exclusively on synthetic data often outperform those trained on original data only or on a mixture of original and synthetic data.