The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

  • 2025-09-25 15:13:00
  • Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram

Abstract

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing the performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

 
