TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

Abstract

To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking demonstrates strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at https://github.com/Vicentvankor/sun-shine.
