TIBSTC-CoT: A Multi-Domain Instruction Dataset for Chain-of-Thought Reasoning in Language Models

  • 2025-08-04 01:32:58
  • Fan Gao, Cheng Huang, Nyima Tashi, Yutong Liu, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu

Abstract

To address the severe data scarcity in Tibetan, a low-resource language spoken by over six million people, we introduce TIBSTC-CoT, a large-scale, multi-domain Tibetan dataset automatically constructed via chain-of-thought prompting with large language models (LLMs). TIBSTC-CoT establishes a scalable and reproducible framework for dataset creation in low-resource settings, covering diverse domains and reasoning patterns essential for language understanding and generation. Building on this dataset, we develop the Sunshine-thinking LLM family, a series of Tibetan-centric LLMs equipped with chain-of-thought capabilities. Trained entirely on TIBSTC-CoT, Sunshine-thinking demonstrates strong reasoning and generation performance, comparable to state-of-the-art (SOTA) multilingual LLMs. Our work marks a significant step toward inclusive AI by enabling high-quality Tibetan language processing through both resource creation and model innovation. All data are available at https://github.com/Vicentvankor/sun-shine.

 
