Abstract
Tibetan, a minority language in China, features a highly intricate grammatical structure, characterized by four verb tenses and a tense system with frequent irregularities, contributing to its extensive inflectional diversity. Recently, advances in Large Language Models (LLMs) have transformed the paradigm in many domains. Despite this success in other fields, current LLMs often fall short in catering to the needs of domain experts such as Tibetans, and the potential of LLMs for Tibetan culture is under-explored. The intrinsic reasons are the immense and intricate nature of Tibetan culture as well as the necessity for higher granularity and richness in knowledge. Simultaneously, the complexity and uniqueness of its grammatical structure, coupled with its status as a minority ethnic language, contribute to data scarcity, which remains a fundamental challenge. To alleviate these issues, we introduce Llama-Sunshine (Sun-Shine), the first large language model for Tibetan culture, which is expert in various Tibetan language processing tasks. Sun-Shine incorporates state-of-the-art model architectures optimized for Tibetan's linguistic features. We also propose TIB-STC, a comprehensive dataset comprising diverse Tibetan texts such as literature, religious scripts, news, and conversational data, which is also the first large-scale dataset for Tibetan culture. Through comprehensive experiments, Sun-Shine not only demonstrates a higher level of knowledge expertise for Tibetan culture but also gains preliminary embodied intelligence capabilities in Tibetan language processing tasks, such as language modeling, text classification, machine translation, and syntactic analysis. Moreover, it excels in low-resource scenarios, showcasing strong generalization capabilities.