Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model

  • 2025-05-16 17:33:36
  • Phan Tran Minh Dat, Vo Hoang Nhat Khang, Quan Thanh Tho

Abstract

This work explores the path towards Bahnaric-Vietnamese translation as a means of culturally bridging the two ethnic groups in Vietnam. Translating from Bahnaric to Vietnamese poses several difficulties. The most prominent is the lack of original resources for the Bahnaric source language, including vocabulary, grammar, dialogue patterns, and bilingual corpora, which hinders the collection of training data. To address this, we adopt a transfer learning approach built on a sequence-to-sequence pre-trained language model. First, we take a pre-trained Vietnamese language model to capture the characteristics of that language. In particular, because the goal is machine translation, we use a sequence-to-sequence model rather than an encoder-only model like BERT or a decoder-only model like GPT. Taking advantage of the significant similarity between the two languages, we then continue training the model on the currently limited bilingual Vietnamese-Bahnaric text, transferring it from language modeling to machine translation. This approach helps handle the imbalance in resources between the two languages while also reducing training and computational cost. Additionally, we enlarge the datasets using data augmentation to generate additional resources, and we define several heuristic methods to make the translation more precise. Our approach has been validated as highly effective for Bahnaric-Vietnamese translation, contributing to the expansion and preservation of languages and facilitating better mutual understanding between the two ethnic groups.

 
