Abstract
Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities in America. The Second AmericasNLP (Americas Natural Language Processing) Competition Track 1 of NeurIPS (Neural Information Processing Systems) 2022 proposed the task of training automatic speech recognition (ASR) systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. We systematically investigate, using a Bayesian search, the impact of the different hyperparameters on the Wav2vec2.0 XLS-R (Cross-Lingual Speech Representations) variants of 300 M and 1 B parameters. Our findings indicate that data and detailed hyperparameter tuning significantly affect ASR accuracy, but language complexity determines the final result. The Quechua model achieved the lowest character error rate (CER) (12.14), while the Kotiria model, despite having the most extensive dataset during the fine-tuning phase, showed the highest CER (36.59). Conversely, with the smallest dataset, the Guarani model achieved a CER of 15.59, while Bribri and Wa'ikhana obtained, respectively, CERs of 34.70 and 35.23. Additionally, Sobol' sensitivity analysis highlighted the crucial roles of freeze fine-tuning updates and dropout rates. We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria. This work opens avenues for future research to advance ASR techniques in preserving minority Indigenous languages.
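For readers unfamiliar with the metric, the character error rate reported above can be illustrated with a minimal sketch. It assumes the common definition of CER, the character-level edit distance between reference and hypothesis divided by the reference length, and assumes the scores are expressed as percentages. The function names `edit_distance` and `cer` are illustrative only; the text normalization and scoring tool actually used in this work are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    # prev[j] holds the distance between the first i-1 items of ref and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a percentage of the reference length (assumed convention)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings, not data from the paper: 3 edits over 6 reference characters.
print(cer("kitten", "sitting"))  # 50.0
```

Under this convention, a CER of 12.14 corresponds to roughly 12 character edits per 100 reference characters.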