Abstract
Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit emergent cross-lingual abilities in low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.