LLaSM: Large Language and Speech Model

  • 2023-09-12 04:41:35
  • Yu Shu, Siwei Dong, Guangyao Chen, Wenhao Huang, Ruihua Zhang, Daochen Shi, Qiqi Xiang, Yemin Shi
Multi-modal large language models have garnered significant interest recently. However, most existing work focuses on vision-language multi-modal models, which provide strong capabilities in following vision-and-language instructions. We argue that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose the Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM offers a more convenient and natural way for humans to interact with artificial intelligence. We also release a large speech instruction-following dataset, LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.


