LLaSM: Large Language and Speech Model

Abstract

Multi-modal large language models have garnered significant interest recently. However, most of this work focuses on vision-language multi-modal models that provide strong capabilities in following vision-and-language instructions. We claim that speech is also an important modality through which humans interact with the world. Hence, it is crucial for a general-purpose assistant to be able to follow multi-modal speech-and-language instructions. In this work, we propose Large Language and Speech Model (LLaSM). LLaSM is an end-to-end trained large multi-modal speech-language model with cross-modal conversational abilities, capable of following speech-and-language instructions. Our early experiments show that LLaSM demonstrates a more convenient and natural way for humans to interact with artificial intelligence. We also release a large speech instruction-following dataset, LLaSM-Audio-Instructions. Code and demo are available at https://github.com/LinkSoul-AI/LLaSM and https://huggingface.co/spaces/LinkSoul/LLaSM. The LLaSM-Audio-Instructions dataset is available at https://huggingface.co/datasets/LinkSoul/LLaSM-Audio-Instructions.
