AlpaCare: Instruction-tuned Large Language Models for Medical Application

Abstract

Instruction-finetuning (IFT) has become crucial in aligning Large Language Models (LLMs) with diverse human needs and has shown great potential in medical applications. However, previous studies mainly fine-tune LLMs on biomedical datasets with limited diversity, which often rely on benchmarks or narrow task scopes and hence significantly limit the models' medical instruction-following ability and generalizability. To bridge this gap, we propose creating a diverse, machine-generated medical IFT dataset, MedInstruct-52k, using GPT-4 and ChatGPT with a high-quality expert-curated seed set. We then fine-tune LLaMA-series models on the dataset to develop AlpaCare. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare not only demonstrates superior performance on medical applications, with up to 38.1% absolute gain over the best baselines in medical free-form instruction evaluations, but also achieves 6.7% absolute gains averaged over multiple general domain benchmarks. Human evaluation further shows that AlpaCare consistently outperforms the best baselines in terms of both correctness and helpfulness. We offer public access to our data, model, and codebase at https://github.com/XZhang97666/AlpaCare.
