Abstract
In this paper, we introduce S-MedQA, an English medical question-answering (QA) dataset for benchmarking large language models in fine-grained clinical specialties. We use S-MedQA to test the applicability of a popular hypothesis about knowledge injection in the knowledge-intensive scenario of medical QA, and show that: 1) training on data from a specialty does not necessarily lead to the best performance on that specialty, and 2) regardless of the specialty fine-tuned on, token probabilities of clinically relevant terms for all specialties increase consistently. Thus, we believe the performance gains come mostly from domain shifting (e.g., general to medical) rather than knowledge injection, and we suggest rethinking the role of fine-tuning data in the medical domain. We release S-MedQA and all code needed to reproduce our experiments to the research community.