Abstract
Leading vision-language models (VLMs) are trained on general Internet content, overlooking scientific journals' rich, domain-specific knowledge. Training on specialty-specific literature could yield high-performance, task-specific tools, enabling generative AI to match generalist models in specialty publishing, educational, and clinical tasks. We created NeuroPubs, a multimodal dataset of 23,000 Neurosurgery Publications articles (134M words, 78K image-caption pairs). Using NeuroPubs, VLMs generated publication-ready graphical abstracts (70% of 100 abstracts) and board-style questions indistinguishable from human-written ones (54% of 89,587 questions). We used these questions to train CNS-Obsidian, a 34B-parameter VLM. In a blinded, randomized controlled trial, our model demonstrated non-inferiority to then state-of-the-art GPT-4o in neurosurgical differential diagnosis (clinical utility, 40.62% upvotes vs. 57.89%, p=0.1150; accuracy, 59.38% vs. 65.79%, p=0.3797). Our pilot study demonstrates how training generative AI models on specialty-specific journal content, without large-scale internet data, results in high-performance academic and clinical tools, enabling domain-tailored AI across diverse fields.