Abstract
Medical reports contain rich clinical information but are often unstructured and written in domain-specific language, posing challenges for information extraction. While proprietary large language models (LLMs) have shown promise in clinical natural language processing, their lack of transparency and the data privacy concerns they raise limit their utility in healthcare. This study therefore evaluates nine open-source generative LLMs on the DRAGON benchmark, which includes 28 clinical information extraction tasks in Dutch. We developed \texttt{llm\_extractinator}, a publicly available framework for information extraction using open-source generative LLMs, and used it to assess model performance in a zero-shot setting. Several 14-billion-parameter models (Phi-4-14B, Qwen-2.5-14B, and DeepSeek-R1-14B) achieved competitive results, while the larger Llama-3.3-70B model achieved slightly higher performance at greater computational cost. Translation to English prior to inference consistently degraded performance, highlighting the need for native-language processing. These findings demonstrate that open-source LLMs, when used with our framework, offer effective, scalable, and privacy-conscious solutions for clinical information extraction in low-resource settings.