Abstract
Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.