Abstract
Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilities from base language models without explicit reasoning supervision, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain, leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.