Abstract
Large language models (LLMs) have exhibited remarkable capabilities across various domains and tasks, pushing the boundaries of our knowledge in learning and cognition. The latest model, OpenAI's o1, stands out as the first LLM with an internalized chain-of-thought technique using reinforcement learning strategies. While it has demonstrated surprisingly strong capabilities on various general language tasks, its performance in specialized fields such as medicine remains unknown. To this end, this report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality. Specifically, our evaluation encompasses 6 tasks using data from 37 medical datasets, including two newly constructed and more challenging question-answering (QA) tasks based on professional medical quizzes from the New England Journal of Medicine (NEJM) and The Lancet. These datasets offer greater clinical relevance than standard medical QA benchmarks such as MedQA, translating more effectively into real-world clinical utility. Our analysis of o1 suggests that the enhanced reasoning ability of LLMs may (significantly) benefit their capability to understand various medical instructions and reason through complex clinical scenarios. Notably, o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios, respectively. At the same time, we identify several weaknesses in both the model capability and the existing evaluation protocols, including hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. We release our raw data and model outputs at https://ucsc-vlaa.github.io/o1_medicine/ for future research.