Abstract
Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.