Abstract
This paper presents GRASP, a novel benchmark to evaluate the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs). This evaluation is accomplished via a two-tier approach leveraging Unity simulations. The first level tests for language grounding by assessing a model's ability to relate simple textual descriptions with visual information. The second level evaluates the model's understanding of "Intuitive Physics" principles, such as object permanence and continuity. In addition to releasing the benchmark, we use it to evaluate several state-of-the-art multimodal LLMs. Our evaluation reveals significant shortcomings in the language grounding and intuitive physics capabilities of these models. Although they exhibit at least some grounding capabilities, particularly for colors and shapes, these capabilities depend heavily on the prompting strategy. At the same time, all models perform below or at the chance level of 50% in the Intuitive Physics tests, while human subjects are on average 80% correct. These identified limitations underline the importance of using benchmarks like GRASP to monitor the progress of future models in developing these competencies.