Abstract
Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today's foundation models, we investigate whether these models can provide useful priors to automatically design and effectively wield such tools. We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.