Abstract
Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate daily software, like Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks to evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.
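
To make the GCC interface concrete, the following is a minimal illustrative sketch of the screenshot-in, keyboard-and-mouse-out loop described above. It is not Cradle's implementation: every name here (`Screenshot`, `KeyPress`, `MouseMove`, `MouseClick`, `FakeDesktop`, `run_episode`, `click_center_policy`) is a hypothetical stand-in, the environment is a stub that only records actions, and the policy is a fixed rule standing in for the LMM-driven planning.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Screenshot:
    """Observation under GCC: nothing but raw screen pixels (hypothetical type)."""
    width: int
    height: int
    pixels: bytes  # raw RGB bytes, 3 per pixel


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class MouseMove:
    x: int
    y: int


@dataclass(frozen=True)
class MouseClick:
    button: str = "left"


# Action space under GCC: only low-level keyboard and mouse events.
Action = Union[KeyPress, MouseMove, MouseClick]
Policy = Callable[[Screenshot], List[Action]]


class FakeDesktop:
    """Stub standing in for real software: returns blank frames, logs actions."""

    def __init__(self, width: int = 4, height: int = 3) -> None:
        self.width = width
        self.height = height
        self.log: List[Action] = []

    def capture(self) -> Screenshot:
        blank = bytes(self.width * self.height * 3)
        return Screenshot(self.width, self.height, blank)

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_episode(env: FakeDesktop, policy: Policy, max_steps: int) -> List[Action]:
    """Observe a screenshot, ask the policy for actions, execute them, repeat."""
    for _ in range(max_steps):
        frame = env.capture()
        actions = policy(frame)
        if not actions:  # an empty action list ends the episode
            break
        for action in actions:
            env.execute(action)
    return env.log


def click_center_policy(frame: Screenshot) -> List[Action]:
    """Placeholder policy: move to the screen center and left-click."""
    return [MouseMove(frame.width // 2, frame.height // 2), MouseClick("left")]


if __name__ == "__main__":
    for action in run_episode(FakeDesktop(), click_center_policy, max_steps=2):
        print(action)
```

The point of the sketch is that the agent never sees application-specific state or calls application-specific APIs: swapping `FakeDesktop` for a different piece of software would leave the observation and action types unchanged.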