Abstract
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.