Abstract
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.