PaLM-E: An Embodied Multimodal Language Model

  • 2023-03-06 18:58:06
  • Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
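The idea of a multi-modal sentence, in which encoded observations sit in the same sequence as text token embeddings, can be illustrated with a toy sketch. This is not the PaLM-E implementation: the embedding width, the `embed_text` stand-in, the linear `project` encoders, and every name and weight below are assumptions made purely for illustration.

```python
import random

EMBED_DIM = 4  # toy embedding width (assumption); real models use far larger values


def embed_text(token, dim=EMBED_DIM):
    """Hypothetical stand-in for a language model's token embedding lookup."""
    rng = random.Random(token)  # seeded by the token, so each token maps to a fixed vector
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def project(features, weights):
    """Linear map from a sensor feature vector into the token embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def build_multimodal_sentence(segments, encoders):
    """Interleave text embeddings and encoded continuous observations.

    segments: list of ("text", str) or (modality_name, feature_vector) pairs.
    encoders: dict mapping a modality name to a function that returns a list
              of embedding vectors for that observation.
    """
    sequence = []
    for kind, payload in segments:
        if kind == "text":
            # Whitespace splitting is a toy tokenizer (assumption).
            sequence.extend(embed_text(tok) for tok in payload.split())
        else:
            sequence.extend(encoders[kind](payload))
    return sequence


# Illustrative projection weights (assumption): one row per embedding dimension.
IMAGE_W = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.3, 0.3, 0.3], [0.2, 0.0, 0.4]]
STATE_W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]

encoders = {
    "image": lambda feats: [project(feats, IMAGE_W)],
    "state": lambda feats: [project(feats, STATE_W)],
}

segments = [
    ("text", "Given"),
    ("image", [0.5, 0.1, 0.9]),   # hypothetical image features
    ("text", "and robot state"),
    ("state", [0.2, -0.4]),       # hypothetical continuous state estimate
    ("text", "what should the robot do next?"),
]

sequence = build_multimodal_sentence(segments, encoders)
```

Every element of `sequence` has the same width regardless of its source modality, which is what lets a single sequence model consume text and sensor inputs together. In an end-to-end setup such as the one the abstract describes, the encoder parameters would be learned rather than fixed by hand as they are here.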


