All-in-One Image-Grounded Conversational Agents

Abstract

As single-task accuracy on individual language and image tasks has improved substantially in the last few years, the long-term goal of a generally skilled agent that can both see and talk becomes more feasible to explore. In this work, we focus on leveraging individual language and image tasks, along with resources that incorporate both vision and language towards that objective. We design an architecture that combines state-of-the-art Transformer and ResNeXt modules fed into a novel attentive multimodal module to produce a combined model trained on many tasks. We provide a thorough analysis of the components of the model, and transfer performance when training on one, some, or all of the tasks. Our final models provide a single system that obtains good results on all vision and language tasks considered, and improves the state-of-the-art in image-grounded conversational applications.
