CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Abstract

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model's performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
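
As a rough illustration of the adjustment described in the abstract, the sketch below adds a predicted difference vector to a language embedding, weighted by a predicted scaling factor. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `adjust_embedding`, `predict_diff`, and `predict_gate_logit` are hypothetical, the sigmoid squashing of the scaling factor is an assumption, and the toy predictors stand in for whatever learned functions the paper actually uses.

```python
import math


def sigmoid(x):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adjust_embedding(text_emb, predict_diff, predict_gate_logit):
    """Return text_emb + scale * diff (hypothetical form of the adjustment).

    predict_diff and predict_gate_logit are placeholders for learned
    predictors; both take the original language embedding as input.
    """
    diff = predict_diff(text_emb)                    # predicted difference vector
    scale = sigmoid(predict_gate_logit(text_emb))    # assumed scaling factor in (0, 1)
    return [e + scale * d for e, d in zip(text_emb, diff)]


# Toy stand-ins for the learned predictors (assumptions for illustration only).
emb = [0.2, -0.5, 0.1]
toy_diff = lambda e: [1.0, 1.0, 1.0]

# Scaling factor near 0: the original embedding is left essentially untouched.
unchanged = adjust_embedding(emb, toy_diff, lambda e: -50.0)

# Scaling factor near 1: the full difference vector is applied.
shifted = adjust_embedding(emb, toy_diff, lambda e: 50.0)
```

When the scaling factor is close to zero the embedding passes through unchanged, which is one way to read the abstract's claim of little interference with the original zero-shot behavior.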
