CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Abstract

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use. We verify the model's performance on two different tasks of continual learning and show that it can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
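
To make the idea of "adjusting the language embeddings when needed" concrete, the following is a minimal sketch and not the authors' implementation. It assumes text and image embeddings are already available as unit vectors from a shared space (random vectors stand in for CLIP outputs here). The ridge-regression map, the sigmoid gate, and every name (`GatedAdjuster`, `ridge`, `sharpness`, `threshold`) are hypothetical choices made for illustration only.

```python
import numpy as np


def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class GatedAdjuster:
    """Hypothetical sketch: shift a text embedding toward its grounded image
    embedding, but only when the text resembles a previously taught example."""

    def __init__(self, ridge=1e-3, sharpness=20.0, threshold=0.8):
        self.ridge = ridge          # regularization strength (assumed value)
        self.sharpness = sharpness  # steepness of the sigmoid gate (assumed)
        self.threshold = threshold  # cosine similarity at which the gate is 0.5

    def fit(self, text_embs, image_embs):
        T = normalize(np.asarray(text_embs, dtype=float))
        I = normalize(np.asarray(image_embs, dtype=float))
        residual = I - T  # the shift each taught text embedding should receive
        d = T.shape[1]
        # Ridge regression from text embedding to residual:
        # W = (T^T T + ridge * Id)^-1 T^T residual
        self.W = np.linalg.solve(T.T @ T + self.ridge * np.eye(d), T.T @ residual)
        self.train_text = T
        return self

    def gate(self, t):
        # Close to 1 near a taught example, close to 0 for unrelated text.
        sim = np.max(self.train_text @ t)
        return 1.0 / (1.0 + np.exp(-self.sharpness * (sim - self.threshold)))

    def adjust(self, text_emb):
        t = normalize(np.asarray(text_emb, dtype=float))
        return normalize(t + self.gate(t) * (t @ self.W))


# Demo with random stand-ins for embeddings (no real CLIP model is loaded).
rng = np.random.default_rng(0)
texts = normalize(rng.normal(size=(3, 16)))   # three "new language use" examples
images = normalize(rng.normal(size=(3, 16)))  # their grounded image embeddings
model = GatedAdjuster().fit(texts, images)

before = float(texts[0] @ images[0])
after = float(model.adjust(texts[0]) @ images[0])
print(f"cosine to target before: {before:.3f}, after: {after:.3f}")
```

The gate is what keeps interference low in this sketch: text far from any taught example gets a gate value near zero, so its embedding, and therefore the underlying zero-shot behavior, stays essentially unchanged.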
