Visually grounded few-shot word learning in low-resource settings

Abstract

We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs. Given a set of test images and a spoken query, we ask the model which image depicts the query word. Previous work has simplified this few-shot learning problem by either using an artificial setting with digit word-image pairs or by using a large number of examples per class. Moreover, all previous studies were performed using English speech-image data. We propose an approach that can work on natural word-image pairs but with fewer examples, i.e. fewer shots, and then illustrate how this approach can be applied to multimodal few-shot learning in a real low-resource language, Yorùbá. Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images. Additionally, we use a word-to-image attention mechanism to determine word-image similarity. With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark. Many of the model's mistakes are due to confusion between visual concepts co-occurring in similar contexts. The experiments on Yorùbá show the benefit of transferring knowledge from a multimodal model trained on a larger set of English speech-image data.
