Abstract
We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or a specific sign language, for which data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively on out-of-domain downstream tasks such as isolated sign language recognition, given essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
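
To make the idea of projecting text and sign language video into one space concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive objective and of nearest-neighbor retrieval in the shared space. It is an illustration under stated assumptions, not the SignCLIP implementation: the function names, the temperature value of 0.07, and the use of plain NumPy arrays as stand-ins for encoder outputs are all hypothetical choices made here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of video_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    The temperature is an assumed constant (CLIP treats it as learnable).
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # shape: (batch, batch)
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def retrieve_top1(query_emb, candidate_emb):
    """Index of the most similar candidate for each query (cosine similarity)."""
    sims = l2_normalize(query_emb) @ l2_normalize(candidate_emb).T
    return sims.argmax(axis=1)


# Toy stand-ins for encoder outputs: 4 paired examples in an 8-dim space.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
video_emb = text_emb + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs

aligned_loss = clip_contrastive_loss(video_emb, text_emb)
shuffled_loss = clip_contrastive_loss(np.roll(video_emb, 1, axis=0), text_emb)
top1 = retrieve_top1(video_emb, text_emb)
```

In this toy setup the loss is lower when the pairs are aligned than when the video rows are shifted against the text rows, and the same similarity matrix that drives the loss also serves video-to-text retrieval; swapping the arguments of `retrieve_top1` gives text-to-video retrieval.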