Abstract
Sign language recognition (SLR) is a machine learning task that aims to identify signs in videos. Because annotated data is scarce, unsupervised methods such as contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer together and pushing negative pairs (pairs formed from different instances) apart. In a sign video, only certain parts carry information that is truly useful for recognition. Applying contrastive methods to SLR therefore raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without accounting for the greater relevance of certain parts; (ii) movements shared between different signs make negative pairs highly similar, which complicates sign discrimination. These issues lead to non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. The framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. The framework shows a considerable gain in accuracy over several contrastive and self-supervised methods across linear evaluation, semi-supervised learning, and transferability between sign languages.
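
For readers unfamiliar with the contrastive objective described above, the following is a minimal sketch of a generic InfoNCE-style loss. It is not the framework proposed in this paper. The function name `info_nce_loss`, the `temperature` value, and the use of in-batch negatives are illustrative assumptions.

```python
import numpy as np


def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's method).

    z1[i] and z2[i] are embeddings of two augmented views of the same
    video (a positive pair). z1[i] and z2[j] with j != i are treated as
    negative pairs. Both arrays have shape (N, D).
    """
    # L2-normalise so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares z1[i] with z2[j].
    logits = z1 @ z2.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs sit on the diagonal; minimising the loss pulls them
    # together and pushes the off-diagonal (negative) pairs apart.
    return -np.mean(np.diag(log_prob))
```

In this sketch every off-diagonal pair is penalised for being similar, so two different signs that share a movement are pushed apart even where they genuinely overlap, which illustrates issue (ii) above.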