EvSign: Sign Language Recognition and Translation with Streaming Events

Abstract

Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as fast hand movement with motion blur and the signer's textured appearance. The bio-inspired event camera, which asynchronously captures brightness changes at high speed, can naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks. In this work, we explore the potential of event cameras in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote this research, we first collect an event-based benchmark, EvSign, for those tasks with both gloss and spoken language annotations. The EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. A sparse backbone is employed to extract visual features from sparse events. Then, temporal coherence is effectively utilized through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% computational cost (0.84G FLOPS per video) and 44.2% network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.
