Abstract
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at a distinct temporal resolution to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks without adding inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state of the art on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
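As a rough illustration of the two-pathway idea and of exchanging information in both directions, the following NumPy sketch may help. It is not the authors' implementation: the function names `split_pathways` and `bidirectional_fuse`, the temporal stride `alpha`, the equal channel width in both pathways, and the use of average pooling, frame repetition, and additive fusion are all assumptions made for illustration only.

```python
import numpy as np


def split_pathways(frames, alpha=4):
    """Split per-frame features of shape (T, C) into two temporal resolutions.

    Hypothetical helper: the slow pathway keeps every alpha-th frame,
    the fast pathway keeps all frames. T must be divisible by alpha.
    """
    if frames.shape[0] % alpha != 0:
        raise ValueError("number of frames must be divisible by alpha")
    slow = frames[::alpha]  # (T / alpha, C): low temporal resolution
    fast = frames           # (T, C): full temporal resolution
    return slow, fast


def bidirectional_fuse(slow, fast, alpha=4):
    """Exchange features between pathways in both directions (assumed scheme).

    fast -> slow: average each window of alpha fast steps, then add to slow.
    slow -> fast: repeat each slow step alpha times, then add to fast.
    """
    t_slow, channels = slow.shape
    # Pool the fast features down to the slow pathway's temporal length.
    fast_pooled = fast.reshape(t_slow, alpha, channels).mean(axis=1)
    # Upsample the slow features to the fast pathway's temporal length.
    slow_upsampled = np.repeat(slow, alpha, axis=0)
    return slow + fast_pooled, fast + slow_upsampled
```

With 8 frames and `alpha=4`, the slow pathway sees 2 time steps and the fast pathway sees all 8; after fusion each pathway keeps its own temporal length while carrying information from the other.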