Abstract
This study presents TSLFormer, a lightweight and robust word-level Turkish Sign Language (TSL) recognition model that treats sign gestures as ordered, string-like language. Instead of using raw RGB or depth videos, our method works only with 3D joint positions (articulation points) extracted using Google's MediaPipe library, focusing on the hand and torso skeletal locations. This substantially reduces input dimensionality while preserving the semantically important gesture information. Our approach recasts sign language recognition as sequence-to-sequence translation, inspired by the linguistic nature of sign languages and the success of transformers in natural language processing. Because TSLFormer uses the self-attention mechanism, it effectively captures temporal co-occurrence within gesture sequences and highlights meaningful motion patterns as words unfold. Evaluated on the AUTSL dataset, with over 36,000 samples and 227 different words, TSLFormer achieves competitive performance at minimal computational cost. These results show that joint-based input is sufficient for enabling real-time, mobile, and assistive communication systems for hearing-impaired individuals.
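To make the input representation concrete, the following is a minimal sketch, not the TSLFormer implementation. It flattens per-frame 3D joints into one token per frame and applies a single-head scaled dot-product self-attention across frames. Several details are assumptions made only for illustration: the torso joint count (`TORSO_JOINTS`), the toy coordinates, the comparison frame size (`ASSUMED_RGB_SHAPE`), and the use of identity projections in place of learned query, key, and value weights. The 21 landmarks per hand match what MediaPipe Hands reports.

```python
import math

# MediaPipe Hands reports 21 landmarks per hand.
HAND_JOINTS = 21
# Hypothetical torso subset (e.g. shoulders, elbows, hips); not from the paper.
TORSO_JOINTS = 6
JOINTS_PER_FRAME = 2 * HAND_JOINTS + TORSO_JOINTS
# Hypothetical RGB frame size, used only to illustrate the size reduction.
ASSUMED_RGB_SHAPE = (512, 512, 3)


def frame_to_token(joints):
    """Flatten a list of (x, y, z) joints into one feature vector."""
    return [coord for joint in joints for coord in joint]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over frames.

    Queries, keys, and values are the tokens themselves (identity
    projections); a trained model would use learned projections.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output frame is a weighted mix of all frames in the clip.
        outputs.append([sum(w * value[j] for w, value in zip(row, tokens))
                        for j in range(dim)])
    return outputs, weights


def toy_clip(num_frames):
    """Synthetic joint trajectories standing in for extracted landmarks."""
    return [[(0.01 * t, 0.02 * j, 0.0) for j in range(JOINTS_PER_FRAME)]
            for t in range(num_frames)]


clip = toy_clip(4)
tokens = [frame_to_token(frame) for frame in clip]
outputs, weights = self_attention(tokens)

values_per_frame = len(tokens[0])
rgb_values_per_frame = math.prod(ASSUMED_RGB_SHAPE)
print(f"{values_per_frame} joint values vs {rgb_values_per_frame} pixel values per frame")
```

Under these assumptions, each frame is described by a few hundred numbers rather than hundreds of thousands of pixel values, and each row of `weights` indicates how strongly one frame attends to every other frame in the clip.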