Abstract
Both manual markers (relating to the use of hands) and non-manual markers (NMMs), such as facial expressions or mouthing cues, are important for providing the complete meaning of phrases in American Sign Language (ASL). Efforts have been made to advance sign language to spoken/written language understanding, but most have focused primarily on manual features. In this work, using advanced neural machine translation methods, we examine and report on the extent to which facial expressions contribute to understanding sign language phrases. We present a sign language translation architecture consisting of two-stream encoders, with one encoder handling the face and the other handling the upper body (with hands). We propose a new parallel cross-attention decoding mechanism that is useful for quantifying the influence of each input modality on the output. The two streams from the encoder are directed simultaneously to different attention stacks in the decoder. Examining the properties of the parallel cross-attention weights allows us to analyze the importance of facial markers compared to body and hand features during a translation task.
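To make the idea of parallel cross-attention concrete, the following is a minimal sketch for a single decoder query, not the architecture evaluated in this work. It assumes plain scaled dot-product attention with no learned projections (each memory vector serves as both key and value), fuses the two context vectors by simple averaging, and uses the entropy of each weight distribution as one possible property to inspect. The function names, the fusion rule, and the entropy measure are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, memory):
    """Scaled dot-product attention of one query over one encoder stream.

    Assumption: no learned projections; each memory vector is used
    as both key and value.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, memory)) for i in range(dim)]
    return context, weights


def parallel_cross_attention(query, face_memory, body_memory):
    """Attend to the face and body streams separately, then fuse.

    Assumption: the two contexts are fused by averaging; the actual
    fusion rule is not specified in the text above.
    """
    face_ctx, face_w = cross_attention(query, face_memory)
    body_ctx, body_w = cross_attention(query, body_memory)
    fused = [(f + b) / 2 for f, b in zip(face_ctx, body_ctx)]
    return fused, face_w, body_w


def entropy(weights):
    """Shannon entropy of an attention distribution (hypothetical analysis measure)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Toy inputs: 2 face frames and 3 body frames, feature dimension 2.
query = [1.0, 0.0]
face_memory = [[1.0, 0.0], [0.0, 1.0]]
body_memory = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

fused, face_w, body_w = parallel_cross_attention(query, face_memory, body_memory)
```

Because each stream has its own softmax, the face and body weights are normalized independently and can be inspected side by side, which is what makes a per-modality comparison possible in this kind of design.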