Sign language translation (SLT), which generates text in a spoken language from visual content in a sign language, is important for assisting the hard-of-hearing community in their communication. Inspired by neural machine translation (NMT), most existing SLT studies adopted a general sequence-to-sequence learning strategy. However, SLT differs significantly from general NMT tasks because sign languages convey messages through multiple visual-manual aspects. Therefore, in this paper, these unique characteristics of sign languages are formulated as hierarchical spatio-temporal graph representations, including high-level and fine-level graphs in which each vertex characterizes a specific body part and each edge represents the interaction between parts. Specifically, high-level graphs represent the patterns of regions such as the hands and face, while fine-level graphs consider the joints of the hands and the landmarks of the facial regions. To learn these graph patterns, a novel deep learning architecture, namely the hierarchical spatio-temporal graph neural network (HST-GNN), is proposed. Graph convolutions and graph self-attentions with neighborhood context are proposed to characterize both the local and the global graph properties. Experimental results on benchmark datasets demonstrated the effectiveness of the proposed method.
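The abstract does not give layer equations, so the NumPy code below is only a toy sketch of how a fine-to-high-level graph hierarchy with local graph convolution and global self-attention could be wired together. It is not the HST-GNN implementation. Every concrete choice is an assumption made for illustration: the hypothetical 6-keypoint skeleton (`FINE_EDGES`, `HIGH_EDGES`, `REGION_OF`), the symmetric-normalized graph convolution, mean pooling from fine-level to high-level vertices, computing attention keys and values from neighborhood-aggregated features as one possible reading of "neighborhood context", and a sliding-window average over frames standing in for the temporal modeling.

```python
import numpy as np

# Hypothetical toy skeleton (assumption): 6 fine-level vertices,
# 0-1 left-hand joints, 2-3 right-hand joints, 4-5 face landmarks.
FINE_EDGES = [(0, 1), (2, 3), (4, 5)]
# Hypothetical high-level regions: 0 = left hand, 1 = right hand, 2 = face.
HIGH_EDGES = [(0, 1), (0, 2), (1, 2)]
REGION_OF = [0, 0, 1, 1, 2, 2]


def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from an undirected edge list."""
    a = np.zeros((n, n))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a


def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(h, adj_norm, w):
    """Local property: each vertex mixes its own and its neighbors' features."""
    return np.maximum(adj_norm @ h @ w, 0.0)  # ReLU


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def graph_self_attention(h, adj_norm, wq, wk, wv):
    """Global property: every vertex attends to every vertex. Keys and values
    come from neighborhood-aggregated features (an assumed design)."""
    ctx = adj_norm @ h
    q, k, v = h @ wq, ctx @ wk, ctx @ wv
    scores = q @ k.T / np.sqrt(q.shape[1])
    return softmax(scores, axis=-1) @ v


def pool_to_high_level(h_fine, assignment):
    """Average fine-level vertices (joints, landmarks) into their regions."""
    counts = assignment.sum(axis=0)  # number of fine vertices per region
    return (assignment.T @ h_fine) / counts[:, None]


def hst_gnn_forward(frames, params):
    """frames: (T, V_fine, C) keypoint features -> (T, V_high, hidden)."""
    a_fine = normalized_adjacency(adjacency(6, FINE_EDGES))
    a_high = normalized_adjacency(adjacency(3, HIGH_EDGES))
    assign = np.eye(3)[REGION_OF]  # (6, 3) one-hot region membership
    out = []
    for h in frames:
        h = graph_conv(h, a_fine, params["w_fine"])   # fine level, local
        g = pool_to_high_level(h, assign)             # fine -> high level
        g = graph_conv(g, a_high, params["w_high"])   # high level, local
        g = g + graph_self_attention(g, a_high, *params["attn"])  # global
        out.append(g)
    return np.stack(out)


def temporal_mean(x, window=3):
    """Average each vertex over a sliding window of neighboring frames."""
    pad = window // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack([xp[t:t + window].mean(axis=0) for t in range(x.shape[0])])


rng = np.random.default_rng(0)
params = {
    "w_fine": rng.normal(size=(2, 8)) * 0.5,
    "w_high": rng.normal(size=(8, 8)) * 0.5,
    "attn": tuple(rng.normal(size=(8, 8)) * 0.5 for _ in range(3)),
}
frames = rng.normal(size=(5, 6, 2))  # 5 frames, 6 keypoints, (x, y) coordinates
features = temporal_mean(hst_gnn_forward(frames, params))
print(features.shape)  # (5, 3, 8)
```

In this sketch the graph convolution only mixes features along edges, which matches the "local" role described above, while the attention step lets every high-level region (for example, the left hand) interact with every other region in the same frame, which matches the "global" role. A real model would learn the weights and feed the resulting per-frame region features into a text decoder; neither step is shown here.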