Abstract
The Transformer architecture, built by stacking a sequence of encoder and decoder layers, has driven significant progress in neural machine translation. However, the vanilla Transformer mainly exploits the top-layer representation, assuming that the lower layers provide trivial or redundant information, and thus ignores bottom-layer features that are potentially valuable. In this work, we propose the Group-Transformer model (GTrans), which flexibly divides the multi-layer representations of both the encoder and the decoder into different groups and then fuses these group features to generate target words. To corroborate the effectiveness of the proposed method, we conduct extensive experiments and analyses on three bilingual translation benchmarks and two multilingual translation tasks, including the IWSLT-14, IWSLT-17, LDC, WMT-14, and OPUS-100 benchmarks. Experimental and analytical results demonstrate that our model consistently outperforms its Transformer counterparts. Furthermore, it can be successfully scaled up to 60 encoder layers and 36 decoder layers.
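To make the idea of grouping and fusing layer representations concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the GTrans implementation: the abstract does not specify how groups are formed or fused, so the contiguous near-equal split, the mean within each group, and the softmax-weighted sum across groups are all assumptions, and the names `split_into_groups`, `fuse_groups`, and `group_scores` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def split_into_groups(layers: List[Vector], num_groups: int) -> List[List[Vector]]:
    """Split per-layer outputs into contiguous groups of near-equal size.

    Assumption: groups are contiguous and ordered from bottom to top layer.
    """
    if not 1 <= num_groups <= len(layers):
        raise ValueError("num_groups must be between 1 and the number of layers")
    base, extra = divmod(len(layers), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # earlier groups absorb the remainder
        groups.append(layers[start:start + size])
        start += size
    return groups


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_groups(layers: List[Vector], num_groups: int, group_scores: List[float]) -> Vector:
    """Fuse layer outputs: mean within each group, weighted sum across groups.

    Assumption: group_scores stands in for learned fusion parameters.
    """
    if len(group_scores) != num_groups:
        raise ValueError("need exactly one score per group")
    group_features = [mean_vector(g) for g in split_into_groups(layers, num_groups)]
    weights = softmax(group_scores)
    dim = len(group_features[0])
    return [sum(w * f[i] for w, f in zip(weights, group_features)) for i in range(dim)]


# Four toy "layer outputs" for a single token, each of dimension 2.
toy_layers = [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]]
fused = fuse_groups(toy_layers, num_groups=2, group_scores=[0.0, 0.0])
```

With equal scores, both groups contribute equally, so the fused vector is the average of the lower-group mean and the upper-group mean. In a trained model the scores would be learned, letting the decoder draw on lower-layer features rather than the top layer alone.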