Abstract
Scheduled sampling is a technique for avoiding one of the known problems in sequence-to-sequence generation: exposure bias. It consists of feeding the model a mix of the teacher-forced embeddings and the model predictions from the previous step at training time. The technique has been used to improve model performance with recurrent neural networks (RNNs). In the Transformer model, unlike the RNN, the generation of a new word attends to the full sentence generated so far, not only to the last word, so it is not straightforward to apply the scheduled sampling technique. We propose some structural changes that allow scheduled sampling to be applied to the Transformer architecture, via a two-pass decoding strategy. Experiments on two language pairs achieve performance close to a teacher-forcing baseline and show that this technique is promising for further exploration.
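As a rough illustration of the two-pass idea, the sketch below runs a decoder once on the gold (teacher-forced) inputs, shifts its predictions right by one position, and then builds the second-pass input by choosing, per position, between the gold token and the first-pass prediction. This is a minimal sketch under stated assumptions, not the implementation evaluated here: the names `mix_decoder_inputs` and `two_pass_decode` are hypothetical, `decoder` is a stand-in callable that maps input token ids to one predicted id per position, and the mix is done by selecting whole tokens rather than by combining embeddings.

```python
import random
from typing import Callable, List, Sequence


def mix_decoder_inputs(
    gold: Sequence[int],
    predicted: Sequence[int],
    teacher_forcing_prob: float,
    rng: random.Random,
) -> List[int]:
    """Per position, keep the gold token with probability
    teacher_forcing_prob, otherwise use the model's own prediction."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return [
        g if rng.random() < teacher_forcing_prob else p
        for g, p in zip(gold, predicted)
    ]


def two_pass_decode(
    decoder: Callable[[Sequence[int]], List[int]],  # hypothetical stand-in
    gold_inputs: Sequence[int],
    teacher_forcing_prob: float,
    rng: random.Random,
) -> List[int]:
    """Two decoder passes over the same target sequence."""
    # Pass 1: plain teacher forcing, one prediction per position.
    first_pass = decoder(gold_inputs)
    # The prediction made at position t is a candidate input for
    # position t + 1, so shift right and keep the start token in place.
    candidates = [gold_inputs[0]] + list(first_pass[:-1])
    # Pass 2: decode again from the gold/predicted mix.
    mixed = mix_decoder_inputs(gold_inputs, candidates, teacher_forcing_prob, rng)
    return decoder(mixed)


# Toy usage: the "decoder" just adds 1 to each input id (assumption for demo).
toy_decoder = lambda ids: [i + 1 for i in ids]
outputs = two_pass_decode(toy_decoder, [0, 5, 6, 7], 0.5, random.Random(0))
```

With `teacher_forcing_prob` set to 1.0 the second pass reduces to ordinary teacher forcing, and with 0.0 every position after the start token is conditioned on the model's own first-pass predictions. A schedule would typically lower this probability over the course of training.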