Machine translation is a popular test bed for research in neural sequence-to-sequence models but despite much recent research, there is still a lack of understanding of these models. Practitioners report performance degradation with large beams, the under-estimation of rare words and a lack of diversity in the final translations. Our study relates some of these issues to the inherent uncertainty of the task, due to the existence of multiple valid translations for a single source sentence, and to the extrinsic uncertainty caused by noisy training data. We propose tools and metrics to assess how uncertainty in the data is captured by the model distribution and how it affects search strategies that generate translations. Our results show that search works remarkably well but that the models tend to spread too much probability mass over the hypothesis space. Next, we propose tools to assess model calibration and show how to easily fix some shortcomings of current models. We release both code and multiple human reference translations for two popular benchmarks.