The vast majority of language pairs in the world are low-resource because they have little, if any, parallel data available. Unfortunately, machine translation (MT) systems do not currently work well in this setting. Besides the technical challenges of learning with limited supervision, there is also another challenge: it is very difficult to evaluate methods trained on low-resource language pairs because there are very few freely and publicly available benchmarks. In this work, we take sentences from Wikipedia pages and introduce new evaluation datasets in two very low-resource language pairs, Nepali-English and Sinhala-English. These are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at https://github.com/facebookresearch/flores.