We study the problem of jointly reasoning about language and vision through a navigation and spatial reasoning task. We introduce the Touchdown task and dataset, where an agent must first follow navigation instructions in a real-life visual urban environment to a goal position, and then identify in the observed image a location described in natural language to find a hidden object. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations. We perform qualitative linguistic analysis, and show that the data displays richer use of spatial reasoning compared to related resources. Empirical analysis shows the data presents an open challenge to existing methods.