The blocks world is a classic toy domain that has long been used to build and test spatial reasoning systems. Despite its relative simplicity, tackling this domain in its full complexity requires the agent to exhibit a rich set of functional capabilities, ranging from vision to natural language understanding. There is currently a resurgence of interest in solving problems in such limited domains using modern techniques. In this work we tackle spatial question answering in a holistic way, using a vision system, speech input and output mediated by an animated avatar, a dialogue system that robustly interprets spatial queries, and a constraint solver that derives answers based on 3-D spatial modeling. The contributions of this work include a semantic parser that maps spatial questions into logical forms consistent with a general approach to meaning representation, a dialog manager based on a schema representation, and a constraint solver for spatial questions that provides answers in agreement with human perception. These and other components are integrated into a multi-modal human-computer interaction pipeline.