Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera's field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure of the environment and the associated rooms' semantic labels. Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy -- substantially better than the state-of-the-art approach for extrapolating visual maps.