Grounded language acquisition -- learning how language-based interactions refer to the world around them -- is a major area of research in robotics, NLP, and HCI. In practice the data used for learning consists almost entirely of textual descriptions, which tend to be cleaner, clearer, and more grammatical than actual human interactions. In this work, we present the Grounded Language Dataset (GoLD), a multimodal dataset of common household objects described by people using either spoken or written language. We analyze the differences and present an experiment showing how the different modalities affect language learning from human input. This will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, text, and speech interact, as well as how differences in the vernacular of these modalities impact results.