Abstract
Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and generalize to sentences that contain unseen words. Motivated by human language learning, in this dissertation, we consider a family of machine learning tasks that aim to learn language structures through grounding. We seek distant supervision from other data sources (i.e., grounds), including but not limited to other modalities (e.g., vision), execution results of programs, and other languages. We demonstrate the potential of this task formulation and advocate for its adoption through three schemes. In Part I, we consider learning syntactic parses through visual grounding. We propose the task of visually grounded grammar induction, present the first models to induce syntactic structures from visually grounded text and speech, and find that the visual grounding signals can help improve the parsing quality over language-only models. As a side contribution, we propose a novel evaluation metric that enables the evaluation of speech parsing without text or automatic speech recognition systems involved. In Part II, we propose two execution-aware methods to map sentences into corresponding semantic structures (i.e., programs), significantly improving compositional generalization and few-shot program synthesis. In Part III, we propose methods that learn language structures from annotations in other languages. Specifically, we propose a method that sets a new state of the art on cross-lingual word alignment. We then leverage the learned word alignments to improve the performance of zero-shot cross-lingual dependency parsing, by proposing a novel substructure-based projection method that preserves structural knowledge learned from the source language.