This research addresses natural language moment retrieval in long, untrimmed video streams. The problem is non-trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. Existing approaches, however, treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present Moment Alignment Network (MAN), a novel framework that unifies candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks, Charades-STA and DiDeMo, where MAN outperforms the state of the art by a large margin.