We propose the Unified Visual-Semantic Embeddings (Unified VSE) for learning a joint space of visual representation and textual semantics. The model unifies the embeddings of concepts at different levels: objects, attributes, relations, and full scenes. We view the sentential semantics as a combination of different semantic components such as objects and relations; their embeddings are aligned with different image regions. A contrastive learning approach is proposed for the effective learning of this fine-grained alignment from only image-caption pairs. We also present a simple yet effective approach that enforces the coverage of caption embeddings on the semantic components that appear in the sentence. We demonstrate that the Unified VSE outperforms baselines on cross-modal retrieval tasks; enforcing semantic coverage improves the model's robustness against text-domain adversarial attacks. Moreover, our model enables the use of visual cues to accurately resolve word dependencies in novel sentences.