We study joint learning of a Convolutional Neural Network (CNN) and a Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step by step. As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO, which "Sees Out of tHe bOx": it takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks, following standard VLPT settings. Specifically, SOHO achieves absolute gains of 2.0% R@1 score on the MSCOCO text retrieval 5k test split, 1.5% accuracy on the NLVR$^2$ test-P split, and 6.7% accuracy on the SNLI-VE test split.
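To make the idea of a visual dictionary that is "updated on-the-fly" concrete, the following is a minimal sketch and not the actual SOHO implementation. It assumes that each visual feature is mapped to its nearest dictionary entry under squared Euclidean distance, and that each used entry is then moved toward the mean of its assigned features with a momentum moving average. The distance measure, the update rule, the `momentum` parameter, and the function names `nearest_index` and `quantize_and_update` are all illustrative assumptions that the text above does not specify.

```python
from typing import List

Vector = List[float]


def nearest_index(feature: Vector, dictionary: List[Vector]) -> int:
    """Return the index of the dictionary entry closest to `feature`.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(dictionary)),
               key=lambda j: sq_dist(feature, dictionary[j]))


def quantize_and_update(features: List[Vector],
                        dictionary: List[Vector],
                        momentum: float = 0.9) -> List[int]:
    """Assign each feature to a dictionary entry, then update used entries.

    Assumption: a used entry moves toward the mean of the features assigned
    to it via a momentum moving average. The dictionary is modified in place
    and the list of assigned indices is returned.
    """
    # Assignments are computed against the dictionary as it was before
    # any update in this call.
    indices = [nearest_index(f, dictionary) for f in features]

    for j in set(indices):
        assigned = [f for f, i in zip(features, indices) if i == j]
        # Per-dimension mean of the features assigned to entry j.
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        dictionary[j] = [momentum * d + (1.0 - momentum) * m
                         for d, m in zip(dictionary[j], mean)]
    return indices
```

In a sketch like this, the returned indices play the role of discrete visual tokens: features with similar semantics share an index, which is the kind of label a masked prediction task such as MVM could be trained to recover.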