Current vision language pretraining models are dominated by methods that use region visual features extracted from object detectors. Despite their good performance, the extract-then-process pipeline significantly restricts inference speed and therefore limits their real-world use cases. However, training vision language models from raw image pixels is difficult, as raw pixels provide much less prior knowledge than region features. In this paper, we systematically study how to leverage auxiliary visual pretraining tasks to help train end-to-end vision language models. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region feature models, our end-to-end models achieve similar or better performance on downstream tasks and run more than 10 times faster during inference. Compared with other end-to-end models, our proposed method achieves similar or better performance when pretrained for only 10% of the pretraining GPU hours.