Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
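The joint objective described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the temperature value, the loss weights `w_con` and `w_cap`, and the choice to sum the two contrastive directions are all assumptions made for illustration, since the abstract states only that a contrastive loss on unimodal embeddings is combined with an autoregressive captioning loss on the multimodal decoder outputs.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired unimodal embeddings.

    image_emb[k] and text_emb[k] are assumed to be a matching pair.
    The temperature is an assumed hyperparameter, not taken from the abstract.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    # Image-to-text: row k should assign the most mass to column k.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: column k should assign the most mass to row k.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy.

    token_logits[t] holds the multimodal decoder's vocabulary logits at
    step t; target_ids[t] is the token it should predict at that step.
    """
    nll = [-log_softmax(step)[y] for step, y in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)


def joint_loss(image_emb, text_emb, token_logits, target_ids,
               w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_emb, text_emb)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In a real model, `text_emb` would come from the first half of the decoder (no cross-attention) and `token_logits` from the second half (cross-attending to the image encoder), so a single forward pass supplies the inputs to both terms.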