A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and, more recently, textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings become available and are incorporated, it is natural to ask: how can the heterogeneous set of encodings be leveraged efficiently and effectively? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.
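The two-level aggregation described above (first within each view at the token level, then across views at the view level) can be illustrated with a minimal sketch. This is not the paper's decoder: it assumes plain scaled dot-product attention with no learned parameters, and the names `attend` and `hierarchical_aggregate` are hypothetical, introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Combine equal-length vectors using the given weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Assumption: stands in for whatever attention the real decoder uses.
    Returns the attended vector and the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    return weighted_sum(weights, vectors), weights


def hierarchical_aggregate(query, views):
    """Aggregate token encodings within each view, then across views.

    `views` is a list of views; each view is a list of token vectors
    (views may have different numbers of tokens).
    """
    # Token level: one summary vector per view.
    summaries = [attend(query, tokens)[0] for tokens in views]
    # View level: weigh the per-view summaries against the same query.
    context, view_weights = attend(query, summaries)
    return context, view_weights


if __name__ == "__main__":
    decoder_state = [1.0, 0.0]
    grid_view = [[2.0, 0.0], [2.0, 0.0]]  # toy "visual" view, two tokens
    tag_view = [[0.0, 2.0]]               # toy "textual" view, one token
    context, view_weights = hierarchical_aggregate(decoder_state, [grid_view, tag_view])
    print(context, view_weights)
```

In this toy setup the view whose tokens align with the decoder state receives the larger view-level weight, which is the adaptive weighting behaviour the abstract describes in words.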