It is still challenging to build an AI system that can perform tasks thatinvolve vision and language at human level. So far, researchers have singledout individual tasks separately, for each of which they have designed networksand trained them on its dedicated datasets. Although this approach has seen acertain degree of success, it comes with difficulties of understandingrelations among different tasks and transferring the knowledge learned for atask to others. We propose a multi-task learning approach that enables to learnvision-language representation that is shared by many tasks from their diversedatasets. The representation is hierarchical, and prediction for each task iscomputed from the representation at its corresponding level of the hierarchy.We show through experiments that our method consistently outperforms previoussingle-task-learning methods on image caption retrieval, visual questionanswering, and visual grounding. We also analyze the learned hierarchicalrepresentation by visualizing attention maps generated in our network.