To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that leverages multiple expert foundation models, trained individually on language, vision, and action data, jointly to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded to visual-motor control through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning within this hierarchy, we enforce consistency between the models via iterative refinement. We illustrate the efficacy and adaptability of our approach in three different long-horizon table-top manipulation tasks.
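
The three-level hierarchy described above can be illustrated with a minimal sketch. It assumes each level is exposed as a plain callable; the names `propose_subgoals`, `generate_video`, and `infer_action` are hypothetical placeholders rather than the actual HiP interface, and the iterative refinement that enforces consistency between levels is not modeled here.

```python
from typing import Callable, List, Sequence

# All names in this sketch are hypothetical placeholders, not the authors' API.
Frame = List[float]   # stand-in for an image observation
Action = List[float]  # stand-in for a low-level control command


def hierarchical_plan(
    goal: str,
    observation: Frame,
    propose_subgoals: Callable[[str, Frame], Sequence[str]],
    generate_video: Callable[[str, Frame], Sequence[Frame]],
    infer_action: Callable[[Frame, Frame], Action],
) -> List[Action]:
    """Chain task planning, video planning, and action inference."""
    actions: List[Action] = []
    current = observation
    # Level 1: a language model (assumed) proposes a symbolic subgoal sequence.
    for subgoal in propose_subgoals(goal, current):
        # Level 2: a video model (assumed) imagines frames that reach the subgoal.
        frames = generate_video(subgoal, current)
        previous = current
        # Level 3: an inverse dynamics model (assumed) infers the action
        # connecting each pair of consecutive frames.
        for frame in frames:
            actions.append(infer_action(previous, frame))
            previous = frame
        # The last imagined frame becomes the starting point for the next subgoal.
        current = previous
    return actions


# Toy stand-ins so the sketch runs end to end; they carry no learned behavior.
def toy_subgoals(goal: str, obs: Frame) -> List[str]:
    return [f"{goal}: step {i}" for i in range(2)]


def toy_video(subgoal: str, obs: Frame) -> List[Frame]:
    return [[obs[0] + 1.0], [obs[0] + 2.0]]


def toy_inverse_dynamics(before: Frame, after: Frame) -> Action:
    return [after[0] - before[0]]


plan = hierarchical_plan(
    "stack blocks", [0.0], toy_subgoals, toy_video, toy_inverse_dynamics
)
```

With the toy stand-ins, two subgoals of two frames each yield four actions. In a real system each callable would wrap a large pretrained model, and the plan would be refined with feedback between levels rather than produced in a single forward pass.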