Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.