Abstract
Video generation techniques have made remarkable progress, promising to be the foundation of interactive world exploration. However, existing video generation datasets are not well suited to training world exploration models, as they suffer from several limitations: limited locations, short duration, static scenes, and a lack of annotations about exploration and the world. In this paper, we introduce Sekai (meaning ``world'' in Japanese), a high-quality first-person view worldwide video dataset with rich annotations for world exploration. It consists of over 5,000 hours of walking or drone view (FPV and UAV) videos from over 100 countries and regions across 750 cities. We develop an efficient and effective toolbox to collect, pre-process, and annotate videos with location, scene, weather, crowd density, captions, and camera trajectories. Experiments demonstrate the quality of the dataset. We also use a subset to train an interactive video world exploration model named YUME (meaning ``dream'' in Japanese). We believe Sekai will benefit the area of video generation and world exploration, and motivate valuable applications.