Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Abstract

We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results on our project page at https://boyangdeng.com/streetscapes.
