This paper aims to reduce the rendering time of generalizable radiance fields. Some recent works equip neural radiance fields with image encoders and are able to generalize across scenes, which avoids the per-scene optimization. However, their rendering process is generally very slow. A major factor is that they sample lots of points in empty space when inferring radiance fields. In this paper, we present a hybrid scene representation which combines the best of implicit radiance fields and explicit depth maps for efficient rendering. Specifically, we first build the cascade cost volume to efficiently predict the coarse geometry of the scene. The coarse geometry allows us to sample few points near the scene surface and significantly improves the rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from only RGB images. Experiments show that the proposed approach exhibits state-of-the-art performance on the DTU, Real Forward-facing and NeRF Synthetic datasets, while being at least 50 times faster than previous generalizable radiance field methods. We also demonstrate the capability of our method to synthesize free-viewpoint videos of dynamic human performers in real-time. The code will be available at https://zju3dv.github.io/enerf/.
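The idea of sampling only near the predicted surface can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a per-ray depth estimate and an interval half-width are already available, and the names `uniform_samples`, `depth_guided_samples`, `depth`, `half_width` and `n_samples`, as well as all numeric values, are hypothetical choices made for illustration.

```python
import numpy as np

def uniform_samples(near, far, n_samples):
    """Baseline: spread samples evenly over the whole ray range [near, far]."""
    t = np.linspace(0.0, 1.0, n_samples)                  # (n_samples,)
    return near[:, None] + (far - near)[:, None] * t[None, :]

def depth_guided_samples(depth, half_width, near, far, n_samples):
    """Place samples only inside [depth - half_width, depth + half_width].

    The interval is clipped to [near, far]; samples sit at the midpoints
    of n_samples equal bins within the interval.
    """
    lo = np.clip(depth - half_width, near, far)
    hi = np.clip(depth + half_width, near, far)
    t = (np.arange(n_samples) + 0.5) / n_samples          # bin midpoints in (0, 1)
    return lo[:, None] + (hi - lo)[:, None] * t[None, :]

# Two example rays with assumed (hypothetical) coarse depth estimates.
near = np.array([2.0, 2.0])
far = np.array([6.0, 6.0])
depth = np.array([3.0, 5.5])
half_width = np.array([0.2, 0.2])

coarse = uniform_samples(near, far, 128)                      # 128 samples per ray
fine = depth_guided_samples(depth, half_width, near, far, 2)  # 2 samples per ray
print(coarse.shape, fine.shape)
print(fine)
```

In this sketch each ray needs 2 network queries instead of 128, which is where the speed-up of a depth-guided sampler would come from. The sample count of 2 is an arbitrary choice for the example.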