We combine neural rendering with multi-modal image and text representations to synthesize diverse 3D objects solely from natural language descriptions. Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision. Due to the scarcity of diverse, captioned 3D data, prior methods only generate objects from a handful of categories, such as ShapeNet. Instead, we guide generation with image-text models pre-trained on large datasets of captioned images from the web. Our method optimizes a Neural Radiance Field from many camera views so that rendered images score highly with a target caption according to a pre-trained CLIP model. To improve fidelity and visual quality, we introduce simple geometric priors, including sparsity-inducing transmittance regularization, scene bounds, and new MLP architectures. In experiments, Dream Fields produce realistic, multi-view consistent object geometry and color from a variety of natural language captions.
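The objective described above can be illustrated with a minimal sketch, offered as one plausible reading rather than the exact formulation used by the method. It assumes the caption score is a cosine similarity between an image embedding and a text embedding, and that the sparsity prior rewards the average transmittance of rays through the scene, clamped at a target value. The function names, the clamp form, and the hyperparameters `step`, `lam`, and `tau` are all hypothetical, and the embeddings are plain vectors standing in for the outputs of a pre-trained CLIP model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ray_transmittance(densities, step):
    """Fraction of light passing through a whole ray.

    Uses the standard volume-rendering form exp(-sum(sigma_i * step)),
    where `densities` are the sampled densities sigma_i along the ray.
    """
    return math.exp(-step * sum(densities))


def caption_guided_loss(image_emb, text_emb, rays, step=0.1, lam=0.5, tau=0.9):
    """Hypothetical loss: caption alignment plus a sparsity reward.

    image_emb, text_emb: stand-ins for CLIP embeddings (assumption).
    rays: list of per-ray density samples from the radiance field.
    lam, tau: assumed regularization weight and transmittance target.
    """
    similarity = cosine_similarity(image_emb, text_emb)
    avg_transmittance = sum(ray_transmittance(r, step) for r in rays) / len(rays)
    # Reward empty space only up to tau, so the scene is pushed toward
    # sparsity without being driven to vanish entirely.
    return -similarity - lam * min(tau, avg_transmittance)


# Same embeddings, so only the sparsity term differs between the two scenes.
sparse_scene = caption_guided_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 0.0]])
opaque_scene = caption_guided_loss([1.0, 0.0], [1.0, 0.0], [[1000.0, 1000.0]])
print(sparse_scene < opaque_scene)
```

In a full pipeline, the image embedding would come from rendering the radiance field at a sampled camera view and encoding the result, and the loss would be minimized over the field's parameters across many views. The clamp is a design choice in this sketch: once average transmittance exceeds `tau`, the sparsity term stops contributing, leaving the caption score as the only signal.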