We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows the instruction to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.
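
The data-generation idea can be illustrated with a minimal sketch. It assumes one plausible arrangement that the abstract does not spell out: the language model turns a caption into an instruction plus an edited caption, and the text-to-image model renders both captions into a before/after image pair. The names `EditExample`, `build_dataset`, `propose_edit`, and `render` are hypothetical, images are represented by placeholder strings, and the two toy functions stand in for GPT-3 and Stable Diffusion rather than calling them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EditExample:
    """One training triple: input image, written instruction, edited image."""
    input_image: str   # placeholder string standing in for real image data
    instruction: str
    edited_image: str  # placeholder string standing in for real image data


def build_dataset(
    captions: List[str],
    propose_edit: Callable[[str], Tuple[str, str]],
    render: Callable[[str], str],
) -> List[EditExample]:
    """Pair each caption with a proposed edit and render both versions.

    propose_edit: hypothetical stand-in for the language model; maps a
        caption to (instruction, edited_caption).
    render: hypothetical stand-in for the text-to-image model; maps a
        caption to an image.
    """
    examples = []
    for caption in captions:
        instruction, edited_caption = propose_edit(caption)
        examples.append(
            EditExample(
                input_image=render(caption),
                instruction=instruction,
                edited_image=render(edited_caption),
            )
        )
    return examples


# Toy stand-ins (assumptions for illustration only, not the real models).
def toy_propose_edit(caption: str) -> Tuple[str, str]:
    return "make it snowy", caption + " in the snow"


def toy_render(caption: str) -> str:
    return f"<image of {caption}>"


dataset = build_dataset(["a photo of a dog"], toy_propose_edit, toy_render)
for example in dataset:
    print(example)
```

A conditional model trained on such triples would take `input_image` and `instruction` as inputs and `edited_image` as the target, which is what lets editing happen in a single forward pass at inference time.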