Referring object detection and referring image segmentation are important tasks that require joint understanding of visual information and natural language. Yet there is evidence that current benchmark datasets suffer from bias, and that the intermediate reasoning process of current state-of-the-art models cannot be easily evaluated. To address these issues and complement similar efforts in visual question answering, we build CLEVR-Ref+, a synthetic diagnostic dataset for referring expression comprehension. The precise locations and attributes of the objects are readily available, and the referring expressions are automatically associated with functional programs. The synthetic nature allows control over dataset bias (through the sampling strategy), and the modular programs provide intermediate reasoning ground truth without human annotators. In addition to evaluating several state-of-the-art models on CLEVR-Ref+, we also propose IEP-Ref, a module network approach that significantly outperforms other models on our dataset. In particular, we present two interesting and important findings using IEP-Ref: (1) the module trained to transform feature maps into segmentation masks can be attached to any intermediate module to reveal the entire reasoning process step by step; (2) even though every training example refers to at least one object, IEP-Ref can correctly predict no foreground when presented with false-premise referring expressions. To the best of our knowledge, this is the first direct and quantitative proof that neural modules behave in the way they are intended.
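
To make the idea of a functional program attached to a referring expression concrete, the following is a minimal sketch and not the dataset's actual program format. The scene encoding, the `run_program` helper, and the operation names (`scene`, `filter_color`, `filter_shape`) are all assumptions chosen for illustration. It records the set of selected objects after every step, which is the kind of intermediate ground truth the abstract refers to, and it returns an empty selection for a false-premise expression.

```python
from typing import Dict, List, Tuple

Obj = Dict[str, str]
Step = Tuple[str, str]  # (hypothetical operation name, argument)


def run_program(scene: List[Obj], program: List[Step]) -> List[List[int]]:
    """Run a chain of steps and return the selected object indices after each step."""
    selected = list(range(len(scene)))
    trace: List[List[int]] = []
    for op, arg in program:
        if op == "scene":
            # Start from every object in the scene.
            selected = list(range(len(scene)))
        elif op.startswith("filter_"):
            # Keep only objects whose attribute matches the argument.
            attr = op[len("filter_"):]
            selected = [i for i in selected if scene[i][attr] == arg]
        else:
            raise ValueError(f"unknown op: {op}")
        trace.append(selected)
    return trace


# Toy scene with illustrative attributes only.
scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "sphere"},
    {"color": "red", "shape": "sphere"},
]

# "The red sphere": refers to exactly one object.
red_spheres = [("scene", ""), ("filter_color", "red"), ("filter_shape", "sphere")]

# "The green cube": a false-premise expression, no such object exists.
green_cubes = [("scene", ""), ("filter_color", "green"), ("filter_shape", "cube")]

print(run_program(scene, red_spheres))
print(run_program(scene, green_cubes)[-1])
```

In this toy setting, the per-step trace plays the role of the step-by-step reasoning ground truth, and the empty final selection corresponds to the no-foreground case.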