Abstract
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
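As an illustration of the feature-wise affine transformation described above, the following minimal sketch scales and shifts each feature map by a pair of coefficients predicted from the conditioning input. The abstract does not specify how those coefficients are produced, so the single linear map used here, the function names (`linear`, `film_parameters`, `film`), and the weight layout are all assumptions made for illustration, not the architecture evaluated in the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
FeatureMap = List[List[float]]  # one H x W map per feature (channel)


def linear(weights: Sequence[Sequence[float]],
           bias: Sequence[float],
           x: Sequence[float]) -> Vector:
    """Return weights @ x + bias using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film_parameters(cond: Sequence[float],
                    w_gamma: Sequence[Sequence[float]],
                    b_gamma: Sequence[float],
                    w_beta: Sequence[Sequence[float]],
                    b_beta: Sequence[float]) -> Tuple[Vector, Vector]:
    """Predict one (gamma, beta) pair per feature from the conditioning vector.

    Assumption: a single linear map stands in for whatever network
    produces the modulation coefficients.
    """
    return linear(w_gamma, b_gamma, cond), linear(w_beta, b_beta, cond)


def film(features: List[FeatureMap],
         gamma: Sequence[float],
         beta: Sequence[float]) -> List[FeatureMap]:
    """Apply gamma[c] * x + beta[c] to every value of feature map c."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need exactly one gamma and one beta per feature map")
    return [[[g * v + b for v in row] for row in fmap]
            for fmap, g, b in zip(features, gamma, beta)]


# Toy example: two feature maps of shape 1 x 2 and a 2-dimensional
# conditioning vector (all values are made up for illustration).
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
cond = [1.0, 0.0]
gamma, beta = film_parameters(
    cond,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
    w_beta=[[1.0, 0.0], [0.0, 0.0]], b_beta=[0.0, -1.0],
)
modulated = film(features, gamma, beta)
print(modulated)  # [[[3.0, 5.0]], [[2.0, 3.0]]]
```

Note that the same scale and shift are shared across all spatial positions of a feature map, which is what makes the modulation feature-wise rather than element-wise.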