We introduce a class of causal video understanding models that aims to improve efficiency of video processing by maximising throughput, minimising latency, and reducing the number of clock cycles. Leveraging operation pipelining and multi-rate clocks, these models perform a minimal amount of computation (e.g. as few as four convolutional layers) for each frame per timestep to produce an output. The models are still very deep, with dozens of such operations being performed, but in a pipelined fashion that enables depth-parallel computation. We illustrate the proposed principles by applying them to existing image architectures and analyse their behaviour on two video tasks: action recognition and human keypoint localisation. The results show that a significant degree of parallelism, and implicitly speedup, can be achieved with little loss in performance.
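
To make the idea of depth-parallel pipelining concrete, the following is a minimal toy sketch, not the implementation used in this work. It assumes each "stage" is a plain Python function standing in for a small group of layers, and that at every timestep each stage reads only what its predecessor produced on the previous timestep. The names `run_sequential`, `run_pipelined`, and the scalar stages are hypothetical and chosen purely for illustration.

```python
def run_sequential(frames, stages):
    """Baseline: every frame passes through all stages within one timestep."""
    outputs = []
    for frame in frames:
        x = frame
        for stage in stages:
            x = stage(x)
        outputs.append(x)
    return outputs


def run_pipelined(frames, stages):
    """Toy depth-parallel pipeline (illustrative assumption, not the paper's code).

    At each timestep, stage i consumes the value stage i-1 produced on the
    previous timestep, so no stage waits on another within a timestep and
    all stages could in principle run in parallel.
    """
    # buffers[i] holds the output of stage i from the previous timestep;
    # None marks a pipeline slot that has not been filled yet.
    buffers = [None] * len(stages)
    outputs = []
    for frame in frames:
        # Stage 0 reads the new frame; stage i reads the old output of stage i-1.
        inputs = [frame] + buffers[:-1]
        buffers = [
            stage(x) if x is not None else None
            for stage, x in zip(stages, inputs)
        ]
        outputs.append(buffers[-1])
    return outputs


# Hypothetical scalar "layers" standing in for groups of convolutions.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
frames = [0, 1, 2, 3, 4]

sequential = run_sequential(frames, stages)
pipelined = run_pipelined(frames, stages)
```

In this sketch the pipelined outputs match the sequential ones but arrive `len(stages) - 1` timesteps later, while the work done on the critical path per timestep shrinks from all stages to a single stage. In a real video model the delayed features would be combined with newer frames rather than simply shifted, which is where a loss in accuracy can arise.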