A dominant paradigm for learning-based approaches in computer vision is training generic models, such as ResNet for image recognition or I3D for video understanding, on large datasets and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem: the models that need to be trained are large, and labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model, simplifying optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long-term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5% mAP and over the state-of-the-art by 4.8% mAP.