This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos. To enrich the action features, our transformer network utilizes a new multi-modal attention mechanism that computes the correlations between different combinations of spatial and motion modalities. Exploring such correlations for actions has not been attempted previously. To use the motion and spatial modalities more effectively, we suggest an algorithm that corrects the motion distortion caused by camera movement. Such motion distortion, common in untrimmed videos, severely reduces the expressive power of motion features such as optical flow fields. Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks, THUMOS14 and ActivityNet. We also conducted comparative experiments on our new instructional activity dataset, which includes a large set of challenging classroom videos captured in elementary schools.