Abstract
Processing spatial data is a key component in many learning tasks for autonomous driving, such as motion forecasting, multi-agent simulation, and planning. Prior works have demonstrated the value of SE(2)-invariant network architectures that consider only the relative poses between objects (e.g., other agents, or scene features such as traffic lanes). However, these methods compute the relative poses for all pairs of objects explicitly, requiring memory quadratic in the number of objects. In this work, we propose a mechanism for SE(2)-invariant scaled dot-product attention that requires memory linear in the number of objects in the scene. Our SE(2)-invariant transformer architecture enjoys the same scaling properties that have benefited large language models in recent years. We demonstrate experimentally that our approach is practical to implement and improves performance over comparable non-invariant architectures.
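To make the memory argument concrete, the following is a minimal sketch of the explicit pairwise computation attributed above to prior methods, not of the mechanism proposed in this work. It assumes each object is a planar pose (x, y, heading); the function names `relative_pose`, `pairwise_relative_poses`, and `transform` are hypothetical and chosen only for illustration.

```python
import math

def relative_pose(pose_i, pose_j):
    """Pose of object j expressed in the local frame of object i."""
    xi, yi, ti = pose_i
    xj, yj, tj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the world-frame offset by -theta_i.
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Wrap the heading difference to [-pi, pi].
    dt = math.atan2(math.sin(tj - ti), math.cos(tj - ti))
    return (rx, ry, dt)

def pairwise_relative_poses(poses):
    """Explicit N x N table of relative poses (memory grows quadratically)."""
    return [[relative_pose(pi, pj) for pj in poses] for pi in poses]

def transform(pose, g):
    """Apply a global SE(2) transform g = (tx, ty, phi) to a world-frame pose."""
    x, y, t = pose
    tx, ty, phi = g
    c, s = math.cos(phi), math.sin(phi)
    return (c * x - s * y + tx, s * x + c * y + ty, t + phi)

# Three illustrative poses (assumed values, not data from the paper).
poses = [(0.0, 0.0, 0.0), (2.0, 1.0, 0.5), (-1.0, 3.0, -1.2)]
g = (5.0, -2.0, 0.7)  # arbitrary change of world frame

before = pairwise_relative_poses(poses)
after = pairwise_relative_poses([transform(p, g) for p in poses])

# Largest change in any relative-pose component after moving the world frame.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(before, after)
    for pa, pb in zip(row_a, row_b)
    for a, b in zip(pa, pb)
)
num_entries = sum(len(row) for row in before)
```

Here `max_diff` stays at floating-point noise level, which is the SE(2) invariance of relative poses, while `num_entries` equals the square of the number of objects, which is the quadratic cost the abstract refers to.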