Abstract
End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.