Abstract
We address the challenge of self-supervised representation learning from a continuous stream of video. This differs from standard approaches to video learning, where videos are chopped and shuffled during training to create a non-redundant batch that satisfies the independent and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning in three scenarios: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers that decorrelates batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer; we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. In all three scenarios (DoRA, VideoMAE, future prediction), our orthogonal optimizer outperforms the strong AdamW baseline.
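To make the idea of orthogonal gradients concrete, the following is a minimal sketch of one plausible reading, not the exact update rule of the proposed optimizer, which the abstract does not specify. It assumes that the current gradient is decorrelated from the previous step's gradient by removing its projection onto that gradient, and that the result is then passed to a plain SGD update. The function names `orthogonalize` and `sgd_orthogonal_step`, the choice of the raw previous gradient as the reference direction, and the learning rate are all illustrative assumptions.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(grad: Vector, prev_grad: Optional[Vector],
                  eps: float = 1e-12) -> Vector:
    """Return `grad` with its component along `prev_grad` removed.

    Assumption: the reference direction is the previous step's gradient.
    If there is no usable reference, the gradient is returned unchanged.
    """
    if prev_grad is None:
        return list(grad)
    denom = dot(prev_grad, prev_grad)
    if denom < eps:
        return list(grad)
    scale = dot(grad, prev_grad) / denom
    return [g - scale * p for g, p in zip(grad, prev_grad)]


def sgd_orthogonal_step(params: Vector, grad: Vector,
                        prev_grad: Optional[Vector],
                        lr: float = 0.1) -> Tuple[Vector, Vector]:
    """One SGD step using the orthogonalized gradient (hypothetical helper).

    Returns the updated parameters and the gradient to remember for the
    next step (here, the raw current gradient; this is an assumption).
    """
    update = orthogonalize(grad, prev_grad)
    new_params = [w - lr * u for w, u in zip(params, update)]
    return new_params, list(grad)
```

In a streaming setting, consecutive batches tend to produce highly aligned gradients, so under this sketch the shared direction is largely removed and only the novel component of each gradient drives the update. The same projection could in principle be applied to the gradient before it enters AdamW's moment estimates, but that placement is likewise an assumption here.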