EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization

Abstract

Egocentric video-language understanding demands both high efficiency and accurate spatial-temporal modeling. Existing approaches face three key challenges: 1) excessive pre-training cost arising from multi-stage pre-training pipelines, 2) ineffective spatial-temporal encoding due to manually split 3D rotary positional embeddings that hinder feature interactions, and 3) imprecise learning objectives in soft-label multi-instance retrieval, which neglect negative pair correlations. In this paper, we introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks. First, EVA02-AT efficiently transfers an image-based CLIP model into a unified video encoder via single-stage pre-training. Second, instead of applying rotary positional embeddings to isolated dimensions, we introduce spatial-temporal rotary positional embeddings along with joint attention, which can effectively encode both spatial and temporal information on the entire hidden dimension. This joint encoding of spatial-temporal features enables the model to learn cross-axis relationships, which are crucial for accurately modeling motion and interaction in videos. Third, focusing on multi-instance video-language retrieval tasks, we introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs, providing a more precise learning objective. Extensive experiments on Ego4D, EPIC-Kitchens-100, and Charades-Ego under zero-shot and fine-tuning settings demonstrate that EVA02-AT achieves state-of-the-art performance across diverse egocentric video-language tasks with fewer parameters. Models with our SMS loss also show significant performance gains on multi-instance retrieval benchmarks. Our code and models are publicly available at https://github.com/xqwang14/EVA02-AT .
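To make the contrast between split and joint rotary embeddings concrete, the following is a minimal NumPy sketch and not the released implementation. It rests on several assumptions that the abstract does not state: each token has a frame index and a single flattened patch index (a real model would use separate height and width indices), the joint scheme is modeled by summing the temporal and spatial rotation angles on every channel pair, and the two frequency bases (100 and 10000) are hypothetical values chosen only so that the two axes stay distinguishable. The function names are likewise illustrative.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard 1D rotary angles, shape (n_tokens, dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def rotate(x, angles):
    """Rotate consecutive channel pairs of x (n_tokens, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def split_rope(x, t_idx, s_idx):
    """Split scheme: temporal angles on the first half of the channels,
    spatial angles on the second half, so the two axes never share a channel."""
    half = x.shape[1] // 2
    return np.concatenate(
        [rotate(x[:, :half], rope_angles(t_idx, half, base=100.0)),
         rotate(x[:, half:], rope_angles(s_idx, half, base=10000.0))],
        axis=1,
    )

def joint_rope(x, t_idx, s_idx):
    """Joint scheme (assumed form): every channel pair is rotated by the sum of
    a temporal and a spatial angle, so both axes act on the whole hidden dimension."""
    dim = x.shape[1]
    angles = (rope_angles(t_idx, dim, base=100.0)
              + rope_angles(s_idx, dim, base=10000.0))
    return rotate(x, angles)
```

Under these assumptions, the attention score between a rotated query and a rotated key depends only on the temporal and spatial offsets between the two tokens, and in the joint variant every channel pair sees both offsets at once, which is one way to read the "cross-axis relationships" mentioned above.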
