Abstract
Perceiving the world from both egocentric (first-person) and exocentric (third-person) perspectives is fundamental to human cognition, enabling rich and complementary understanding of dynamic environments. In recent years, enabling machines to leverage the synergistic potential of these dual perspectives has emerged as a compelling research direction in video understanding. In this survey, we provide a comprehensive review of video understanding from both exocentric and egocentric viewpoints. We begin by highlighting the practical applications of integrating egocentric and exocentric techniques, envisioning their potential collaboration across domains. We then identify key research tasks to realize these applications. Next, we systematically organize and review recent advancements into three main research directions: (1) leveraging egocentric data to enhance exocentric understanding, (2) utilizing exocentric data to improve egocentric analysis, and (3) joint learning frameworks that unify both perspectives. For each direction, we analyze a diverse set of tasks and relevant works. Additionally, we discuss benchmark datasets that support research in both perspectives, evaluating their scope, diversity, and applicability. Finally, we discuss limitations in current works and propose promising future research directions. By synthesizing insights from both perspectives, we aim to inspire advancements in video understanding and artificial intelligence, bringing machines closer to perceiving the world in a human-like manner. A GitHub repo of related works can be found at https://github.com/ayiyayi/Awesome-Egocentric-and-Exocentric-Vision.