Abstract
Surgical video-language pretraining (VLP) faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. This study aims to bridge the gap by addressing textual information loss in surgical lecture videos and the spatial-temporal challenges of surgical VLP. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining (PeskaVLP) framework to tackle these issues. The knowledge augmentation uses large language models (LLMs) to refine and enrich surgical concepts, thus providing comprehensive language supervision and reducing the risk of overfitting. PeskaVLP combines language supervision with visual self-supervision, constructing hard negative samples and employing a Dynamic Time Warping (DTW) based loss function to effectively capture cross-modal procedural alignment. Extensive experiments on multiple public surgical scene understanding and cross-modal retrieval datasets show that our proposed method significantly improves zero-shot transfer performance and offers a generalist visual representation for further advancements in surgical scene understanding. The code is available at https://github.com/CAMMA-public/SurgVLP.
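To make the DTW idea concrete, the following is a minimal sketch of classic Dynamic Time Warping between a sequence of video-clip embeddings and a sequence of text embeddings. It is not the PeskaVLP loss itself: the function names (`cosine_distance`, `dtw_cost`), the cosine distance, and the toy 2-D embeddings are all assumptions made for illustration. It only shows why a DTW cost can separate a correctly ordered procedure from one whose steps are out of order.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_cost(video_seq, text_seq):
    """Accumulated cost of the best monotonic alignment (classic DTW).

    video_seq and text_seq are lists of embedding vectors (hypothetical
    inputs; a real pipeline would take them from the two encoders).
    """
    n, m = len(video_seq), len(text_seq)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning the first i clips with the first j texts
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(video_seq[i - 1], text_seq[j - 1])
            # Allowed moves: advance the video, advance the text, or advance both
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# Toy example: two procedural steps with orthogonal embeddings (assumed data).
video_steps = [[1.0, 0.0], [0.0, 1.0]]
text_in_order = [[1.0, 0.0], [0.0, 1.0]]
text_reversed = [[0.0, 1.0], [1.0, 0.0]]  # same steps, wrong order

aligned_cost = dtw_cost(video_steps, text_in_order)
reversed_cost = dtw_cost(video_steps, text_reversed)
```

In this toy case the correctly ordered text aligns at lower cost than the reversed text. A DTW-based objective can use that gap to penalize order-violating negatives, although the exact formulation used in training may differ from this sketch.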