Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding objects in images or entities in text, they often ignore the alignment at the level of events and their argument structures. In this work, we propose a contrastive learning framework that encourages vision-language pretraining models to comprehend events and their associated argument (participant) roles. To achieve this, we take advantage of text information extraction technologies to obtain event structural knowledge, and use multiple prompt functions to contrast difficult negative descriptions generated by manipulating event structures. We also design an event graph alignment loss based on optimal transport to capture event argument structures. In addition, we collect a large event-rich dataset (106,875 images) for pretraining, which provides a more challenging image retrieval benchmark for assessing the understanding of long, complex sentences. Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction on Multimedia Event Extraction, achieving more than 5\% absolute F-score gain in event extraction, as well as significant improvements on a variety of downstream tasks under zero-shot settings.
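
To make the contrastive objective concrete, the following is a minimal sketch of an InfoNCE-style loss in which an image embedding is contrasted against one positive description and several negative descriptions. It is an illustration under stated assumptions, not the exact loss used in this work: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all hypothetical, and the negatives stand in for descriptions obtained by manipulating event structures.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_emb, positive_emb, negative_embs, temperature=0.07):
    """InfoNCE-style loss: negative log-softmax of the positive description.

    The temperature value is an assumed default, not taken from the paper.
    """
    logits = [cosine(image_emb, positive_emb) / temperature]
    logits += [cosine(image_emb, neg) / temperature for neg in negative_embs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (hypothetical): a hard negative lies close to the positive,
# an easy negative points in the opposite direction.
image = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negative = [0.8, 0.3]
easy_negative = [-1.0, 0.0]

loss_hard = contrastive_loss(image, positive, [hard_negative])
loss_easy = contrastive_loss(image, positive, [easy_negative])
```

In this toy setting the hard negative yields a larger loss than the easy one, which is the motivation for contrasting against difficult negative descriptions: they provide a stronger training signal than unrelated text.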