Unsupervised Video Summarization via Reinforcement Learning and a Trained Evaluator

Abstract

This paper presents a novel approach for unsupervised video summarization using reinforcement learning. It aims to address limitations of current unsupervised methods, including the unstable training of adversarial generator-discriminator architectures and the reliance on hand-crafted reward functions for quality evaluation. The proposed method is based on the concept that a concise and informative summary should result in a reconstructed video that closely resembles the original. The summarizer model assigns an importance score to each frame and generates a video summary. In the proposed scheme, reinforcement learning, coupled with a unique reward generation pipeline, is employed to train the summarizer model. The reward generation pipeline trains the summarizer to create summaries that lead to improved reconstructions. It comprises a generator model capable of reconstructing masked frames from a partially masked video, along with a reward mechanism that compares the video reconstructed from the summary against the original. The video generator is trained in a self-supervised manner to reconstruct randomly masked frames, enhancing its ability to generate accurate summaries. This training pipeline results in a summarizer model that better mimics human-generated video summaries compared to methods relying on hand-crafted rewards. Unlike adversarial architectures, the training process consists of two stable and isolated training steps. Experimental results demonstrate promising performance, with F-scores of 62.3 and 54.5 on the TVSum and SumMe datasets, respectively. Additionally, the inference stage is 300 times faster than our previously reported state-of-the-art method.
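
The reward idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_generator` is a hypothetical stand-in that fills masked frames by linear interpolation instead of using a trained generator, the negative mean squared error is an assumed similarity measure, and the `budget` parameter and all function names are illustrative only.

```python
import numpy as np


def select_summary(scores, budget=0.15):
    """Keep the highest-scoring frames, up to a `budget` fraction of the video.

    `budget` is a hypothetical parameter; the abstract does not state one.
    """
    scores = np.asarray(scores, dtype=float)
    n_keep = max(1, int(round(budget * len(scores))))
    keep = np.zeros(len(scores), dtype=bool)
    keep[np.argsort(scores)[-n_keep:]] = True
    return keep


def toy_generator(features, keep):
    """Stand-in for the trained generator (assumption, not the paper's model).

    Masked frames are filled by linear interpolation between kept frames.
    """
    features = np.asarray(features, dtype=float)
    idx = np.arange(len(features))
    kept_idx = idx[keep]
    recon = np.empty_like(features, dtype=float)
    for d in range(features.shape[1]):
        recon[:, d] = np.interp(idx, kept_idx, features[kept_idx, d])
    return recon


def reconstruction_reward(features, keep, generator=toy_generator):
    """Reward a summary by how well the video is rebuilt from it.

    Uses negative mean squared error as an assumed similarity measure:
    a higher reward means a closer reconstruction.
    """
    features = np.asarray(features, dtype=float)
    recon = generator(features, keep)
    return -float(np.mean((recon - features) ** 2))


# Toy "video": five frames with one feature each, peaking at frame 2.
video = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
good = np.array([True, False, True, False, True])   # keeps the turning points
bad = np.array([True, True, False, False, True])    # misses the peak

print(reconstruction_reward(video, good) > reconstruction_reward(video, bad))
```

In this toy setting, the summary that keeps the turning points of the signal earns the higher reward, which is the kind of signal a reinforcement learning step could use to update the summarizer's frame scores.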
