Characterizing Video Question Answering with Sparsified Inputs

Abstract

In Video Question Answering, videos are often processed as a full-lengthsequence of frames to ensure minimal loss of information. Recent works havedemonstrated evidence that sparse video inputs are sufficient to maintain highperformance. However, they usually discuss the case of single frame selection.In our work, we extend the setting to multiple number of inputs and othermodalities. We characterize the task with different input sparsity and providea tool for doing that. Specifically, we use a Gumbel-based learnable selectionmodule to adaptively select the best inputs for the final task. In this way, weexperiment over public VideoQA benchmarks and provide analysis on howsparsified inputs affect the performance. From our experiments, we haveobserved only 5.2%-5.8% loss of performance with only 10% of video lengths,which corresponds to 2-4 frames selected from each video. Meanwhile, we alsoobserved the complimentary behaviour between visual and textual inputs, evenunder highly sparsified settings, suggesting the potential of improving dataefficiency for video-and-language tasks.

Quick Read (beta)

loading the full paper ...