Gemino: Practical and Robust Neural Compression for Video Conferencing

Abstract

Video conferencing systems suffer from poor user experience when network conditions deteriorate because current video codecs simply cannot operate at extremely low bitrates. Recently, several neural alternatives have been proposed that reconstruct talking head videos at very low bitrates using sparse representations of each frame such as facial landmark information. However, these approaches produce poor reconstructions in scenarios with major movement or occlusions over the course of a call, and do not scale to higher resolutions. We design Gemino, a new neural compression system for video conferencing based on a novel high-frequency-conditional super-resolution pipeline. Gemino upsamples a very low-resolution version of each target frame while enhancing high-frequency details (e.g., skin texture, hair) based on information extracted from a single high-resolution reference image. We use a multi-scale architecture that runs different components of the model at different resolutions, allowing it to scale to resolutions comparable to 720p, and we personalize the model to learn specific details of each person, achieving much better fidelity at low bitrates. We implement Gemino atop aiortc, an open-source Python implementation of WebRTC, and show that it operates on 1024x1024 videos in real-time on an A100 GPU, and achieves 2.9x lower bitrate than traditional video codecs for the same perceptual quality.
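
The intuition behind high-frequency-conditional super-resolution can be illustrated with a toy, non-learned sketch. This is not the Gemino model: it replaces the neural network with fixed average pooling and nearest-neighbour upsampling, assumes the reference and target are perfectly aligned, and uses hypothetical function names (`downsample`, `upsample`, `high_frequency`, `reconstruct`) chosen only for illustration. It shows that a low-resolution target carries the coarse content, while fine detail can be borrowed from a single high-resolution reference.

```python
# Toy analogy only: a fixed (non-neural) stand-in for the idea in the abstract.
# All names here are hypothetical and do not come from the Gemino codebase.

def downsample(img, k):
    """Average-pool a 2D list-of-lists image by an integer factor k."""
    h, w = len(img), len(img[0])
    return [
        [
            sum(img[y * k + dy][x * k + dx] for dy in range(k) for dx in range(k)) / (k * k)
            for x in range(w // k)
        ]
        for y in range(h // k)
    ]

def upsample(img, k):
    """Nearest-neighbour upsample a 2D image by an integer factor k."""
    return [
        [img[y // k][x // k] for x in range(len(img[0]) * k)]
        for y in range(len(img) * k)
    ]

def high_frequency(img, k):
    """Detail lost in a down/up round trip: img - upsample(downsample(img))."""
    blurred = upsample(downsample(img, k), k)
    return [[p - b for p, b in zip(row, brow)] for row, brow in zip(img, blurred)]

def reconstruct(low_res_target, reference, k):
    """Upsample the low-res target, then add detail taken from the reference."""
    coarse = upsample(low_res_target, k)
    detail = high_frequency(reference, k)
    return [[c + d for c, d in zip(crow, drow)] for crow, drow in zip(coarse, detail)]

def mean_abs_error(a, b):
    """Mean absolute per-pixel difference between two equally sized images."""
    n = len(a) * len(a[0])
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n

# Synthetic 8x8 "reference" with fine texture, and a "target" frame that has
# the same texture but different brightness (an assumption made for the demo).
reference = [[(x * 7 + y * 13) % 16 for x in range(8)] for y in range(8)]
target = [[p + 5 for p in row] for row in reference]

K = 4  # downsampling factor: only a 2x2 frame would be "sent"
low_res = downsample(target, K)
naive = upsample(low_res, K)                       # plain upsampling
conditioned = reconstruct(low_res, reference, K)   # upsampling + reference detail

print("naive error:      ", mean_abs_error(naive, target))
print("conditioned error:", mean_abs_error(conditioned, target))
```

In this contrived case the conditioned reconstruction recovers the target because the texture is identical and aligned in both frames. Real calls involve motion and occlusion, which is why the paper relies on a learned, multi-scale model rather than a fixed residual like this one.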
