Abstract
The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: https://github.com/SPIN-UMass/VidStamp
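The capacity and detection figures above can be related with a short calculation. The sketch below is an illustration, not the authors' evaluation code: it assumes the log P-value is the base-10 logarithm of a one-sided binomial tail (the probability that at least the observed number of bits match by chance, with each bit matching with probability 0.5), and it assumes the 95.0% accuracy applies to all 768 bits of a single video. The function name `log10_binomial_tail` is hypothetical.

```python
import math


def log10_binomial_tail(n_bits, n_matched):
    """Return log10 of P(X >= n_matched) for X ~ Binomial(n_bits, 0.5).

    Assumption: detection is a one-sided binomial test against chance.
    """
    # Natural log of each tail term C(n, j) * 0.5**n, via lgamma so that
    # the tiny probabilities do not underflow to zero.
    log_terms = [
        math.lgamma(n_bits + 1)
        - math.lgamma(j + 1)
        - math.lgamma(n_bits - j + 1)
        - n_bits * math.log(2.0)
        for j in range(n_matched, n_bits + 1)
    ]
    # Log-sum-exp over the tail terms, then convert to base 10.
    peak = max(log_terms)
    log_sum = peak + math.log(sum(math.exp(t - peak) for t in log_terms))
    return log_sum / math.log(10.0)


bits_per_video = 768
bits_per_frame = 48
frames = bits_per_video // bits_per_frame  # frames implied by the capacity figures
matched = round(0.95 * bits_per_video)     # matched bits at 95.0% bit accuracy
log10_p = log10_binomial_tail(bits_per_video, matched)

print(f"frames per video: {frames}")
print(f"matched bits:     {matched} of {bits_per_video}")
print(f"log10 p-value:    {log10_p:.2f}")
```

Under these assumptions the capacity figures imply 16 watermarked frames per video, and the computed tail probability falls in the same range as the reported log P-value. The exact reported figure depends on how the authors aggregate over videos, which the abstract does not state.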