WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Overview

Overview of WaTeRFlow. Left: The watermark encoder is optimized to embed the watermark while preserving quality in both pixel and latent space, and it is trained to keep the watermarked image semantically close to the original. Middle: An image editing proxy and a video diffusion proxy perform image editing and video generation, respectively, and the generated frames are then warped to the first frame. Right: The decoder takes the images produced by FUSE, decodes the embedded watermark, and the training loss is computed from its output. Together, these components enable watermark insertion and detection that are robust to image-to-video generation.
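The warping step in the middle panel can be illustrated with a small sketch. It is a toy stand-in, not the authors' implementation: it assumes a dense flow field is already available (the page does not say which optical-flow estimator is used), uses nearest-neighbor sampling, and represents frames as nested lists. The function name `warp_to_reference` and the flow convention are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]
Flow = List[List[Tuple[float, float]]]


def warp_to_reference(frame: Frame, flow: Flow, fill: float = 0) -> Frame:
    """Backward-warp `frame` onto the pixel grid of the first (reference) frame.

    Assumed convention (hypothetical): flow[y][x] = (dx, dy) means that the
    content at (x, y) in the reference frame is found at (x + dx, y + dy)
    in `frame`. Samples that fall outside the frame are set to `fill`.
    """
    height, width = len(frame), len(frame[0])
    warped = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = round(x + dx), round(y + dy)  # nearest neighbor
            if 0 <= src_x < width and 0 <= src_y < height:
                warped[y][x] = frame[src_y][src_x]
    return warped


# Toy example: the later frame is the first frame shifted right by one pixel.
first_frame = [[1, 2, 3],
               [4, 5, 6]]
later_frame = [[0, 1, 2],
               [0, 4, 5]]
shift_right = [[(1, 0)] * 3 for _ in range(2)]

aligned = warp_to_reference(later_frame, shift_right)
```

After warping, the visible part of `aligned` matches `first_frame` pixel for pixel, which is the kind of spatial alignment that lets a decoder compare predictions across frames. A practical version would use bilinear sampling on image tensors.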

Abstract

Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V) generation, where per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder–decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
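This page does not give the exact form of the Temporal Consistency Loss. As a rough illustration of what "stabilizing per-frame predictions" could mean, the sketch below assumes one plausible instance: the mean squared deviation of each later frame's decoder logits from the first frame's logits. The function name, the choice of the first frame as reference, and the squared-error form are all assumptions made here, not details confirmed by the source.

```python
from typing import Sequence


def temporal_consistency_loss(frame_logits: Sequence[Sequence[float]]) -> float:
    """Assumed TCL form (hypothetical): mean squared difference between the
    decoder logits of every later frame and those of the first frame."""
    if len(frame_logits) < 2:
        return 0.0  # a single frame is trivially consistent
    reference = frame_logits[0]
    total, count = 0.0, 0
    for logits in frame_logits[1:]:
        if len(logits) != len(reference):
            raise ValueError("all frames must have the same number of bits")
        for value, ref_value in zip(logits, reference):
            total += (value - ref_value) ** 2
            count += 1
    return total / count


# Two-bit message, three warped frames: the last frame's logits have drifted.
logits = [[2.0, -2.0],
          [2.0, -2.0],
          [1.0, -1.0]]
loss = temporal_consistency_loss(logits)
```

Under this assumed form, the loss is zero when every frame yields the same logits and grows as later frames drift, so minimizing it pushes the decoder toward the same prediction on every warped frame.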

Quantitative comparison

Per-frame bit accuracy and I2V robustness. Each plot shows bit accuracy on the even-numbered frames after image-to-video (I2V) generation. Across two representative I2V models, our method achieves the highest average bit accuracy among the compared methods. It is also the most robust when I2V generation follows image editing.
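The metric in these plots can be sketched as follows. Bit accuracy is taken here to be the fraction of decoded bits that match the embedded message, which is the usual definition. The function names, the 1-indexed frame numbering, and the toy data are assumptions for illustration, not values from the paper.

```python
from statistics import mean
from typing import Dict, Sequence


def bit_accuracy(decoded: Sequence[int], message: Sequence[int]) -> float:
    """Fraction of decoded bits that equal the embedded message bits."""
    if not message or len(decoded) != len(message):
        raise ValueError("decoded and message must be non-empty and equal in length")
    return sum(d == m for d, m in zip(decoded, message)) / len(message)


def even_frame_accuracies(
    per_frame_bits: Sequence[Sequence[int]], message: Sequence[int]
) -> Dict[int, float]:
    """Bit accuracy for frames 2, 4, 6, ... (frames assumed 1-indexed)."""
    return {
        number: bit_accuracy(bits, message)
        for number, bits in enumerate(per_frame_bits, start=1)
        if number % 2 == 0
    }


# Toy data: a 4-bit message and the bits decoded from four generated frames.
message = [1, 0, 1, 1]
decoded_frames = [
    [1, 0, 1, 1],  # frame 1
    [1, 0, 0, 1],  # frame 2: one bit flipped
    [0, 0, 0, 0],  # frame 3
    [1, 0, 1, 1],  # frame 4: all bits correct
]

accuracies = even_frame_accuracies(decoded_frames, message)
average = mean(accuracies.values())
```

Here frame 2 scores 0.75 and frame 4 scores 1.0, for an average of 0.875 over the even-numbered frames.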

Qualitative comparison

Qualitative results. Top: The original image and the watermarked image for each watermarking method. Middle: The 24th frames of videos generated with SVD-XT, shown from left to right for the original image, our method, and the baselines. Bottom: The 24th frames of videos generated with CogVideoX, in the same order. For the frames shown, our method achieves the highest bit accuracy with both video generation models.

Video results

SVD-XT. These videos were generated with SVD-XT from watermarked images. "w/o Watermark" refers to a video generated from an image without any watermark.

CogVideoX. These videos were generated with CogVideoX from watermarked images. "w/o Watermark" refers to a video generated from an image without any watermark.

BibTeX

@InProceedings{Jeong_2026_CVPR,
    author    = {Jeong, Utae and In, Sumin and Ryu, Hyunju and Choi, Jaewan and Yang, Feng and Jeong, Jongheon and Kim, Seungryong and Kim, Sangpil},
    title     = {WaTeRFlow: Watermark Temporal Robustness via Flow Consistency},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {31703-31713}
}
