
WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Utae Jeong 1 Sumin In 1 Hyunju Ryu 1 Jaewan Choi 1
Feng Yang 2 Jongheon Jeong 1 Seungryong Kim 3 Sangpil Kim 1*
1 Korea University 2 Google DeepMind 3 KAIST AI

Overview

Overview of WaTeRFlow. Left: The watermark encoder is optimized to embed the watermark while preserving quality in both pixel and latent space, and it is trained to keep the watermarked image semantically close to the original. Middle: An image editing proxy and a video diffusion proxy perform image editing and video generation, respectively, and the generated frames are then warped to the first frame. Right: The decoder processes the images produced by FUSE (Flow-guided Unified Synthesis Engine) to decode the embedded watermark and compute the training loss. Together, these components enable watermark insertion and detection that are robust to image-to-video generation.

Abstract

Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder–decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
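The idea behind component (ii) can be illustrated with a small sketch. This page does not give the exact form of the Temporal Consistency Loss, so the penalty below is only one plausible reading: it assumes the decoder has already been run on frames warped to the first frame, and it measures how far each frame's predicted bit probabilities drift from those of the first frame. The function names `sigmoid` and `temporal_consistency_penalty` are hypothetical and are not taken from the WaTeRFlow code.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def temporal_consistency_penalty(frame_logits: Sequence[Sequence[float]]) -> float:
    """Mean squared deviation of per-frame bit probabilities from frame 0.

    ASSUMPTION: this is an illustrative stand-in, not the paper's TCL.
    `frame_logits[t][k]` is the decoder logit for bit k on frame t,
    computed after frame t has been warped to the first frame.
    """
    if len(frame_logits) < 2:
        return 0.0  # a single frame has nothing to be inconsistent with

    reference = [sigmoid(v) for v in frame_logits[0]]
    total = 0.0
    count = 0
    for logits in frame_logits[1:]:
        if len(logits) != len(reference):
            raise ValueError("all frames must decode the same number of bits")
        for ref_p, logit in zip(reference, logits):
            total += (sigmoid(logit) - ref_p) ** 2
            count += 1
    return total / count


# Two frames, two bits: the first bit flips from "uncertain" to "confident 1".
penalty = temporal_consistency_penalty([[0.0, 0.0], [100.0, 0.0]])
```

A penalty of zero means every frame decodes to the same bit probabilities as the first frame. Larger values indicate per-frame predictions that fluctuate over time, which is the behavior a temporal consistency term is meant to discourage.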

Quantitative comparison

Per-frame bit accuracy and I2V robustness. Each plot shows bit accuracy on the even-numbered frames after image-to-video (I2V) generation. Across two representative I2V models, our method achieves the highest average bit accuracy among the compared methods. It is also the most robust when I2V generation follows image editing.
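For readers unfamiliar with the metric, bit accuracy is the fraction of decoded watermark bits that match the embedded message. The sketch below computes it on every second frame of a generated video. The helper names, the toy 4-bit message, and the zero-based frame indexing are assumptions made for illustration; they are not the evaluation code used for the plots.

```python
from typing import Dict, Sequence


def bit_accuracy(decoded: Sequence[int], message: Sequence[int]) -> float:
    """Fraction of decoded bits that equal the embedded message bits."""
    if len(decoded) != len(message) or len(message) == 0:
        raise ValueError("bit strings must be non-empty and of equal length")
    matches = sum(1 for d, m in zip(decoded, message) if d == m)
    return matches / len(message)


def per_frame_bit_accuracy(
    frames_bits: Sequence[Sequence[int]],
    message: Sequence[int],
    step: int = 2,
) -> Dict[int, float]:
    """Bit accuracy for every `step`-th frame, keyed by frame index.

    ASSUMPTION: frames are indexed from 0, so step=2 selects frames 0, 2, 4, ...
    """
    return {
        i: bit_accuracy(frames_bits[i], message)
        for i in range(0, len(frames_bits), step)
    }


# Toy example: a 4-bit message decoded from four frames.
message = [1, 0, 1, 1]
decoded_per_frame = [
    [1, 0, 1, 1],  # frame 0: all bits correct
    [0, 0, 0, 0],  # frame 1: skipped when step=2
    [1, 0, 0, 1],  # frame 2: one bit wrong
    [1, 1, 1, 1],  # frame 3: skipped when step=2
]
accuracies = per_frame_bit_accuracy(decoded_per_frame, message)
average = sum(accuracies.values()) / len(accuracies)
```

Here `accuracies` maps frame 0 to 1.0 and frame 2 to 0.75, so the average over the sampled frames is 0.875.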

Qualitative comparison


Qualitative results. Top: The original image and the watermarked images for each watermarking method. Middle: The 24th frames generated by SVD-XT, shown from left to right for the original image, our method, and the baselines. Bottom: The 24th frames generated by CogVideoX, in the same left-to-right order. Our method achieves the highest bit accuracy on the shown frames for both video generation models.

Video results

SVD-XT. These videos were generated with SVD-XT from watermarked images. "w/o Watermark" denotes a video generated from an image without any watermark.
CogVideoX. These videos were generated with CogVideoX from watermarked images. "w/o Watermark" denotes a video generated from an image without any watermark.

BibTeX

@misc{jeong2025waterflowwatermarktemporalrobustness,
  title={WaTeRFlow: Watermark Temporal Robustness via Flow Consistency},
  author={Utae Jeong and Sumin In and Hyunju Ryu and Jaewan Choi and Feng Yang and Jongheon Jeong and Seungryong Kim and Sangpil Kim},
  year={2025},
  eprint={2512.19048},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.19048},
}