MTVG: Multi-text Video Generation with Text-to-Video Models

Gyeongrok Oh1, Jaehwan Jeong1, Sieun Kim1, Wonmin Byeon2, Jinkyu Kim3, Sungwoong Kim1, Hyeokmin Kwon1, Sangpil Kim1
1Department of Artificial Intelligence, Korea University
2NVIDIA Research, NVIDIA Corporation
3Department of Computer Science and Engineering, Korea University

Abstract

Recently, video generation has attracted massive attention and yielded noticeable results. Since videos naturally depict sequences of events, conditioning on multiple texts is necessary to generate the subsequent steps of a video. In this work, we propose MTVG, a novel multi-text video generation method that directly utilizes a pre-trained diffusion-based text-to-video (T2V) generation model without additional fine-tuning. Generating consecutive video segments from distinct prompts requires visual consistency across clips while still allowing diverse variations, such as motion- and content-related transitions. MTVG includes Dynamic Noise and Last Frame Aware Inversion, which reinitialize the noise latent to preserve spatial context between video clips of different prompts and to prevent repetitive motion or content. Furthermore, we present Structure Guiding Sampling to maintain the global appearance across the frames of a single video clip, where we iteratively update the latent based on the preceding frame. Additionally, our Prompt Generator accepts text conditions of arbitrary format describing diverse events. Extensive experiments covering diverse transitions between descriptions demonstrate that our proposed method generates semantically coherent and temporally seamless videos, outperforming prior approaches.

Overview

Overview of MTVG

MTVG synthesizes consecutive video clips corresponding to distinct prompts. The overall pipeline comprises two major components: last frame-aware latent initialization and structure-guided sampling. First, in last frame-aware latent initialization, the pre-trained text-to-video generation model takes the repeated last frame of the previous clip as input and inverts it into the initial latent code using two novel techniques: dynamic noise and last frame-aware inversion. Second, structure-guided sampling enforces continuity within each video clip by iteratively updating the latent code based on the preceding frames.
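The two components above can be illustrated with a deliberately simplified sketch. The actual method operates inside the diffusion sampling loop of a pre-trained T2V model; here, `dynamic_noise` and `structure_guided_update` are toy stand-ins (our own names and blend/scale parameters, not the paper's exact formulation) that only show the two latent operations: blending the inverted last-frame latent with fresh noise, and nudging each frame's latent toward its predecessor.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_noise(inverted_latent, alpha=0.6):
    """Blend the inverted latent of the previous clip's last frame with
    fresh Gaussian noise: the inverted part carries spatial context into
    the next clip, while the fresh part discourages repeated motion."""
    fresh = rng.standard_normal(inverted_latent.shape)
    return alpha * inverted_latent + (1.0 - alpha) * fresh

def structure_guided_update(frame_latent, prev_frame_latent, scale=0.1):
    """Pull each frame's latent slightly toward the preceding frame's
    latent so the global appearance stays consistent within a clip."""
    return frame_latent - scale * (frame_latent - prev_frame_latent)

# Toy latents for one clip: (frames, channels, height, width).
latents = rng.standard_normal((4, 4, 8, 8))

# 1) Re-initialize the next clip's starting latent from the last frame.
init_latent = dynamic_noise(latents[-1])

# 2) Apply the structure-guided update frame by frame within the clip.
for t in range(1, latents.shape[0]):
    latents[t] = structure_guided_update(latents[t], latents[t - 1])
```

In the full pipeline these updates interleave with denoising steps of the frozen T2V model rather than acting on latents directly.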

Comparison

1. Quantitative Results


2. Qualitative Results

Prompts

A man rides a bicycle on a beautiful tropical beach at sunset of 4k high resolution.
→ A man walks on a beautiful tropical beach at sunset of 4k high resolution.
→ A man reads a book on a beautiful tropical beach at sunset of 4k high resolution.

Ours

DirecT2V

T2V-Zero

Gen-L-Video

VidRD

Qualitative Results

1. LVDM (256x256)

Prompts

1. "There is a beach where there is no one."
2. "The waves hit the deserted beach."
3. "There is a beach that has been swept away by waves."

Prompts

1. "The volcano erupts in the clear weather."
2. "Smoke comes from the crater of the volcano, which has ended its eruption in the clear weather."
3. "The weather around the volcano turns cloudy."

Prompts

1. "A white dog is running in the beautiful meadow."
2. "A white dog is standing in the beautiful meadow."
3. "A white dog is yawning loudly in the beautiful meadow."
4. "A white dog lies on the ground in the beautiful meadow."

Prompts

1. "Santa Claus goes snowboarding on a snowy mountain."
2. "Santa Claus rides his sleigh through the snow in the mountain."
3. "Santa Claus walks through the forest to a frozen lake."
4. "Santa Claus has fun skating on the ice."

Prompts

1. "A Red Riding Hood girl walks in the woods."
2. "A Red Riding Hood girl sells matches in the forest."
3. "A Red Riding Hood girl falls asleep in the forest."
4. "A Red Riding Hood girl walks towards the lake from the forest."

Prompts

1. "The sea through the window."
2. "A ship passes by on the sea horizon through the window."
3. "Seagulls fly over the sea through the window."
4. "Camera movement towards the sea."
5. "A person is floating in the sea."
6. "A person gets on a surfboard and goes surfing."
7. "A person goes into the water."
8. "A person walks out onto the beach."
9. "A person walks on the beach."

2. VideoCrafter1 (512x1024)

Prompts

1. "An astronaut in a white uniform is snowboarding in the snowy hill."
2. "An astronaut in a white uniform is surfing in the sea."
3. "An astronaut in a white uniform is surfing in the desert."

Prompts

1. "A white dog is running in the beautiful meadow."
2. "A white dog is standing in the beautiful meadow."
3. "A white dog is yawning loudly in the beautiful meadow."
4. "A white dog lies on the ground in the beautiful meadow."

Ablation Study

Prompts

There is a beach where there is no one.
→ The waves hit the deserted beach.
→ There is a beach that has been swept away by waves.

DDIM Inversion

with LFAI

with DN

with DN, LFAI

Ours (w/ DN, LFAI, SGS)

Applications

1. Image and Multi-text-based Video Generation

Prompts

A single white flower gradually blooms from a single green flower bud.
→ The single white flower is blooming.
→ A lovely fully blossomed single white flower.

Input Image

Output

Prompts

People walks on the beach at night.
→ There are sand castles on the beach under the fireworks at night.
→ Very few people remain on the beach at night and they gradually fade away.

Input Image

Output

2. Video Generation with a Large Language Model (LLM)

Original Scenario

"In the morning, Albert Einstein was walking in the forest, later he read a book under a tree, and as night fell, he walked towards the lake, eventually sitting near it in the forest at night."

Prompts generated by the LLM

Albert Einstein is walking in the forest in the morning.
→ Albert Einstein reads a book under a tree.
→ Albert Einstein walks from the forest towards the lake as night falls.
→ Albert Einstein sits near the lake in the forest at night.
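The exact instruction our Prompt Generator sends to the LLM is not reproduced on this page; the following is a minimal sketch of the idea, where the instruction wording and the line-based parser are both illustrative assumptions. A free-form scenario goes in, and one single-event prompt per line comes back and is cleaned up for the T2V model.

```python
def build_decomposition_prompt(scenario):
    """Instruction asking the LLM to split a scenario into ordered,
    single-event video prompts (wording is illustrative)."""
    return (
        "Split the following scenario into short video prompts, "
        "one event per line, in chronological order:\n" + scenario
    )

def parse_prompts(llm_output):
    """Strip list numbering and arrows from the LLM's reply, keeping
    one prompt per non-empty line."""
    prompts = []
    for line in llm_output.splitlines():
        line = line.strip().lstrip("0123456789.)→- ").strip()
        if line:
            prompts.append(line)
    return prompts

# Example reply an LLM might return for the Einstein scenario above.
reply = """1. Albert Einstein is walking in the forest in the morning.
2. Albert Einstein reads a book under a tree.
3. Albert Einstein walks from the forest towards the lake as night falls.
4. Albert Einstein sits near the lake in the forest at night."""

prompts = parse_prompts(reply)
```

Each parsed prompt then conditions one clip of the multi-text generation pipeline.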

Output

Original Scenario

"A man embarks on a motorcycle journey, runs through a traffic jam on a busy road, rides a motorcycle in the desert, walks in the desert at night, and looks at the sky with aurora in the desert."

Prompts generated by the LLM

A man embarks on a motorcycle journey.
→ A man runs through a traffic jam on a busy road.
→ A man rides a motorcycle in the desert.
→ A man walks in the desert at night.
→ A man looks at the sky with aurora in the desert.

Output

Used code and licenses

https://github.com/YingqingHe/LVDM (MIT license)
https://github.com/AILab-CVC/VideoCrafter (Hugging Face Space) (MIT license)

BibTeX

@article{oh2023mtvg,
      title={MTVG: Multi-text Video Generation with Text-to-Video Models},
      author={Oh, Gyeongrok and Jeong, Jaehwan and Kim, Sieun and Byeon, Wonmin and Kim, Jinkyu and Kim, Sungwoong and Kwon, Hyeokmin and Kim, Sangpil},
      journal={arXiv preprint arXiv:2312.04086},
      year={2023}
    }