Video diffusion models, a rapidly maturing class of generative models, play a pivotal role in synthesizing videos from text descriptions. Despite remarkable advances in neighboring areas, such as ChatGPT for text and Midjourney for images, video generation models often struggle with temporal consistency and natural dynamics. To address these challenges, researchers at Nanyang Technological University's S-Lab developed FreeInit, a method designed to significantly improve video quality by bridging the gap between the training and inference phases of video diffusion models.
FreeInit works by refining the noise initialization step that begins video sampling. Existing models draw Gaussian noise in both the training and inference phases. At inference, however, this noise lacks the low-frequency structure present in the noisy latents seen during training, and the resulting videos suffer from poor temporal consistency. FreeInit closes this gap by iteratively refining the spatial-temporal low-frequency components of the initial noise. The method requires no additional training or learnable parameters and integrates seamlessly into existing video diffusion models at inference time.
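To make the idea of spatial-temporal low-frequency components concrete, the sketch below splits a video latent into low- and high-frequency bands using a 3D FFT over the frame, height, and width axes with a Gaussian low-pass mask. The function names and the cutoff parameter d0 are illustrative assumptions, not FreeInit's exact implementation.

```python
import torch
import torch.fft as fft

def gaussian_low_pass_mask(shape, d0=0.25):
    """Gaussian low-pass mask over the (frames, height, width) axes.
    `d0` is an illustrative cutoff, not a value taken from the paper."""
    T, H, W = shape
    # Normalized frequency coordinates, centered at zero after fftshift.
    t = torch.linspace(-1, 1, T).view(T, 1, 1)
    h = torch.linspace(-1, 1, H).view(1, H, 1)
    w = torch.linspace(-1, 1, W).view(1, 1, W)
    dist_sq = t ** 2 + h ** 2 + w ** 2
    return torch.exp(-dist_sq / (2 * d0 ** 2))

def split_frequencies(latent, d0=0.25):
    """Split a video latent of shape (B, C, T, H, W) into spatial-temporal
    low- and high-frequency components via a 3D FFT."""
    dims = (-3, -2, -1)
    freq = fft.fftshift(fft.fftn(latent, dim=dims), dim=dims)
    mask = gaussian_low_pass_mask(latent.shape[-3:], d0).to(latent.device)
    back = lambda f: fft.ifftn(fft.ifftshift(f, dim=dims), dim=dims).real
    return back(freq * mask), back(freq * (1 - mask))
```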
FreeInit's core technique is noise reinitialization to reduce this training-inference gap. It starts from independent Gaussian noise and runs a denoising pass to produce clean video latents. These clean latents then undergo forward diffusion, yielding noisy latents with improved temporal coherence. The low-frequency components of these noisy latents are combined with the high-frequency components of fresh random Gaussian noise to form the reinitialized noise, which serves as the starting point for the next sampling iteration. Repeating this loop significantly improves the temporal consistency and visual appearance of the generated video.
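Assembling the loop just described, here is a minimal sketch of the reinitialization cycle, reusing split_frequencies from the previous snippet. The pipe object with its denoise method and latent_shape attribute, and the diffusers-style scheduler.add_noise call, are hypothetical stand-ins for whichever video diffusion pipeline FreeInit is plugged into, not the authors' actual API.

```python
import torch

def freeinit_sampling(pipe, scheduler, prompt, num_iters=3, d0=0.25):
    """Iterative noise reinitialization (sketch). `pipe.denoise` is assumed
    to run the full reverse diffusion process and return clean latents."""
    noise = torch.randn(pipe.latent_shape)  # independent Gaussian noise
    for _ in range(num_iters):
        # 1. Denoise to clean video latents.
        z0 = pipe.denoise(noise, prompt)
        # 2. Forward-diffuse the clean latents back to the highest noise
        #    level; their low-frequency band now carries a temporally
        #    coherent layout.
        t_max = scheduler.timesteps[0]
        noisy = scheduler.add_noise(z0, torch.randn_like(z0), t_max)
        # 3. Keep the low frequencies of the diffused latents; take the
        #    high frequencies from fresh Gaussian noise.
        low, _ = split_frequencies(noisy, d0)
        _, high = split_frequencies(torch.randn_like(z0), d0)
        noise = low + high
    # Final sampling pass from the reinitialized noise.
    return pipe.denoise(noise, prompt)
```

Carrying only the low-frequency band between iterations preserves the coherent global layout of the video, while the high-frequency details are resampled on each pass.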
Extensive experiments verified the effectiveness of FreeInit when applied to various text-to-video models, including AnimateDiff, ModelScope, and VideoCrafter. The results were striking, with the temporal consistency metric improving from 2.92 to 8.62. Qualitative and quantitative improvements were evident across a variety of text prompts, demonstrating the versatility and effectiveness of FreeInit in improving video generation models.
By making FreeInit publicly available, the researchers have encouraged its widespread use and further development. Integrating FreeInit into current video diffusion models can significantly advance video generation, bridging a gap between training and inference that has long needed attention.
Image source: Shutterstock