Google's AI Tool Imagen Video Lets You Generate High-Res Videos from a Text Prompt

The tool can produce videos with a maximum resolution of 1280×768 at a frame rate of 24 fps.

AI-generated artwork has been on the rise lately. Tools such as DALL-E, MidJourney, and Stable Diffusion are already changing the landscape of art, as more people can generate digital artworks with mere text prompts.

But what happens if this text-to-image trend levels up to video? What if you type the prompt "A cow jumps over the moon" and get a video clip of that scene?

Or perhaps something more epic, such as "Flying through an intense battle between pirate ships in a stormy ocean."

Thanks to Google's video-generating AI tool, prompts like these can now be turned into moving pictures.

Imagen Video

Google's Imagen Video, a text-to-video generative AI model that can create high-definition videos from text input, was announced on Oct. 5.

The text-conditioned video diffusion model is capable of producing videos with a maximum resolution of 1280×768 at a frame rate of 24 fps, as first reported by VentureBeat.

In its recently released paper, "Imagen Video: High Definition Video Generation with Diffusion Models," Google says that Imagen Video has a high degree of controllability and world knowledge and can produce videos with high fidelity.

The generative model can produce a variety of videos and text animations in various artistic styles, shows an understanding of 3D objects, and can render and animate text. The model is still in a research phase, but its arrival just five months after Imagen highlights how quickly synthesis-based models are developing.

Imagen Video consists of a text encoder (frozen T5-XXL), a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. According to Google, this design draws on the knowledge gained from past research on diffusion-based image generation.

The research team also applied progressive distillation to the video models, together with classifier-free guidance, for fast, high-quality sampling.
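For readers curious how guidance can work without a separate classifier, here is a minimal sketch of the commonly used classifier-free guidance rule, which blends a text-conditioned prediction with an unconditioned one. It is not Google's code. The function name, the plain lists standing in for model outputs, and the guidance weight `w` are all assumptions made for illustration.

```python
def guided_prediction(cond, uncond, w):
    """Blend conditional and unconditional noise predictions.

    cond, uncond: equal-length lists of floats (hypothetical stand-ins
    for a diffusion model's outputs with and without the text prompt).
    w: guidance weight; w = 0 returns the conditional prediction unchanged.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    # Push the result away from the unconditional prediction and
    # toward the text-conditioned one.
    return [(1 + w) * c - w * u for c, u in zip(cond, uncond)]


# Toy numbers only; real predictions would be large tensors.
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.25, -0.5, 1.0]
print(guided_prediction(eps_cond, eps_uncond, w=2.0))
```

A larger weight generally makes the output follow the prompt more closely, at some cost to variety.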

The video generation framework is a cascade of seven video diffusion sub-models that carry out text-conditional video generation, spatial super-resolution, and temporal super-resolution.

The entire cascade produces high-definition 1280×768 videos at 24 frames per second for 128 frames, or roughly 126 million pixels.
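The pixel count can be checked with quick arithmetic. The short sketch below uses only the figures quoted in this article (a 1280 by 768 frame, 128 frames, 24 fps); the variable names are illustrative.

```python
# Figures taken from the article: 1280 by 768 resolution, 128 frames, 24 fps.
WIDTH, HEIGHT = 1280, 768
FRAMES = 128
FPS = 24

total_pixels = WIDTH * HEIGHT * FRAMES  # pixel positions across the whole clip
duration_seconds = FRAMES / FPS         # clip length at the stated frame rate

print(f"{total_pixels:,} pixels")
print(f"{duration_seconds:.2f} seconds")
```

That works out to 125,829,120 pixels, which rounds to the 126 million cited, spread over a clip a little more than five seconds long.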

Among the model's many impressive creative skills are its ability to create videos inspired by the paintings of well-known artists like Vincent van Gogh, display spinning objects in 3D while maintaining their structure, and render text in a variety of animation styles.

Is Imagen Video Available for the Public?

Since generative models may be misused to generate harmful content, Google said that it has taken several steps to address these concerns. The company said that in internal trials it applies filtering to input text prompts and to output video content.

However, Google issued a warning that there are still several significant ethical and safety issues that need to be resolved.

Hence, the company has not yet publicly released the model, as it still has to work on these concerns and mitigate potential risks.

This article is owned by Tech Times

Written by Joaquin Victor Tacla

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.