With the unveiling of this text-to-video generator, Shengshu Technology and Tsinghua University have demonstrated their commitment to pushing the boundaries of AI technology.
This partnership highlights the growing importance of AI research and development in China and its potential impact on various industries worldwide.
China's Next Step in AI Innovation
Shengshu Technology and Tsinghua University's joint venture, Vidu, represents a significant milestone in China's AI innovation journey.
This collaboration brings together the expertise of a tech startup and an esteemed academic institution to create a cutting-edge text-to-video generator.
With Vidu's unveiling at the Zhongguancun Forum in Beijing, it has garnered attention as a noteworthy competitor to OpenAI's Sora.
According to Interesting Engineering, Vidu lets users generate high-definition clips of up to 16 seconds with a single click, shorter than the roughly 60-second videos Sora can produce.
While Vidu's functionality may seem limited compared to Sora, its introduction marks a significant step forward in China's AI technology landscape.
As the country continues to invest in AI research and development, Vidu exemplifies China's commitment to innovation and technological advancement.
Zhu Jun, the chief scientist at Shengshu and deputy dean at Tsinghua's Institute for AI, described Vidu as a significant advancement in self-reliant innovation, boasting breakthroughs in various domains.
Vidu is characterized by its imaginative capabilities, ability to simulate the physical world, and capacity to generate 16-second videos with consistent characters, scenes, and timelines.
Furthermore, Zhu highlighted Vidu's proficiency in understanding "Chinese elements." During the model's debut, Shengshu Technology presented several demonstrations, including scenarios such as a panda playing a guitar on grass and a puppy swimming in a pool.
Advancements in Vidu's Architectural Framework
Vidu is built on a proprietary architecture called the Universal Vision Transformer (U-ViT). Its developers have indicated that this architecture combines two approaches to generative AI: diffusion models and the Transformer.
Furthermore, this architectural framework facilitates the creation of lifelike videos featuring dynamic camera movements, intricate facial expressions, and authentic lighting and shadow effects.
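Vidu's actual training setup is not public, but the "diffusion plus Transformer" combination has a well-known shape: a forward process gradually turns data into noise, and a Transformer-based denoiser is trained to reverse it. The sketch below illustrates only the forward noising process on a toy "video" array; the schedule values and shapes are illustrative, not Vidu's configuration.

```python
import numpy as np

def make_alpha_bar(timesteps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, timesteps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0): a progressively noisier version of x_0."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(timesteps=1000)

# A toy "video": 16 frames of 8x8 pixel values (illustrative shape only).
x0 = rng.standard_normal((16, 8, 8))

# Early timestep: mostly signal. Late timestep: almost pure noise.
x_early, _ = add_noise(x0, 10, alpha_bar, rng)
x_late, _ = add_noise(x0, 999, alpha_bar, rng)

# In a U-ViT-style model, a Transformer would be trained to predict the
# injected noise from x_t; generation then runs this process in reverse,
# starting from pure noise and conditioning on the text prompt.
```

The design point is that the diffusion process supplies the generative framework, while the Transformer replaces the convolutional U-Net as the denoising backbone, which is what lets such models scale to long, temporally consistent clips.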
Zhu noted that the introduction of Sora resonated with their technical direction, intensifying their resolve to continue their research efforts.
Unlike ChatGPT, whose November 2022 launch was quickly followed by many Chinese counterparts, Sora has only recently seen Chinese competitors approach its capabilities.
Experts in the industry attribute this delay to the significant challenge of insufficient computing power for Chinese companies.
According to Li Yangwei, a Beijing-based technical consultant specializing in intelligent computing, running Sora requires eight NVIDIA A100 graphics processing units (GPUs) for over three hours to generate a one-minute video clip.
Li notes that Sora demands extensive computing power for inference.
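To put Li's figure in perspective, the implied compute cost works out as follows. This is a back-of-envelope calculation from the numbers quoted above, not a measured benchmark:

```python
# Quoted figure: eight NVIDIA A100 GPUs running for over three hours
# to generate a one-minute video clip with Sora.
gpus = 8
hours = 3

# Total compute per minute of generated video, in GPU-hours.
gpu_hours_per_minute = gpus * hours  # 24 GPU-hours

# Equivalently, each second of output costs 24 GPU-minutes of inference.
gpu_minutes_per_second = gpu_hours_per_minute * 60 / 60
```

At that rate, even a 16-second clip like Vidu's would represent several GPU-hours of inference, which illustrates why access to accelerators is the bottleneck industry experts describe.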