The advent of OpenAI's Sora seems to grant AI the potential to truly see and understand the world. Officially, Sora is billed as a "world simulator": it comprehends the rules of reality and generates a "world" according to those principles.

This June, large-scale video generation models ushered in a new wave of product launches. Phenomenal hits such as Kuaishou's KLing, Luma AI, and Runway's updated Gen-3 Alpha continue to emerge. The most discussed among them is KLing, Kuaishou's self-developed video generation model, which is positioned as a comprehensive benchmark against Sora.

National Business Daily (NBD) fed KLing the five official Sora video prompts released by OpenAI to test its text-to-video ability, and compared the results across multiple dimensions such as dynamics, visual effects, detail, and scene handling.

The tests showed that although Sora still leads in some dimensions, and KLing, being in the early stages of its open release, offers relatively limited functions and has certain generation constraints, it can almost be concluded that domestic large-scale video generation models have reached a new height.

Chen Zemin, chief internet and media analyst at SinoLink Securities, said in an interview that, setting the technology aside, Kuaishou's KLing has achieved data-backed rendering of part of a fully realistic worldview, together with AI's rapid comprehension of different worldviews and its expression of them in video. "This is what I find incredible," he said.

NBD selected five official Sora video prompts from OpenAI (a lady on the streets of Tokyo, an astronaut, a coastal view from a drone's perspective, a 3D animated monster, and a young person reading in the clouds) to test KLing. The generated videos were then evaluated, with the results summarized across several specific criteria under two dimensions: "screen presentation" and "function and experience."

In terms of screen presentation:

1. Dynamic Effects. KLing's "shots" generally follow a simple push-in/pull-out logic, while Sora's camera work is more varied. In the "drone view of waves crashing against cliffs" test, Sora focuses on the "small island with a lighthouse" mentioned in the prompt, providing a panoramic view of the environment with the island emphasized. KLing's camera moves forward and backward, leaving the island at the far end of the frame without highlighting it. However, both Sora and KLing accurately depict the trajectory of the waves.

Sora (above) and KLing (below)

2. Visual Effects. Both Sora and KLing perform quite well in this category. In the "lady on the streets of Tokyo" video in particular, both models accurately render the colors of the neon lights and the reflections on the wet pavement.

Sora (above) and KLing (below)

3. Detail Performance. The reporter paid special attention to the rendering of human faces. Sora's depiction is more detailed: even in dynamic scenes, facial features do not distort and remain stable. KLing's facial features, by contrast, may warp in dynamic scenes, twisting as the camera advances and the subject moves.

Sora (above) and KLing (below)

4. Coherence and Smoothness. Both Sora and KLing produce coherent, smooth footage, but Sora is notably superior at depicting complex scenes. In the "astronaut" video in particular, KLing provides only a frontal close-up of the astronaut, while Sora cuts between distant and close views and adds auxiliary elements such as a spacecraft.

Sora (above) and KLing (below)

In terms of function and experience, both Sora's and KLing's generated videos tend toward simulating real-world scenes. Whether the subject is science fiction, natural scenery, supernatural imagery, or 3D animation, both lean toward a realistic style.

In adapting to different scenes, Sora's capabilities appear superior. In the "young person reading in the clouds" scene, KLing's textures are heavier and its materials blend less seamlessly. In semantic understanding, both Sora and KLing are reasonably accurate, and both capture the different subjects in the prompts fairly completely.

However, it should be noted that the videos KLing currently generates are all 5 seconds long, compared with Sora's 10 to 20 seconds, which limits its ability to narrate complex scenes. In this early stage of its launch, KLing's functions are also more limited, with constraints on style switching.

In addition, NBD found during testing that KLing sometimes "misfires." For example, a panda playing the guitar had human fingers, and the "light green fabric sofa" in a prompt appeared as a reddish-brown leather sofa in the video. Likewise, in some videos with multiple subjects, certain elements failed to appear at all.

It is worth noting that the KLing videos above were generated by the reporter for testing, and results may vary across generations. Sora has not yet opened to the public, so its videos are all official releases; once Sora opens for testing, users' actual results may likewise differ from the officially released videos.

Editor: Gao Han