
Perhaps the only one who can defeat OpenAI is OpenAI itself.

At the end of 2022, OpenAI launched the artificial intelligence chatbot ChatGPT, which started the "arms race" in the field of large models. On February 15, 2024, with the advent of the video generation model Sora, OpenAI once again made a splash.

Sora takes video generation to a new level, and its realistic output has reset perceptions of the boundaries of AI capability. Like a depth charge, Sora instantly set off shockwaves across the global technology community.

Many industry insiders said that the arrival of Sora marks a substantive leap. Pan Hui, an international fellow of the Royal Academy of Engineering and a member of the European Academy of Sciences, said in an interview with NBD, "Sora is currently absolutely unrivalled in video generation quality. Videos generated by Sora can switch from close-ups to panoramic views across different camera positions."

It is worth noting that text-to-video generation is not a completely new field. To better assess Sora's text-to-video capability, NBD used the five Sora video prompts officially released by OpenAI to test Pika, Runway and PixVerse, and compared the results with Sora's videos. The test covered five categories, including character close-ups and movie trailers.

The comparison shows that Sora holds clear advantages in duration, coherence and visual detail, amounting to almost a "dimensionality reduction strike" against its rivals.

From ChatGPT to Sora, why can OpenAI continuously create "game-changing products"?

SIY.Z, a PhD in computer science from UC Berkeley and a Zhihu blogger, offered this analysis: "If I had to use one word to capture OpenAI's core technology, I think it is the scaling law: how to ensure that the bigger the model and the more data, the better the results." From the text generation model GPT and the text-to-image model DALL·E to the text-to-video model Sora, OpenAI may have already forged its own general technology route toward AGI.

Five scenarios tested: Sora achieves a “dimensionality reduction” strike in four aspects, including duration

On February 15, 2024, OpenAI released Sora, its first text-to-video model. The demo videos quickly went viral and sparked heated discussion in the industry. Some netizens even wailed, "I'm going to be unemployed."

Yin Ye, CEO of BGI Group, wrote in an article, "From this moment on, a digital twin world that better fits the physical laws of the real world has entered human society. I would compare it to the beginning of the Newtonian era of AI development."

What makes Sora’s video generation capabilities stand out?

Since Sora has not yet been opened for public testing, NBD used the five Sora video prompts officially released by OpenAI to test the similar models Runway, Pika and PixVerse across five scenarios: street, cartoon animation, character close-up, animal close-up and movie trailer. The results showed that Sora held a commanding advantage over its rivals in video length, coherence and visual detail.

Pan Hui also said in the interview with NBD, “Sora’s core advantage is that it can generate high-definition long videos; in clarity and duration it is currently the best. OpenAI is more focused on photo-realistic technology. It may be too early to discuss whether Sora will lead a new wave, but it is unrivalled in video generation quality.”

However, it should be noted that this comparison of Sora with other models is based on a limited number of prompts and scenes, and the output of text-to-video models can vary from run to run. It is also possible that OpenAI selected the best of multiple Sora generations.

(1) Longer video duration

Compared with Runway, Pika and PixVerse, Sora's videos in the test averaged nearly 16 seconds, with the longest reaching 20 seconds, while the other three models produced clips of only 3-4 seconds. Sora can generate videos up to a minute long, which lets it present content more completely and makes it better suited to short films, advertisements and other applications.

(2) Stronger video coherence

Sora's videos feature seamless transitions, natural camera movements and smooth character animation, enhancing the overall viewing experience. By contrast, videos from the other models often suffer from abrupt scene changes and jerky frames that mar the viewing experience.

Pan Hui said, “Sora can change the perspective within a video: its videos can switch from close-ups to panoramic views across different angles while keeping the characters and objects in the frame consistent. Consistency has always been challenging, and Sora performs well in this respect.”

(3) Richer visual details

In addition, NBD found that Sora's videos offer rich visual detail, clear object textures, realistic colours and higher overall quality. By contrast, videos generated by the other models are often blurry, short on detail and less vivid in colour.

For example, in the video generated for “a woman blinking her eyes”, Sora’s close-up of the woman’s eyes is strikingly accurate, achieving a realistic effect down to the eyebrows, eyelashes, eyelid wrinkles, eye bags, tear troughs and fine lines.

(4) Better adaptability across scenarios

The results across the five scenarios above make it clear that Sora can meet the needs of a wide range of creators.

A China Fortune Securities research report said that Sora’s core technology builds on OpenAI’s years of work in natural language processing and image generation. AI video generation is not new, but Sora is expected to boost the popularity of multimodal AI, which will in turn energize the content consumption market.

From GPT to Sora, OpenAI connects the AGI technology stack

The vividness and coherence of Sora’s videos are striking, and two core breakthroughs underpin this leap in capability:

First, at the level of the underlying architecture, Sora adopts the Diffusion Transformer (DiT) architecture.

OpenAI’s text models, such as GPT-4, are built on the Transformer, while traditional text-to-video models are usually diffusion models. Sora’s DiT architecture merges the two, pairing a diffusion model with a Transformer backbone.
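To make the idea concrete, here is a minimal sketch of a DiT-style block in PyTorch. It follows the public Scalable Diffusion Models with Transformers paper rather than Sora itself, whose implementation is not public; the adaptive layer norm (adaLN) conditioning shown here is a simplified, illustrative version.

```python
# Minimal DiT-style block: a Transformer block whose layer norms are
# modulated by the diffusion timestep embedding (adaLN conditioning).
# Illustrative simplification only; not OpenAI's actual Sora code.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Maps the timestep embedding to scale/shift pairs for both adaLN sites.
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim); t_emb: (batch, dim)
        s1, b1, s2, b2 = self.ada(t_emb).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1              # adaLN before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2              # adaLN before the MLP
        return x + self.mlp(h)

# Usage: one block applied to a batch of noisy patch embeddings.
block = DiTBlock(dim=512, num_heads=8)
x = torch.randn(2, 256, 512)   # 2 videos, 256 spacetime patches each
t = torch.randn(2, 512)        # diffusion-timestep embedding
print(block(x, t).shape)       # torch.Size([2, 256, 512])
```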

The Sora technical report published on OpenAI’s official website shows that the theoretical basis of Sora’s DiT architecture is an academic paper titled Scalable Diffusion Models with Transformers, published in December 2022 and co-authored by William (Bill) Peebles, then a researcher at the University of California, Berkeley and now technical lead of the Sora team, and Xie Saining, a researcher at New York University.

After Sora was released, Xie Saining wrote on the X platform, “When Bill and I were working on the DiT project, instead of creating novelty (see my last tweet), we prioritized two aspects: simplicity and scalability. These priorities offer more than just conceptual advantages.”

Photo/X

Secondly, the spacetime patch is another of Sora’s core innovations, and on this point Sora’s design mirrors GPT-4’s.

The patch can be understood as Sora’s basic unit: each patch is a small segment of video, and a video is a sequence of patches arranged in order. Likewise, GPT-4’s basic unit is the token, a fragment of text; GPT-4 is trained to process a string of tokens and predict the next token. Sora follows the same logic: it processes a series of patches and predicts the next patch in the sequence.

Pan Hui explained to NBD, “By turning video data into small pieces (patches), the model can understand video the way it understands text. Judging from GPT’s past performance, GPT has very good semantic understanding of text. Applying the same principle to video increases the flexibility of the data and the final expressive power of the model.”
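Pan Hui’s analogy can be sketched in a few lines of PyTorch. The helper below, a hypothetical illustration since OpenAI has not disclosed Sora’s actual patching scheme, cuts a video tensor into a flat sequence of spacetime patches, the video counterpart of splitting text into tokens; the patch sizes are arbitrary choices.

```python
# Sketch: turn a video into a sequence of spacetime patches, the way a
# tokenizer turns text into tokens. Patch sizes here are illustrative.
import torch

def to_spacetime_patches(video: torch.Tensor,
                         pt: int = 2, ph: int = 16, pw: int = 16) -> torch.Tensor:
    """Split a (T, C, H, W) video into rows of flattened spacetime patches."""
    t, c, h, w = video.shape
    assert t % pt == 0 and h % ph == 0 and w % pw == 0, "pad the video first"
    # Carve the time and spatial axes into blocks of (pt, ph, pw)...
    blocks = video.reshape(t // pt, pt, c, h // ph, ph, w // pw, pw)
    # ...group the block indices together, then flatten each block into a vector.
    blocks = blocks.permute(0, 3, 5, 1, 2, 4, 6)
    return blocks.reshape(-1, pt * c * ph * pw)

video = torch.randn(16, 3, 256, 256)   # 16 frames of 256x256 RGB
patches = to_spacetime_patches(video)
print(patches.shape)                   # torch.Size([2048, 1536])
```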

Photo/Sora tech report

SIY.Z, a PhD in computer science from the University of California, Berkeley and a Zhihu author, wrote on Zhihu, “If I had to use one word to capture OpenAI’s core technology, I think it is the scaling law: how to ensure that the bigger the model and the more data, the better the results. In one sentence, Sora’s contribution is that, given sufficient data, high-quality annotation and flexible encoding, the scaling law continues to hold on the transformer + diffusion architecture.”
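As a toy illustration of the scaling law SIY.Z refers to, the snippet below fits a power-law curve, loss = a·N^(-b) + c, to a handful of (model size, loss) pairs; the data points are made-up placeholders for demonstration, not OpenAI figures.

```python
# Toy scaling-law fit: loss falls as a power law in model size N.
# The data points below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # hypothetical parameter counts
losses = np.array([4.2, 3.1, 2.4, 1.9, 1.6])   # hypothetical validation losses

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.1, 1.0], maxfev=10000)
print(f"L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at N = 1e11:", round(power_law(1e11, a, b, c), 3))
```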

In his view, the data, annotation, encoding and underlying architecture all draw on the successful experience of earlier large models. Xie Saining also mentioned on the X platform that two key points about Sora remain undisclosed: the source and construction of the training data, and the technical details of (autoregressive) long-video generation.

It can be said that for OpenAI, which is currently all-in on AGI, the path from the text generation model GPT and the text-to-image model DALL·E to the text-to-video model Sora shows it may have forged its own general technology route toward AGI.

It is worth noting that the Sora route, built on this earlier successful experience, may become the new paradigm for text-to-video models. As early as January, a former Alibaba AI expert said on the X platform, “I think the Transformer framework and the LLM route will be a breakthrough and a new paradigm for AI video, making AI videos more coherent, more consistent, and longer. The current diffusion + U-Net route (such as Runway, Pika, etc.) is only a temporary solution.”

$80 billion! OpenAI's valuation tripled in nine months

From the chatbot ChatGPT, to the text-to-image model DALL·E, to the recent text-to-video model Sora, OpenAI has been under the spotlight of the capital market.

In fact, text-to-video is not a new track. Many large text-to-video models already exist, such as Stability AI’s Stable Video Diffusion, Runway’s Gen-2, Google’s Lumiere, Meta’s Make-A-Video, Pika and PixVerse.

Among them, Pika caused a global sensation when it officially released Pika 1.0 last November, making its founder Guo Wenjing famous; Pika 1.0 was even called the strongest competitor to Runway Gen-2. After Sora's rise, however, the map of the text-to-video field may have to be redrawn.

On the one hand, there is a technological gap. Neither the Diffusion Transformer nor the spacetime patch is a new technology, yet only OpenAI has successfully combined them in a product like Sora. Moreover, in the head-to-head text-to-video comparison, Sora indeed delivered a "dimensionality reduction" strike.

On the other hand, in valuation and funding scale, Microsoft-backed OpenAI is arguably the leader among AI startups. That its products can stun the industry with rapid iteration the moment they are released is probably inseparable from the heavy spending behind them.

Following Sora's surge in popularity, OpenAI's valuation has reportedly skyrocketed to over $80 billion. Notably, the company's valuation has tripled in less than nine months.

In addition to its various AI large model products, OpenAI CEO Sam Altman has also set his sights on semiconductors. According to reports, Altman is in contact with various stakeholders, including potential investors, semiconductor manufacturers and energy suppliers, and hopes to raise reportedly as much as $7 trillion to build a chip empire.

With the support of technology and capital, OpenAI is likely to continue to lead the way in the coming years.

In comparison, Runway has raised over $250 million in total funding to date. TechCrunch reported that Runway's valuation reached $1.5 billion in late June last year, and its investors include Google, Nvidia and Salesforce.

Regulation badly needed

Many people believe the advent of Sora could transform a range of creative industries: filmmaking, advertising, graphic design, game development, social media, influencer marketing and even educational technology are all expected to be affected.

"The most direct impact will be on the video production industry. Whether it is common scene or many dangerous scenes, they can be done by AI. This will greatly change the logic of video creation and lower the threshold for video creation. People who don't have video shooting skills can also become excellent video creators through their imagination." Pan Hui said.

He also told NBD that Sora and similar AI video models have shown great commercial potential and market demand across multiple industries. "Industries including media and entertainment, banking, financial services and insurance, retail, and healthcare will benefit greatly from the advancement of generative AI. These technologies can not only optimize marketing and sales and improve customer service, but also strengthen product development and risk management."

Pan Hui also said, "The transformative potential of generative AI in these areas has shown high market demand and huge cross-industry economic value, potentially creating $2.6 trillion to $4.4 trillion in value for various industries."

In addition, according to foreign media reports, Hemant Mohapatra, a partner at Lightspeed India, compared the emergence of Sora to the opening of Pandora's box, something that will change everything: "The quality of the videos it generates is so high that it will immediately threaten stock video companies."

The sell-off in the secondary market already corroborates this. The day after Sora's release, shares of the American software company Adobe plummeted by more than 7%; Shutterstock, a US provider of stock images, footage, music and editing tools, fell by more than 5%; and Google's parent company, which had released the text-to-video tool Lumiere just weeks earlier, fell 1.58%. The three companies lost a combined market value of nearly $48 billion in a single day.

On the other hand, as AI develops rapidly, discussion of its risks has never stopped. How to prevent abuse or misuse, and how to avoid negative effects on people's cognition, are focal concerns for many experts in the industry.

"Video generation is easily abused. Many places have face recognition devices and video generation makes past technologies no longer safe. To mitigate the risks that these technologies may bring, it is crucial to establish sound ethical guidelines, implement strict data privacy measures, and ensure transparency in the development and use of AI models." Pan Hui told NBD.

As AI develops, countries are also moving to regulate it. As early as October last year, the White House issued its first executive order on AI, establishing comprehensive regulatory standards for AI research, development and application. In November last year, representatives from China, the United States, the United Kingdom, the European Union and other parties signed the Bletchley Declaration at the first global AI Safety Summit.

Pan Hui believes that in the future the focus may shift to enhancing AI capabilities while ensuring they are developed and used ethically and responsibly, so as to maximize their positive impact on various industries. "AI video models are moving towards more responsible AI practices, [and there is a need] to invest in R&D to enhance the safety and security of AI applications and to take a proactive approach to these social and ethical issues."

Editor: Alexander