At 1 AM Beijing time on Tuesday, OpenAI, which had not delivered any major surprise to the market since the “Sora video model” earlier this year, held its spring launch event. Mira Murati, the company’s Chief Technology Officer, showcased several updates to ChatGPT. In brief, the event accomplished two main things. The first was the release of GPT-4o, OpenAI’s latest multimodal large model, which is faster and cheaper than GPT-4 Turbo.
The second was that free ChatGPT users will now have access to the newly released GPT-4o model (previously only GPT-3.5 was available) for tasks such as data analysis, image analysis, and internet search, along with access to the GPT Store. This means developers on the GPT Store will see a massive influx of new users.
Paid users, of course, get a higher message limit (at least five times higher, according to OpenAI). When free users exhaust their message quota, ChatGPT automatically switches back to GPT-3.5.
Additionally, OpenAI plans to bring an improved GPT-4o-based voice experience to Plus users in about a month; for now, the GPT-4o API does not include voice functionality. Mac users also get a ChatGPT desktop application designed for macOS, which lets them “capture” their screen with a keyboard shortcut and ask ChatGPT questions about it. OpenAI says a Windows version will be released later this year.
It’s worth noting that Mira Murati said during the livestream, “This is the first time we have truly taken a big step forward in terms of ease of use.”
Backed by Microsoft, OpenAI is currently valued by investors at more than 80 billion dollars. Founded in 2015, the company is under pressure to maintain its lead in generative AI while finding ways to turn a profit, given its heavy spending on the processors and infrastructure needed to build and train its models.
Real-time Translation
Mira Murati emphasized the need for safety in GPT-4o’s real-time voice and audio features, saying OpenAI will continue to roll out the full set of capabilities iteratively.
In one demonstration, Mark Chen, OpenAI’s research director, pulled out his phone, opened ChatGPT in Voice Mode, and asked the GPT-4o-powered ChatGPT for advice. GPT’s voice sounded like an American woman’s, and when it heard Chen breathing heavily, it seemed to pick up on his nervousness. “Mark, you’re not a vacuum cleaner,” it said, telling him to slow his breathing. In a notable change, users can now interrupt GPT mid-response, and GPT-4o’s response delay typically did not exceed two to three seconds.
In another demonstration, Barret Zoph, head of OpenAI’s post-training team, wrote an equation, 3x + 1 = 4, on a whiteboard. ChatGPT gave him hints, guided him through each step of the solution, recognized his handwritten work, and helped him solve for x. Throughout the process, GPT acted like a real-time math tutor; it was able to recognize mathematical symbols and even a heart shape.
Responding to a user request on the social platform X, Mira Murati spoke to ChatGPT in Italian live on stage, and GPT translated her words into English for Zoph and Chen: “Mark, she wants to know: if whales could talk, what would they tell us?”
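For developers curious about reproducing something similar, the sketch below approximates the translation flow in text form using the OpenAI Python SDK and the publicly announced “gpt-4o” model name. It is an illustrative assumption rather than the voice-based setup used on stage (voice is not yet exposed through the API, as noted above), and the helper name, prompt wording, and sample sentence are placeholders.

```python
# Minimal sketch: a text-only approximation of the Italian-to-English
# translation shown in the demo, via the Chat Completions API.
# The prompt wording and sample sentence are illustrative placeholders,
# not OpenAI's actual on-stage setup (which ran over live voice).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def translate(text: str, source: str = "Italian", target: str = "English") -> str:
    """Ask GPT-4o to translate `text` from `source` to `target`."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    f"You are a live interpreter. Translate every {source} "
                    f"message into natural spoken {target} and output only "
                    f"the translation."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


print(translate("Mark, se le balene potessero parlare, cosa ci direbbero?"))
```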
OpenAI claims GPT-4o can also detect human emotions. In the demonstration, Zoph held his phone up to his face and asked ChatGPT to tell him what he looked like. Initially, GPT referred to a photo he had shared earlier and identified him as a “wooden surface.” After a second attempt, GPT gave a better answer: it noticed the smile on Zoph’s face and said, “You seem very happy, all smiles.” Some commenters noted that the demonstration showed ChatGPT can read human emotions, though doing so still appears somewhat challenging for it.
OpenAI executives said GPT-4o can interact with codebases, and demonstrated its ability to draw conclusions from a global temperature chart it was shown. OpenAI announced that GPT-4o-powered text and image input in ChatGPT would go live that Monday, with voice and video options to follow in the coming weeks.
According to PitchBook data cited in media reports, nearly 700 generative AI deals injected a record 29.1 billion dollars into the sector in 2023, up more than 260% from the previous year. Forecasts suggest the market will surpass 1 trillion dollars in revenue within the next decade. Some industry insiders have expressed concern about how quickly untested new services are being brought to market, while academics and ethicists worry about the technology’s tendency to propagate bias.
Since its launch in November 2022, ChatGPT has set a record as the fastest-growing consumer application of its time, and it now has nearly 100 million weekly active users. OpenAI reports that over 92% of Fortune 500 companies use the platform.
At Monday’s event, Murati expressed OpenAI’s desire to “remove some of the mystique from technology.” She also said, “In the coming weeks, we will roll out these features to everyone.”
At the end of the livestream, Murati thanked Nvidia CEO Jensen Huang and his company for providing the Graphics Processing Units (GPUs) that power OpenAI’s technology. She said, “I just want to thank the outstanding OpenAI team, as well as Jensen Huang and the Nvidia team for bringing us the most advanced GPUs, making today’s demonstration possible.”
Responding to audio input in as little as 232 milliseconds
OpenAI’s official website introduces GPT-4o, explaining that the “o” stands for “omni” and signifies a step toward more natural human-computer interaction: the model accepts any combination of text, audio, and images as input and generates any combination of text, audio, and image outputs.
Besides being faster and significantly cheaper through the API, GPT-4o can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response times in conversation. It matches GPT-4 Turbo’s performance on English text and code, and its performance on non-English text is significantly improved.
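As a concrete illustration of the API side, here is a minimal sketch of a combined text-and-image request to GPT-4o using the OpenAI Python SDK. The image URL and question are placeholders (echoing the temperature-chart demo), and audio input and output are omitted because, as the article notes, they were not yet available through the API.

```python
# Minimal sketch of a combined text + image request to GPT-4o.
# The image URL is a placeholder; audio input/output was not yet
# exposed through the API at launch (see above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this temperature chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/global-temperature-chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```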
OpenAI explains that compared to existing models, GPT-4o excels particularly in vision and audio understanding. Previously, users conversing with ChatGPT in Voice Mode using GPT-3.5 and GPT-4 experienced average latencies of 2.8 seconds and 5.4 seconds, respectively, because OpenAI used three separate models for such conversations: one model transcribed audio to text, another received and output text, and a third converted the text back to audio. This process meant that GPT lost a lot of information; it could not directly observe tone, multiple speakers, or background noise, nor could it output laughter, singing, or express emotion.
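The three-model cascade described above can be approximated with OpenAI’s public endpoints. The sketch below is an illustrative reconstruction under that assumption (whisper-1 for transcription, a chat model for the reply, tts-1 for synthesis, placeholder file names); it is not OpenAI’s internal Voice Mode implementation, but it shows why latency and information loss accumulate across the stages.

```python
# Illustrative reconstruction of the older three-model voice pipeline:
# transcribe audio, run the text through a chat model, then synthesize speech.
# Each stage adds latency, and the middle model never "hears" tone,
# multiple speakers, or background noise. File names are placeholders.
from openai import OpenAI

client = OpenAI()

# 1) Speech -> text
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2) Text -> text
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3) Text -> speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```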
However, GPT-4o’s voice conversation is the result of OpenAI training a new model end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. OpenAI states that GPT-4o is their first model combining all these modalities, so they are still only beginning to explore the model’s capabilities and limitations.
Last week there were reports that OpenAI would release an AI-powered search product, but on Friday OpenAI CEO Sam Altman denied them, saying Monday’s demonstration would be neither GPT-5 nor a search engine. This means OpenAI did not launch an AI search product on the timeline the market had speculated. Media subsequently reported that OpenAI’s new product might instead be a brand-new multimodal AI model with vision and hearing, possessing better logical reasoning capabilities than current chatbots.