Just as Apple and Google are turning their voice assistants into chatbots, OpenAI is turning its chatbot into a voice assistant.
On Monday, the San Francisco artificial intelligence startup announced a new version of its ChatGPT chatbot that can receive and respond to voice commands, images, and videos.
The company says the new app is based on an AI system called GPT-4o that can process audio, images, and video significantly faster than previous versions of the technology. The app will be available for free on both smartphones and desktop computers starting Monday.
“We are looking to the future of ourselves and our interactions with machines,” said Mira Murati, the company's chief technology officer.
The new app is part of a broader effort to combine conversational chatbots like ChatGPT with voice assistants like Google Assistant and Apple's Siri. As Google integrates its Gemini chatbot with Google Assistant, Apple is preparing a new version of Siri that will be more conversational.
OpenAI said it will gradually share the technology with users “over the coming weeks.” This is the first time ChatGPT is being offered as a desktop application.
The company previously offered similar technology within a variety of free and paid products. Now it has combined them into a single system available across all of its products.
During the event, which was streamed over the internet, Murati and her colleagues showed off the new app responding to conversational voice commands, using a live video feed to analyze math problems written on paper, and reading aloud a playful story that it had written on the spot.
The new app cannot generate video. However, it can generate still images that represent frames of a video.
With the debut of ChatGPT in late 2022, OpenAI showed that machines could handle requests much as humans do. In response to conversational text prompts, the chatbot could answer questions, write periodic reports, and generate computer code.
ChatGPT was not driven by a set of rules. The technology acquired its skills by analyzing vast amounts of text culled from across the internet, including Wikipedia articles, books, and chat logs. Experts touted the technology as a potential replacement for search engines like Google and voice assistants like Siri.
The new version of the technology also learned from sounds, images, and videos. Researchers call this “multimodal AI.” Basically, companies like OpenAI have started combining chatbots with their AI image, audio, and video generators.
(The New York Times sued OpenAI and its partner Microsoft in December, alleging copyright infringement of news content related to its AI systems.)
Many hurdles remain when companies combine chatbots and voice assistants. Because chatbots learn their skills from internet data, they are prone to mistakes. In some cases, they fabricate information entirely, a phenomenon that AI researchers call “hallucination.” These flaws are also carrying over into voice assistants.
Chatbots can generate convincing language, but they are less adept at tasks such as scheduling meetings or booking airplane flights. Companies like OpenAI, however, are working to transform them into “AI agents” that can reliably handle these tasks.
OpenAI previously offered a version of ChatGPT that could accept voice commands and respond with voice. But it was a patchwork of three different AI technologies: one that converted speech to text, one that generated a text response, and one that converted that text to synthetic speech.
The new app is based on a single AI technology, GPT-4o, that can accept and generate text, audio, and images. This makes the technology more efficient, so the company can afford to offer it to users for free, Murati said.
“Previously, there were delays as a result of three models working together,” Murati said in an interview with The New York Times. “You want an experience like the one we have, where you can have a very natural interaction.”