Our Initial Impressions of ChatGPT's Advanced Voice Mode
It's finally here, and it's (mostly) as impressive as we expected it would be.

OpenAI first announced ChatGPT’s Advanced Voice mode in May, as part of the broader announcement of its then-new model, GPT-4o. Since then, like many others, we’ve been patiently waiting to try Advanced Voice for ourselves. We finally got access today and want to share our early impressions here in Confluence. Before we get to that, though, we want to let readers know that we have a few spots left in next week’s virtual seminar on generative AI for communication professionals. The seminar is two half-days, on Monday, September 30 and Tuesday, October 1. If you would like to learn more or join us, you may do so here. With that, let’s get back to ChatGPT’s Advanced Voice mode.
Until now, interacting with ChatGPT via voice relied on a three-step process: first, it would transcribe your voice to text, then it would write a response to that text, and finally it would convert its response to audio. It was useful — as we’ve noted before — but limited. And the biggest limitation was latency: the three-step process took time, creating enough of a delay to make it feel somewhat unnatural and clunky to use.
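For readers who like to see the mechanics, here is a minimal sketch of what that three-step round trip might look like if you stitched it together yourself with OpenAI’s Python SDK. The model names, voice, and file paths are illustrative assumptions on our part, not a description of how the ChatGPT app is built internally:

```python
# Illustrative sketch of the old three-step voice pipeline:
# speech-to-text, then a text reply, then text-to-speech.
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the user's recorded question to text.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: generate a text response to the transcription.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# Step 3: convert the text response back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("reply.mp3", "wb") as out_file:
    out_file.write(speech.content)
```

Each step in a pipeline like this is a separate request, and the waiting adds up, which is a big part of why the old experience felt clunky.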
With Advanced Voice mode, that delay has almost entirely disappeared. Our first and most important impression is that Advanced Voice mode is fast. It still doesn’t perfectly match the rhythm of a human conversation, but it’s close. As Ethan Mollick noted when he shared his impressions from his early access, “Working with ChatGPT via voice seems like talking to a person.” The speed is a big part of that.
Another piece of this more natural conversational dynamic is the ability to interrupt or redirect the conversation. It feels seamless, especially relative to how it worked in the past. In the transcript below, ChatGPT gives us a list of the limitations of Advanced Voice mode, and we interrupt it to ask a clarifying question:
Before Advanced Voice mode, this simple interruption wouldn’t have been possible. Attempting it would likely have derailed (and certainly would have slowed down) the conversation. The exchange above took about four seconds, and the conversation then naturally resumed.
The third positive impression we’ll note is the ability of Advanced Voice mode to recognize — and employ — variations in intonation. Whereas the previous voice mode often sounded like a robot reading a script (which, in many ways, it was), these new subtle variations in tone result in something that feels much more natural.
These individual improvements are largely in line with what we expected based on OpenAI’s announcements and the reflections shared by Mollick and others who had early access. What’s perhaps hardest to express about these changes — and also most impressive — is “the sum of the parts”: how it all adds up to create a more engaging and natural conversational experience, much closer to how we’d interact with another person. We’ve noticed that, because of these collective improvements, we ourselves engage differently, and more naturally, in the conversation. In our limited use of Advanced Voice mode thus far, we’ve found ourselves asking more questions and feeling more inclined to continue the conversation.
We expected Advanced Voice mode to be a major leap forward in capabilities, and it is. But it’s not without limitations, a few of which are worth noting here. The first — which we allude to in the screenshot above — is that ChatGPT cannot access the internet in Advanced Voice mode. Another major limitation is its inability to engage with or employ some of the other features that make ChatGPT so valuable, including custom GPTs. There are certainly others, but those two have stood out to us so far. It’s worth remembering that these limitations are likely to be temporary — both those mentioned above and others we’ll surely come across.
ChatGPT’s Advanced Voice mode is the second “wow” moment we’ve written about in less than a week. The other was our experience with Google NotebookLM’s new audio generation feature. Both are impressive examples of how quickly generative AI audio capabilities are improving, and it’s likely no coincidence that audio is where both of these moments occurred. We noted in May that “The ‘o’ in GPT-4o stands for ‘omni,’ a nod to the fact that this new model is truly multi-modal, capable of processing text, audio, images, and videos, almost in real time.” Neither GPT-4o nor any other model has reached omni-modality yet (at least, not in a form that’s available to the public), but both Advanced Voice mode and NotebookLM demonstrate that progress on the audio front shows no signs of slowing down. Time will tell, but it’s reasonable to expect the same in video and, eventually, in the integration of these modalities into a single experience. In the meantime, there’s plenty of utility to gain from using what’s available to us today. Our advice, as always, is to try for yourself and see.
AI Disclosure: We used generative AI in creating imagery for this post. We also used it selectively as a creator and summarizer of content and as an editor and proofreader.