Why AI Latency Matters More Than Intelligence: The 232ms Lesson From GPT-4o
232 milliseconds. That's how fast GPT-4o can respond to an audio input. Not a text query where you're already staring at a loading spinner. An actual human voice asking a question and getting an answer back before the conversational beat even feels awkward.

For context, GPT-4's voice mode used to take an average of 5.4 seconds, and GPT-3.5's took 2.8. Those numbers don't sound catastrophic until you actually try to have a conversation with something that pauses for five seconds after every sentence. It doesn't feel like talking to a colleague. It feels like talking to someone on a terrible satellite phone connection in 2003.
The AI industry is locked in an arms race over benchmark scores. Every new model announcement leads with "X% improvement on MMLU" or "surpasses human performance on Y." And look, intelligence matters. I'm not arguing it doesn't. But I think we're collectively ignoring something more fundamental: the moment AI gets fast enough, it stops being a tool you use and becomes a thing you interact with. Those are different categories. And the gap between them matters way more than the next two points on a reasoning benchmark.
The Speed That Changes the Category
Jakob Nielsen identified three critical response time thresholds back in 1993, and they've held up remarkably well. Under 0.1 seconds, the system feels instantaneous. Under 1 second, the user's flow of thought stays uninterrupted. Over 10 seconds, you've lost them entirely.
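Those thresholds are simple enough to sketch directly. This is just an illustration of the buckets described above; the function name and category labels are mine, not Nielsen's.

```python
# Bucket a measured response time into Nielsen's perceived-responsiveness
# tiers (thresholds in seconds, as described in the text). The function
# name and labels are illustrative, not standard terminology.
def perceived_responsiveness(latency_s: float) -> str:
    if latency_s < 0.1:
        return "instantaneous"      # feels like direct manipulation
    if latency_s < 1.0:
        return "flow preserved"     # noticeable, but thought stays uninterrupted
    if latency_s <= 10.0:
        return "attention strained" # the user is clearly waiting
    return "attention lost"         # the user has moved on

print(perceived_responsiveness(0.232))  # GPT-4o's floor → "flow preserved"
print(perceived_responsiveness(5.4))    # old GPT-4 voice → "attention strained"
```

Note where 232ms lands: not in the "instantaneous" bucket, but comfortably inside the one where conversational flow survives.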

These thresholds were defined for graphical interfaces, but they map cleanly onto conversational AI. Research on human turn-taking (Stivers et al., PNAS 2009) shows that the average gap between conversational turns across languages is roughly 200 milliseconds. That's the rhythm humans expect from dialogue. It's baked deep.
GPT-4o's average response time of 320ms, with a floor of 232ms, lands right in that zone. It's not just "fast for an AI model." It's fast enough to feel conversational.
When voice mode took 5.4 seconds with GPT-4, every interaction was a query. You asked, you waited, you got an answer. It was a sophisticated search engine with a microphone. At 320ms, you can interrupt it. It can react to your tone. The whole dynamic shifts from "I am using a tool" to "I am talking to something." That's not a marginal improvement. It's a different product.
How OpenAI Actually Got There
The architecture story behind GPT-4o's speed is more interesting than most people realize. And it has nothing to do with making GPUs go brrr.

Before GPT-4o, OpenAI's voice mode was a Rube Goldberg machine. Your voice hit Whisper for speech-to-text. That transcription got fed to GPT-3.5 or GPT-4 for reasoning. Then the text output got piped through a separate text-to-speech model. Three models, chained sequentially. Every handoff added latency. Every handoff lost information. By the time GPT-4 "heard" your voice, it had already been flattened into text. Tone, emotion, hesitation, emphasis. Gone.
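The arithmetic of that chain is worth making explicit. The per-stage numbers below are illustrative placeholders, not measured values; the point is only that sequential stages and handoffs add, while a unified model pays one cost.

```python
# Rough latency model of the old chained voice pipeline. Stage timings
# and handoff overhead are assumed for illustration, chosen so the total
# lands near the reported 5.4s average.
PIPELINE_STAGES_S = {
    "whisper_stt": 0.9,      # speech-to-text
    "gpt4_reasoning": 3.5,   # text reasoning
    "tts": 0.7,              # text-to-speech
}
HANDOFF_OVERHEAD_S = 0.1     # assumed per-handoff cost (queueing, serialization)

chained = sum(PIPELINE_STAGES_S.values()) \
    + HANDOFF_OVERHEAD_S * (len(PIPELINE_STAGES_S) - 1)
unified = 0.32               # GPT-4o's reported average, end to end

print(f"chained pipeline: {chained:.1f}s")  # 5.3s
print(f"unified model:    {unified:.2f}s")  # 0.32s
```

No single stage is the villain; the structure is. Which is why fixing any one stage of the chain could never have gotten close to 320ms.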
GPT-4o's fix was architectural unification. The "o" stands for "omni" because the model processes text, vision, and audio end-to-end in a single neural network. No chain. No handoffs. Audio in, audio out, reasoning natively across modalities.
This is the kind of boring infrastructure decision that has enormous downstream consequences. It's not a flashy new training technique or a bigger parameter count. It's the decision to stop duct-taping specialized models together and build one that handles everything. The speed improvement is almost a side effect of getting the architecture right.
I've shipped enough features to know this pattern. Nobody writes blog posts about removing a network hop. But removing a network hop is often worth more than any amount of algorithmic cleverness stacked on top of a bad foundation.
When the architecture is right, speed comes almost for free. When it's wrong, no amount of optimization saves you.
The Use Cases That Only Exist Below 500ms
Here's what I keep coming back to: there's an entire class of applications that simply cannot exist above a certain response time. Not "work poorly." Cannot exist.
Real-time translation is the obvious one. If you're sitting across from someone who speaks a different language and your AI translator takes 3 seconds per utterance, the conversation is dead. The social rhythm collapses. Both people end up staring at their phones waiting. At 300ms, the translation becomes invisible. The conversation just flows. This isn't a better version of slow translation. It's a product that couldn't exist before.
Live coding assistance is another. I use AI coding tools daily, and the difference between a 2-second suggestion and a 200ms suggestion isn't comfort. It's whether the tool fits into my flow state or shatters it. At 200ms, the AI feels like autocomplete. At 2 seconds, it feels like waiting for a build. One of those I use a hundred times a day. The other I use when I'm stuck and already out of flow anyway.
Gaming is one people aren't talking about enough. Believable NPCs need to respond at conversational speed or faster. A character that pauses for two seconds before answering every question doesn't feel intelligent. It feels like a loading screen with a face. Sub-300ms responses make dynamic, conversational game characters viable for the first time. That's a big deal for an industry that's been faking NPC conversations with branching dialogue trees for thirty years.
And then there's the one that keeps me up at night: real-time fraud detection on voice calls. Social engineering attacks, deepfake voice scams. This stuff is going to become a massive problem as voice cloning gets cheaper. An AI that can analyze a live call and flag anomalies needs to operate at conversational speed. If it takes 5 seconds to process each utterance, the scammer has already gotten what they need.
None of these are incremental improvements to existing products. They're new categories that only unlock below a latency threshold. The capability was always theoretically possible. The speed is what makes it real.
The Benchmark Trap
The AI industry has a measurement problem, and it's the oldest one in engineering: we optimize for what we can measure. Benchmark performance is easy to measure. MMLU scores, HumanEval pass rates, reasoning test accuracy. These numbers go into papers, press releases, and Twitter threads.
Latency? Cost per token? Reliability at scale? Footnotes. If they're mentioned at all.
This creates a distorted picture of what progress actually looks like. A model that scores 2% higher on a reasoning benchmark but takes three times longer to respond will win the headline and lose the user. I've watched this happen in production. The "smarter" model gets swapped out for the faster one within weeks because users care about the experience of using the thing, not its score on a test they'll never see.
Mira Murati basically said as much when GPT-4o launched: the new model matched GPT-4 Turbo's performance on text and code while being substantially faster and adding native multimodal capabilities. The message was clear. Holding the line on intelligence while dramatically improving speed and efficiency is a legitimate strategy. Maybe the more important one right now.
This is one of those things where the boring answer is actually the right one. The next meaningful leap in AI adoption won't come from a model that's 10% smarter on benchmarks. It'll come from a model that's fast enough, cheap enough, and reliable enough to be embedded everywhere without anyone thinking about it. Speed is the multiplier that turns a research demo into a product people actually use.
What Comes Next
If GPT-4o brought latency from 5.4 seconds down to 320ms, the next generation of models will push toward consistent sub-200ms responses across modalities. At that point, you stop thinking of AI as a separate system you invoke. It's just there. Responding at the speed of thought. Woven into every interaction instead of sitting behind a text box.
I think the companies that win the next phase of AI aren't going to be the ones with the highest scores on increasingly esoteric benchmarks. They're going to be the ones that figure out how to deliver good-enough intelligence at imperceptible latency, at a cost that makes it viable to run on every interaction. Not just the expensive ones.
If you're building on top of these models, pay as much attention to the latency column in the spec sheet as the accuracy column. Probably more. Your users will never see a benchmark score. But they will absolutely feel a 3-second pause. And they won't come back to tell you why they left.
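If you want to act on that advice, a minimal sketch of latency-first evaluation looks like this: time the call your users actually experience, many times, and look at percentiles rather than a single number. `call_model` here is a hypothetical stand-in for whatever client you use; the demo substitutes a 10ms sleep.

```python
import statistics
import time

def measure_latency(call_model, prompt: str, runs: int = 20) -> dict:
    """Time call_model(prompt) over several runs; report the percentiles users feel."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_s": statistics.median(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
        "max_s": samples[-1],
    }

# Demo with a stand-in "model" that just sleeps for 10ms per call:
stats = measure_latency(lambda prompt: time.sleep(0.01), "hello", runs=10)
print(stats)
```

The p95 matters more than the average: users remember the slow responses, not the typical ones.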