I Tested 5 LLM APIs for Latency — Here's the Real Data (March 2026)

597 milliseconds. That's how fast Claude Haiku 4.5 delivered its first token on a medium-length prompt. Meanwhile, GPT-4.1 Mini took more than 2.5x longer to start streaming on the same test. If you're building anything user-facing on top of LLM APIs, that gap isn't a rounding error. It's the difference between an app that feels alive and one that feels broken.

I've been shipping LLM-powered features for the past two years, and the single most common question I get from other engineers is: which API is actually the fastest? Not which model is smartest. Not which one has the best reasoning. Which one won't make my users stare at a spinner.

The problem is that most "benchmarks" floating around are either vendor marketing, single-run anecdotes, or tests from six months ago when everything looked completely different. So I ran my own. Five models, three providers, three prompt sizes, three runs each. Here's every number.

The Test Setup

I tested five models across three providers:

  • Claude Sonnet 4 (claude-sonnet-4-6) — Anthropic's flagship reasoning model
  • Claude Haiku 4.5 (claude-haiku-4-5) — Anthropic's speed-optimized model
  • GPT-4.1 — OpenAI's latest full-size model
  • GPT-4.1 Mini — OpenAI's lightweight variant
  • Gemini 2.5 Flash — Google's speed-focused model

Each model got three prompt sizes: short (~50 tokens, "Explain what an API is in two sentences"), medium (~200 tokens, comparing REST vs GraphQL), and long (~500 tokens, a practical guide to rate limiting in Node.js). Three runs per prompt size, all from the same machine, all using streaming APIs for accurate TTFT measurement. Gemini's thinking mode was disabled to keep the comparison fair.

I tracked three metrics that actually matter for production systems: TTFT (time to first token — how long until the user sees something), total latency (full response time), and throughput (tokens per second during generation).
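As a sketch of how those three metrics fall out of a single streaming call: the harness below measures TTFT, total latency, and throughput for any iterable of chunks. The `fake_stream` generator is a stand-in I'm using for illustration; in a real benchmark you'd wrap the provider SDK's streaming iterator instead.

```python
import time

def measure_stream(stream):
    """Measure TTFT, total latency, and throughput for a token stream.

    `stream` is any iterable yielding text chunks; in a real benchmark it
    would wrap a provider's streaming SDK call. Times are in milliseconds.
    """
    start = time.perf_counter()
    ttft_ms = None
    chunks = 0
    for _ in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000  # first token seen
        chunks += 1
    total_ms = (time.perf_counter() - start) * 1000
    # Throughput over the generation phase only (after the first token).
    gen_s = (total_ms - ttft_ms) / 1000
    tok_per_sec = (chunks - 1) / gen_s if gen_s > 0 else float("nan")
    return {"ttft_ms": ttft_ms, "total_ms": total_ms, "tok_per_sec": tok_per_sec}

def fake_stream(n=20, delay_s=0.005):
    """Simulated provider stream, only for exercising the harness."""
    for _ in range(n):
        time.sleep(delay_s)
        yield "tok"
```

One subtlety worth copying: throughput is computed over the generation phase only, so a slow first token doesn't drag down the tokens-per-second figure.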

Three runs per configuration isn't a massive sample. I'm upfront about that. But it's enough to reveal clear performance tiers and catch obvious outliers. The p95 values tell you about consistency, which honestly matters more than the median in most production systems.
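For the record, the p95s below are nearest-rank percentiles over the runs per configuration, which with three samples is effectively the worst run. A minimal version (the sample values are illustrative, shaped like Haiku's medium-prompt row — they are not my raw data):

```python
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: no interpolation, sane for tiny samples."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative TTFT samples, not the actual raw runs.
runs_ms = [590, 597, 612]

median = statistics.median(runs_ms)  # 597
p95 = percentile(runs_ms, 95)        # 612: with three runs, just the max
```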

Short Prompts: Haiku Destroys the Field

| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
| --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 639 (p95: 742) | 952 (p95: 1251) | 62 | ~60 |
| GPT-4.1 | 889 (p95: 1749) | 1770 (p95: 2254) | 42.4 | ~68 |
| Gemini 2.5 Flash | 1753 (p95: 2405) | 2021 (p95: 2532) | 32.7 | ~64 |
| Claude Sonnet 4 | 1946 (p95: 2358) | 2902 (p95: 3202) | 20 | ~58 |
| GPT-4.1 Mini | 2205 (p95: 4004) | 3541 (p95: 4844) | 24.9 | ~77 |

For quick interactions, Claude Haiku 4.5 is in a league of its own. A 639ms TTFT and sub-second total latency (952ms) means the response feels nearly instant. Compare that to GPT-4.1 Mini at 3.5 seconds total. That's not a subtle difference. Users will feel it.

The real surprise here is GPT-4.1 outperforming GPT-4.1 Mini on every single metric. Mini is supposed to be the fast, cheap option. On short prompts, the full GPT-4.1 was faster to first token (889ms vs 2205ms), faster to complete (1770ms vs 3541ms), and produced higher throughput. I ran these tests multiple times because I was convinced I had a bug. I didn't.

GPT-4.1 Mini's p95 TTFT hit 4004ms. Four seconds to see the first token in a worst case. In a chatbot, that's brutal.

Medium Prompts: The Throughput Story Changes

| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
| --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 597 (p95: 612) | 3954 (p95: 4130) | 78.9 | ~299 |
| Gemini 2.5 Flash | 730 (p95: 752) | 1729 (p95: 1966) | 146.5 | ~263 |
| Claude Sonnet 4 | 1042 (p95: 1191) | 7616 (p95: 7809) | 42.4 | ~327 |
| GPT-4.1 Mini | 1523 (p95: 2094) | 5771 (p95: 6062) | 55.8 | ~328 |
| GPT-4.1 | 1696 (p95: 2037) | 5562 (p95: 6065) | 50 | ~292 |

This is where things get interesting. Claude Haiku 4.5 still wins TTFT (597ms, incredibly consistent with a p95 of just 612ms), but Gemini 2.5 Flash takes total latency by a mile: 1729ms vs Haiku's 3954ms. How? Raw throughput. Gemini is pushing 146.5 tokens per second, nearly double Haiku's 78.9.

So Haiku starts talking first, but Gemini finishes talking first. For a streaming chat interface, the user sees Haiku's response begin sooner. But for batch processing, API pipelines, or anything where you care about total wall-clock time, Gemini 2.5 Flash is the clear winner at medium length.

I've shipped enough features to know that this distinction is the one most teams get wrong. They optimize for TTFT because it's the metric that "feels" fast in a demo, then wonder why their batch jobs take twice as long as expected.
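A back-of-envelope model makes the crossover explicit: estimated wall-clock time is roughly TTFT plus output tokens divided by throughput. Plugging in the medium-prompt medians (this linear model ignores warm-up effects, so treat the exact crossover point as indicative, not precise):

```python
def est_total_ms(ttft_ms, out_tokens, tok_per_sec):
    """Rough wall-clock estimate: first-token wait plus generation time."""
    return ttft_ms + out_tokens / tok_per_sec * 1000

# Medium-prompt medians from the tables above.
haiku = lambda n: est_total_ms(597, n, 78.9)
gemini = lambda n: est_total_ms(730, n, 146.5)

# Output length at which Gemini's throughput overtakes Haiku's head start.
crossover = next(n for n in range(1, 2000) if gemini(n) < haiku(n))
print(crossover)  # 23: past ~23 output tokens, Gemini finishes first
```

In other words, at these rates Haiku only wins on wall-clock time for responses shorter than a couple of sentences.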

Long Prompts: Gemini's Throughput Is Absurd

| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
| --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 610 (p95: 843) | 7574 (p95: 8113) | 135.2 | ~1024 |
| Claude Sonnet 4 | 1216 (p95: 4288) | 20445 (p95: 22549) | 50.1 | ~1024 |
| GPT-4.1 | 1670 (p95: 1833) | 16900 (p95: 19871) | 63 | ~1090 |
| Gemini 2.5 Flash | 1885 (p95: 2014) | 6485 (p95: 6713) | 173 | ~1146 |
| GPT-4.1 Mini | 2501 (p95: 2609) | 18075 (p95: 19635) | 62.2 | ~1138 |

173 tokens per second. Gemini 2.5 Flash at long prompts is generating tokens so fast that it finishes a ~1146-token response in 6.5 seconds. GPT-4.1 Mini, producing a similar token count (~1138), takes 18 seconds. Nearly 3x slower.

Haiku keeps its TTFT crown at 610ms and has respectable throughput here too (135.2 tok/s). But look at Claude Sonnet 4's p95 TTFT: 4288ms. That variance is a problem. A p95 that's 3.5x the median means roughly one in twenty requests is going to feel dramatically slower than average. Having built systems that deal with unpredictable latency spikes, I can tell you: high p95 variance is what generates user complaints. Not the median.

The other thing that jumped out: GPT-4.1 and GPT-4.1 Mini performed almost identically on throughput (63 vs 62.2 tok/s) despite Mini being the supposedly speed-optimized variant. At long output lengths, OpenAI's models seem to hit a throughput ceiling around 62-63 tok/s regardless of model size. That's... not great for Mini's value proposition.

What This Means for Your Architecture

| Metric | Winner | Value |
| --- | --- | --- |
| Fastest TTFT | Claude Haiku 4.5 (Medium) | 597ms |
| Lowest Latency | Claude Haiku 4.5 (Short) | 952ms |
| Highest Throughput | Gemini 2.5 Flash (Long) | 173 tok/s |

Here's the thing nobody's saying about LLM latency: the "fastest" model depends entirely on what you're building.

Jakob Nielsen at Nielsen Norman Group established the canonical response time thresholds decades ago: 0.1 seconds feels instantaneous, 1.0 second keeps the user's flow uninterrupted, and at 10 seconds you're losing their attention entirely. With streaming, TTFT is what determines perceived responsiveness. The user sees text start flowing and feels like the system is working.
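Those thresholds are easy to encode when deciding whether a given TTFT is acceptable for your UI. The tier names here are my own shorthand for Nielsen's categories:

```python
def perceived_tier(ttft_ms):
    """Map a TTFT to Nielsen's classic response-time thresholds."""
    if ttft_ms <= 100:
        return "instantaneous"
    if ttft_ms <= 1000:
        return "flow intact"
    if ttft_ms <= 10000:
        return "attention drifting"
    return "attention lost"

print(perceived_tier(597))   # flow intact: Haiku's medium-prompt TTFT
print(perceived_tier(2205))  # attention drifting: GPT-4.1 Mini, short prompt
```

Notice that every median TTFT in these tests lands in the middle two tiers: no model is instantaneous yet, and none is a lost cause.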

So here's how I'd actually pick a model based on these results:

Chat interfaces and interactive tools? Claude Haiku 4.5, no contest. Sub-600ms TTFT with rock-solid consistency (the p95 barely budges from the median). After shipping multiple chat-based features where users are staring at the screen waiting for a response, I can confirm: consistent fast TTFT beats raw throughput for user satisfaction every time.

Batch processing, summarization pipelines, background jobs? Gemini 2.5 Flash. When nobody's watching a cursor blink, you want maximum throughput. 173 tok/s at long outputs means your pipeline runs finish faster and you burn fewer compute-seconds. The math is simple.

Quality-sensitive tasks that need a bigger model? GPT-4.1 is a reasonable middle ground. Not the fastest at anything, but TTFT consistently under 1.7 seconds with decent throughput. Claude Sonnet 4 produces great output, but that 20-second total latency on long prompts is hard to stomach unless quality absolutely demands it.

If you're using GPT-4.1 Mini hoping for speed: stop. Seriously. The full GPT-4.1 matched or beat it on almost every metric I tested. Mini's only edge was slightly higher token output on some prompts, which suggests it's more verbose, not more capable. Unless pricing is your primary constraint, GPT-4.1 is the better pick within OpenAI's lineup.

The Fastest Model Is the One That Fits Your Problem

These numbers will shift. Providers update their infrastructure constantly. I ran these benchmarks in early March 2026 from a single location, and your results from a different region or under different load will vary. Treat this as a snapshot, not a permanent ranking.

But the structural insight holds: TTFT and throughput are different races, and most models are optimized for one or the other. The real engineering decision isn't "which model is fastest." It's "which kind of fast do I need?"

If I had to make one prediction: the gap between TTFT leaders and throughput leaders will narrow over the next 6-12 months as providers optimize their inference stacks. Anthropic is clearly investing in TTFT (Haiku's consistency is remarkable). Google is betting on raw throughput (Gemini Flash's token generation is in a different tier). And OpenAI seems stuck in the middle, with Mini failing to deliver on its speed promise.

The real question is whether any of them break the 200ms TTFT barrier at scale. That's the threshold where LLM responses would feel truly instantaneous. We're not there yet. But 597ms is a lot closer than the 2-3 seconds we were seeing a year ago.

Benchmark your own workloads. Don't trust anyone else's numbers. Including mine.
