I Tested 5 LLM APIs for Latency — Here's the Real Data (March 2026)
597 milliseconds. That's how fast Claude Haiku 4.5 delivered its first token on a medium-length prompt. Meanwhile, GPT-4.1 Mini took more than 2.5x longer to start streaming on the same test. If you're building anything user-facing on top of LLM APIs, that gap isn't a rounding error. It's the difference between an app that feels alive and one that feels broken.

I've been shipping LLM-powered features for the past two years, and the single most common question I get from other engineers is: which API is actually the fastest? Not which model is smartest. Not which one has the best reasoning. Which one won't make my users stare at a spinner.
The problem is that most "benchmarks" floating around are either vendor marketing, single-run anecdotes, or tests from six months ago when everything looked completely different. So I ran my own. Five models, three providers, three prompt sizes, three runs each. Here's every number.
The Test Setup
I tested five models across three providers:

- Claude Sonnet 4 (`claude-sonnet-4-6`) — Anthropic's flagship reasoning model
- Claude Haiku 4.5 (`claude-haiku-4-5`) — Anthropic's speed-optimized model
- GPT-4.1 — OpenAI's latest full-size model
- GPT-4.1 Mini — OpenAI's lightweight variant
- Gemini 2.5 Flash — Google's speed-focused model
Each model got three prompt sizes: short (~50 tokens, "Explain what an API is in two sentences"), medium (~200 tokens, comparing REST vs GraphQL), and long (~500 tokens, a practical guide to rate limiting in Node.js). Three runs per prompt size, all from the same machine, all using streaming APIs for accurate TTFT measurement. Gemini's thinking mode was disabled to keep the comparison fair.
I tracked three metrics that actually matter for production systems: TTFT (time to first token — how long until the user sees something), total latency (full response time), and throughput (tokens per second during generation).
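All three metrics fall straight out of timestamps on the token stream. Here's a minimal sketch of the measurement logic; `fake_stream` is a hypothetical stand-in for any provider's streaming iterator (not a real SDK call), so `measure` works against whatever stream you hand it:

```python
import time
from typing import Iterator

def fake_stream(n_tokens: int = 50, ttft_s: float = 0.05,
                per_token_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming API:
    waits for a simulated TTFT, then yields tokens with a fixed gap."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

def measure(stream: Iterator[str]) -> dict:
    """Compute TTFT, total latency, and generation throughput
    from timestamps taken as tokens arrive."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # time to first token
        count += 1
    total = time.monotonic() - start    # full response time
    gen_time = total - (ttft or 0.0)    # time spent generating after token 1
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return {"ttft_ms": (ttft or 0.0) * 1000,
            "latency_ms": total * 1000,
            "tokens_per_sec": tps,
            "tokens": count}
```

Swap `fake_stream` for a real streaming call and the harness is unchanged, which is also why streaming APIs are a hard requirement here: without them, TTFT simply isn't observable.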
Three runs per configuration isn't a massive sample, and I'm upfront about that: with only three runs, the p95 is effectively your worst run. But it's enough to reveal clear performance tiers and catch obvious outliers. Those worst-case values tell you about consistency, which honestly matters more than the median in most production systems.
Short Prompts: Haiku Destroys the Field
| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
|---|---|---|---|---|
| Claude Haiku 4.5 | 639 (p95: 742) | 952 (p95: 1251) | 62 | ~60 |
| GPT-4.1 | 889 (p95: 1749) | 1770 (p95: 2254) | 42.4 | ~68 |
| Gemini 2.5 Flash | 1753 (p95: 2405) | 2021 (p95: 2532) | 32.7 | ~64 |
| Claude Sonnet 4 | 1946 (p95: 2358) | 2902 (p95: 3202) | 20 | ~58 |
| GPT-4.1 Mini | 2205 (p95: 4004) | 3541 (p95: 4844) | 24.9 | ~77 |
For quick interactions, Claude Haiku 4.5 is in a league of its own. A 639ms TTFT and sub-second total latency (952ms) means the response feels nearly instant. Compare that to GPT-4.1 Mini at 3.5 seconds total. That's not a subtle difference. Users will feel it.

The real surprise here is GPT-4.1 outperforming GPT-4.1 Mini on every single metric. Mini is supposed to be the fast, cheap option. On short prompts, the full GPT-4.1 was faster to first token (889ms vs 2205ms), faster to complete (1770ms vs 3541ms), and produced higher throughput. I ran these tests multiple times because I was convinced I had a bug. I didn't.
GPT-4.1 Mini's p95 TTFT hit 4004ms. Four seconds to see the first token in a worst case. In a chatbot, that's brutal.
Medium Prompts: The Throughput Story Changes
| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
|---|---|---|---|---|
| Claude Haiku 4.5 | 597 (p95: 612) | 3954 (p95: 4130) | 78.9 | ~299 |
| Gemini 2.5 Flash | 730 (p95: 752) | 1729 (p95: 1966) | 146.5 | ~263 |
| Claude Sonnet 4 | 1042 (p95: 1191) | 7616 (p95: 7809) | 42.4 | ~327 |
| GPT-4.1 Mini | 1523 (p95: 2094) | 5771 (p95: 6062) | 55.8 | ~328 |
| GPT-4.1 | 1696 (p95: 2037) | 5562 (p95: 6065) | 50 | ~292 |
This is where things get interesting. Claude Haiku 4.5 still wins TTFT (597ms, incredibly consistent with a p95 of just 612ms), but Gemini 2.5 Flash takes total latency by a mile: 1729ms vs Haiku's 3954ms. How? Raw throughput. Gemini is pushing 146.5 tokens per second, nearly double Haiku's 78.9.
So Haiku starts talking first, but Gemini finishes talking first. For a streaming chat interface, the user sees Haiku's response begin sooner. But for batch processing, API pipelines, or anything where you care about total wall-clock time, Gemini 2.5 Flash is the clear winner at medium length.
I've shipped enough features to know that this distinction is the one most teams get wrong. They optimize for TTFT because it's the metric that "feels" fast in a demo, then wonder why their batch jobs take twice as long as expected.
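You can see why the ranking flips with a back-of-envelope model: total latency is roughly TTFT plus generation time. This is a first-order estimate (it assumes throughput is constant after the first token, which only approximately matches how the measured numbers decompose), plugged with the medium-prompt figures from the table above:

```python
def est_latency_ms(ttft_ms: float, output_tokens: int,
                   tokens_per_sec: float) -> float:
    """First-order model: total latency ~= TTFT + tokens / throughput."""
    return ttft_ms + (output_tokens / tokens_per_sec) * 1000.0

# Medium-prompt medians from the table above.
haiku = est_latency_ms(597, 299, 78.9)     # starts streaming first...
gemini = est_latency_ms(730, 263, 146.5)   # ...but finishes first
```

The 133ms TTFT gap Haiku wins by is swamped by the roughly 2 seconds Gemini saves on generation, which is the whole crossover in one line of arithmetic.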
Long Prompts: Gemini's Throughput Is Absurd
| Model | TTFT (ms) | Latency (ms) | Tokens/sec | Output Tokens |
|---|---|---|---|---|
| Claude Haiku 4.5 | 610 (p95: 843) | 7574 (p95: 8113) | 135.2 | ~1024 |
| Claude Sonnet 4 | 1216 (p95: 4288) | 20445 (p95: 22549) | 50.1 | ~1024 |
| GPT-4.1 | 1670 (p95: 1833) | 16900 (p95: 19871) | 63 | ~1090 |
| Gemini 2.5 Flash | 1885 (p95: 2014) | 6485 (p95: 6713) | 173 | ~1146 |
| GPT-4.1 Mini | 2501 (p95: 2609) | 18075 (p95: 19635) | 62.2 | ~1138 |
173 tokens per second. Gemini 2.5 Flash at long prompts is generating tokens so fast that it finishes a ~1146-token response in 6.5 seconds. GPT-4.1 Mini, producing a similar token count (~1138), takes 18 seconds. Nearly 3x slower.
Haiku keeps its TTFT crown at 610ms and has respectable throughput here too (135.2 tok/s). But look at Claude Sonnet 4's p95 TTFT: 4288ms. That variance is a problem. A p95 that's 3.5x the median means roughly one in twenty requests is going to feel dramatically slower than average. Having built systems that deal with unpredictable latency spikes, I can tell you: high p95 variance is what generates user complaints. Not the median.
The other thing that jumped out: GPT-4.1 and GPT-4.1 Mini performed almost identically on throughput (63 vs 62.2 tok/s) despite Mini being the supposedly speed-optimized variant. At long output lengths, OpenAI's models seem to hit a throughput ceiling around 62-63 tok/s regardless of model size. That's... not great for Mini's value proposition.
What This Means for Your Architecture
| Metric | Winner | Value |
|---|---|---|
| Fastest TTFT | Claude Haiku 4.5 (Medium) | 597ms |
| Lowest Latency | Claude Haiku 4.5 (Short) | 952ms |
| Highest Throughput | Gemini 2.5 Flash (Long) | 173 tok/s |
Here's the thing nobody's saying about LLM latency: the "fastest" model depends entirely on what you're building.
Jakob Nielsen at Nielsen Norman Group established the canonical response time thresholds decades ago: 0.1 seconds feels instantaneous, 1.0 second keeps the user's flow uninterrupted, and at 10 seconds you're losing their attention entirely. With streaming, TTFT is what determines perceived responsiveness. The user sees text start flowing and feels like the system is working.
So here's how I'd actually pick a model based on these results:
Chat interfaces and interactive tools? Claude Haiku 4.5, no contest. Sub-600ms TTFT with rock-solid consistency (the p95 barely budges from the median). After shipping multiple chat-based features where users are staring at the screen waiting for a response, I can confirm: consistent fast TTFT beats raw throughput for user satisfaction every time.
Batch processing, summarization pipelines, background jobs? Gemini 2.5 Flash. When nobody's watching a cursor blink, you want maximum throughput. 173 tok/s at long outputs means your pipeline runs finish faster and you burn fewer compute-seconds. The math is simple.
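To make "the math is simple" concrete, here's a hypothetical sequential pipeline pushing 1,000 documents at ~1,100 output tokens each, using the long-prompt medians from above. Real pipelines parallelize requests, so treat the absolute numbers as illustrative and the ratio as the point:

```python
def batch_wall_clock_s(docs: int, tokens_per_doc: int,
                       ttft_s: float, tokens_per_sec: float) -> float:
    """Sequential worst case: every request pays TTFT plus generation time."""
    return docs * (ttft_s + tokens_per_doc / tokens_per_sec)

gemini_s = batch_wall_clock_s(1000, 1100, 1.885, 173.0)  # roughly 2.3 hours
mini_s = batch_wall_clock_s(1000, 1100, 2.501, 62.2)     # roughly 5.6 hours
```

Same job, same day, but one pipeline finishes before lunch. Parallelism shrinks both numbers, not the ratio between them.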
Quality-sensitive tasks that need a bigger model? GPT-4.1 is a reasonable middle ground. Not the fastest at anything, but TTFT consistently under 1.7 seconds with decent throughput. Claude Sonnet 4 produces great output, but that 20-second total latency on long prompts is hard to stomach unless quality absolutely demands it.
If you're using GPT-4.1 Mini hoping for speed: stop. Seriously. The full GPT-4.1 matched or beat it on almost every metric I tested. Mini's only edge was slightly higher token output on some prompts, which suggests it's more verbose, not more capable. Unless pricing is your primary constraint, GPT-4.1 is the better pick within OpenAI's lineup.
The Fastest Model Is the One That Fits Your Problem
These numbers will shift. Providers update their infrastructure constantly. I ran these benchmarks in early March 2026 from a single location, and your results from a different region or under different load will vary. Treat this as a snapshot, not a permanent ranking.
But the structural insight holds: TTFT and throughput are different races, and most models are optimized for one or the other. The real engineering decision isn't "which model is fastest." It's "which kind of fast do I need?"
If I had to make one prediction: the gap between TTFT leaders and throughput leaders will narrow over the next 6-12 months as providers optimize their inference stacks. Anthropic is clearly investing in TTFT (Haiku's consistency is remarkable). Google is betting on raw throughput (Gemini Flash's token generation is in a different tier). And OpenAI seems stuck in the middle, with Mini failing to deliver on its speed promise.
The real question is whether any of them break the 200ms TTFT barrier at scale. That's the threshold where LLM responses would feel truly instantaneous. We're not there yet. But 597ms is a lot closer than the 2-3 seconds we were seeing a year ago.
Benchmark your own workloads. Don't trust anyone else's numbers. Including mine.


