jemalloc vs malloc vs tcmalloc: Why Your Server's Default Allocator Is Killing P99 Latency
A few months ago, I was chasing a P99 latency spike on a multi-threaded service handling roughly 40,000 requests per second. The flame graphs pointed at an unusual suspect: malloc. Not a slow database query. Not a network timeout. The standard glibc memory allocator was holding a global lock, and threads were lining up behind it like cars at a single-lane toll booth.

I swapped in jemalloc with a single LD_PRELOAD change. P99 dropped 35%. No code changes, no architecture redesign. Just a better allocator.
This is one of those things where the boring answer is actually the right one. Most engineers never think about their memory allocator. They shouldn't have to. But if you're running multi-threaded server workloads at any real scale, the default allocator is leaving performance on the table.
The Problem With glibc malloc
glibc's malloc implementation (based on ptmalloc2) was designed when "multi-threaded" meant 2-4 threads. Its architecture relies on arenas, but the number of arenas is limited and the locking strategy is coarse-grained. When you have 32, 64, or 128 threads all allocating and freeing memory concurrently, threads contend on arena locks. That contention shows up directly as latency.

The real killer isn't average throughput. glibc malloc handles averages fine. The problem is the tail. Under contention, a thread can stall for hundreds of microseconds waiting for a lock. For a service with a 10ms P50, that pushes P99 to 50ms or worse. I've seen this pattern across multiple services over the years. The flame graph always tells the same story: __lll_lock_wait sitting near the top.
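If you want to check whether your own service is hitting this, a quick way is to sample stacks with perf and look for the lock-wait symbol. A sketch, assuming perf is installed and you have permission to profile; the PID variable is a placeholder for your service's process ID:

```shell
# Sample call stacks from the running service for 30 seconds.
# $SERVICE_PID is a placeholder; substitute your process's PID.
perf record -g -p "$SERVICE_PID" -- sleep 30

# If __lll_lock_wait shows up under malloc/free paths in the report,
# that's glibc arena lock contention surfacing as latency.
perf report --stdio | grep -B2 -A2 __lll_lock_wait
```

This is an operational snippet rather than something to run blindly; profiling overhead is small but nonzero, so sample on one instance first.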
Then there's fragmentation. glibc's allocator is bad at returning memory to the OS for long-running processes. Over days or weeks, your service's RSS creeps upward even though its actual working set hasn't changed. This is especially painful for services running in containers with hard memory limits. You get OOM kills that look like memory leaks but are actually fragmentation. I've burned entire afternoons chasing those ghosts.
How jemalloc Solves This
Jason Evans created jemalloc in 2005 to solve exactly these problems for FreeBSD. The original paper laid out two core goals: scalable concurrency and fragmentation avoidance. Nearly two decades later, those goals remain the reason jemalloc exists.

The architecture is straightforward. Per the jemalloc documentation, jemalloc creates multiple arenas by default, typically 4x the number of CPU cores. Threads are assigned to arenas round-robin, so on a 16-core machine you get 64 arenas. The probability of two threads contending on the same arena drops dramatically.
But the piece that actually matters most is the thread-specific cache, or tcache. Each thread gets its own small cache of recently freed memory. When a thread needs to allocate a small object (the vast majority of allocations in most server workloads), it pulls from its own tcache. No lock. No arena access. The allocation is essentially a pointer bump. This eliminates locking for a huge percentage of allocation requests.
The best lock is the one you never have to acquire.
jemalloc also uses size classes chosen to minimize internal fragmentation. Instead of rounding up to the nearest power of two (which wastes up to 50% of memory), jemalloc uses a more granular set. The result: long-running services maintain a more predictable memory footprint.
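To make that concrete, here's the arithmetic for a single 65-byte request. The 80-byte class comes from jemalloc's documented small-size-class table; exact classes can vary by version and page size:

```shell
req=65

# Power-of-two rounding: a 65-byte request lands in a 128-byte class.
pow2=128
echo "pow2 waste: $(( pow2 - req )) bytes ($(( (pow2 - req) * 100 / pow2 ))%)"
# → pow2 waste: 63 bytes (49%)

# jemalloc's finer-grained classes include an 80-byte class.
jm=80
echo "jemalloc waste: $(( jm - req )) bytes ($(( (jm - req) * 100 / jm ))%)"
# → jemalloc waste: 15 bytes (18%)
```

Multiply that per-allocation difference across millions of live objects and it explains a lot of the RSS gap between the two allocators.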
Meta develops and maintains jemalloc, running it across their infrastructure from memcached and mcrouter to HHVM. Redis uses jemalloc as its default allocator. FreeBSD ships with it. Firefox relied on it for years. When that many production systems depend on a library, it's been stress-tested in ways that synthetic benchmarks never capture.
Where tcmalloc Fits In
Google's tcmalloc (Thread-Caching Malloc) attacks the concurrency problem from a similar angle: per-thread caches backed by a tiered allocation strategy. The same principle that governs tail latency everywhere else applies here too: reducing contention at the allocation layer has cascading effects on P99.
tcmalloc is excellent. For raw allocation throughput in micro-benchmarks, it often matches or beats jemalloc. Google runs it across most of their C++ infrastructure, and the engineering quality shows.
Here's the tradeoff though. tcmalloc holds onto memory more aggressively. In my experience, services running tcmalloc carry a higher steady-state RSS compared to the same workload on jemalloc. For Google, where services run on Borg with sophisticated resource management, this is fine. For teams running in Kubernetes with fixed memory limits, that extra RSS can mean the difference between a stable service and OOM kills at 3 AM.
Fragmentation over time also favors jemalloc for long-running processes. tcmalloc's size classes and page management are optimized for throughput; jemalloc's are optimized for memory efficiency over time. Both are valid priorities.
Here's how I think about the decision:
- glibc malloc: Fine for single-threaded or low-concurrency workloads. Already there. Don't overthink it.
- jemalloc: Best for multi-threaded servers where you care about tail latency AND memory efficiency. The default choice for long-running services, in my opinion.
- tcmalloc: Best when raw allocation throughput is your primary concern and you have real memory management at the orchestration layer. Not many teams outside Google actually do.
The Numbers That Actually Matter
I've shipped enough services to know that micro-benchmarks lie. A benchmark that allocates and frees millions of 64-byte objects in a tight loop tells you something about allocator overhead, but almost nothing about how your service behaves in production.
What matters for server workloads:
P99 latency under contention. This is where jemalloc wins decisively. When 64 threads are hammering allocations concurrently, glibc's P99 allocation time can be 10-50x higher than jemalloc's. The arena multiplier and tcache design effectively kill the long tail.
RSS stability over time. Run your service for 72 hours under realistic load. Check RSS. With glibc, I've seen RSS grow 30-40% above the actual working set due to fragmentation. jemalloc typically stays within 10-15%. tcmalloc lands somewhere in between, depending on allocation patterns.
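A simple way to collect that data is to log VmRSS from /proc on an interval and plot it afterward. A minimal sketch, Linux-specific; $SERVICE_PID is a placeholder for your service's PID:

```shell
# Log "epoch-seconds rss-in-kB" once a minute while the service is alive.
# Run it for a few days under realistic load, then plot rss.log.
while kill -0 "$SERVICE_PID" 2>/dev/null; do
  printf '%s %s\n' "$(date +%s)" \
    "$(awk '/VmRSS/ {print $2}' "/proc/$SERVICE_PID/status")"
  sleep 60
done >> rss.log
```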
Allocation throughput. All three handle millions of allocations per second per thread. This is rarely your bottleneck. If your benchmarks focus only on throughput, you're measuring the wrong thing.
The profiling story is also worth mentioning. jemalloc ships with built-in heap profiling that you can enable at runtime via mallctl or environment variables. You can inspect fragmentation, allocation patterns, and memory usage without attaching a separate profiler. Having debugged production systems under pressure, I can tell you this kind of introspection is the difference between a 20-minute fix and a 4-hour investigation.
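Enabling that profiling looks roughly like this. Note the caveat: it requires a jemalloc built with --enable-prof, which the stock Debian/Ubuntu package may not be; ./your-server and the option values are illustrative:

```shell
# Sampled heap profiling; lg_prof_sample:19 samples roughly every 512 KiB
# allocated, and prof_final dumps a profile at process exit.
export MALLOC_CONF="prof:true,lg_prof_sample:19,prof_final:true"
./your-server

# Inspect the resulting jeprof.*.heap dump with the bundled jeprof tool.
jeprof --text ./your-server jeprof.*.heap
```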
How to Actually Switch
The barrier to trying jemalloc is close to zero. On most Linux systems, you can test it without recompiling anything.
Install the package (apt install libjemalloc2 on Debian/Ubuntu), then launch your service with LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2. That's it. Every malloc call in your process now goes through jemalloc.
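End to end, the experiment looks like this; the library path is for Debian/Ubuntu on x86_64 (adjust for your distro), and ./your-server is a placeholder:

```shell
sudo apt install libjemalloc2

# Interpose jemalloc over glibc malloc; no rebuild required.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./your-server &

# Verify it actually loaded by checking the process's mapped libraries.
grep jemalloc "/proc/$!/maps" && echo "jemalloc is live"
```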
For a permanent switch, link against it at build time with -ljemalloc. One-line change in most build systems.
A few things I've learned the hard way:
- Tune the arena count. The default (4x CPUs) works for most cases, but if you have hundreds of threads, bump it via MALLOC_CONF. I had to do this on a service with 200+ threads before I saw the full benefit.
- Enable background threads. jemalloc can return memory to the OS asynchronously with background_thread:true in MALLOC_CONF. Helps RSS stability without touching request latency.
- Don't mix allocators. If you LD_PRELOAD jemalloc but some shared library statically links its own allocator, you'll have a bad time. Check your dependencies. I learned this one the painful way.
- Profile first. jemalloc's built-in profiling shows you exactly where your memory is going. Use it before and after the switch.
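Putting the first three tips together in shell form; the option values are illustrative starting points rather than recommendations, and ./your-server is a placeholder:

```shell
# More arenas for a heavily threaded service, plus a background thread
# that returns dirty pages to the OS asynchronously.
export MALLOC_CONF="narenas:128,background_thread:true"

# Before trusting LD_PRELOAD, check whether any dynamic dependency pulls
# in its own allocator; a stray tcmalloc or mimalloc here is a red flag.
ldd ./your-server | grep -Ei 'tcmalloc|jemalloc|mimalloc'
```

Note that ldd only reveals dynamically linked allocators; a statically linked one won't show up, which is exactly why that failure mode is painful.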
For those working on performance-sensitive AI inference stacks, the allocator choice matters even more. LLM serving involves massive, concurrent tensor allocations across threads. That's exactly the workload where jemalloc's design pays the biggest dividends.
Stop Ignoring Your Allocator
Memory allocation is infrastructure. Like DNS or TLS termination, it's invisible until it isn't. The difference is that swapping your allocator is one of the highest-leverage, lowest-risk performance changes you can make to a server application.
If you're running multi-threaded services on glibc malloc and you haven't benchmarked an alternative, you're leaving 10-30% of your tail latency on the table. That's not a guess. That's a pattern I've seen across half a dozen services over the past few years.
jemalloc won't fix algorithmic inefficiencies or bad architecture. But it will stop your memory allocator from being the thing that wakes you up at night. We obsess over shaving milliseconds off database queries and network hops while ignoring the allocator layer entirely. That's the easiest win most teams never pick up.
Try it on one service. Measure before and after. Let the numbers decide.

