Allocators: Memory Retention and Performance

This post explains how the memory allocator — a library you've probably never configured — can silently inflate your memory footprint and your bills.

Your application layer frees almost all the memory it allocates, but you still see a staircase in your RAM monitoring graph. That's because the freed memory is not immediately returned to the operating system: it can stay readily available to your application code for longer than you may expect, which is what makes the allocator choice matter here. To show this, we ran one identical Python web application with 6 different allocators. The leanest settled at 1099 MB; the heaviest peaked at 2306 MB — more than double, while latency stayed similar across all.

1 · What an allocator actually does

Your program constantly needs scratch memory: a list grows, a request gets parsed, a buffer fills. When it needs memory it calls malloc; when it's done it calls free. But malloc and free don't talk to the operating system on every call: the overhead would be too much.

Instead, a library called the allocator sits between your program and the OS. It grabs memory from the kernel in big chunks, then hands out small pieces from those chunks as your code calls malloc, taking them back into its own pool on free.

Here's the part that surprises people who work several layers above the allocator: when application code frees memory, the allocator can still keep it. It holds the freed space so it can satisfy your next malloc instantly, rather than handing it back to the kernel. So a long-running service can free nearly everything and still look "fat" to the OS: the memory is sitting idle inside the allocator's pool, reserved for reuse.

Every program already has an allocator whether you chose one or not. On a typical Linux Python image it's glibc's built-in ptmalloc2. You can swap it for another — jemalloc, tcmalloc, mimalloc — by setting one environment variable (LD_PRELOAD) at startup. No code changes. That's what makes this measurable: same binary, same traffic, only the allocator changes.
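As a minimal sketch of the swap described above, the helper below builds a process environment with LD_PRELOAD pointing at a chosen allocator. The library paths and the `env_for` name are hypothetical; real paths depend on the distro and on how each allocator was installed.

```python
import os

# Hypothetical install paths: the real ones depend on the distro and package.
ALLOCATOR_LIBS = {
    "glibc": None,  # stock allocator, nothing to preload
    "jemalloc": "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2",
    "tcmalloc": "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4",
    "mimalloc": "/usr/lib/x86_64-linux-gnu/libmimalloc.so.2",
}


def env_for(allocator, base_env=None):
    """Return a copy of the environment with LD_PRELOAD set for `allocator`."""
    env = dict(os.environ if base_env is None else base_env)
    lib = ALLOCATOR_LIBS[allocator]
    if lib is None:
        # Stock glibc: make sure no other allocator is preloaded.
        env.pop("LD_PRELOAD", None)
    else:
        env["LD_PRELOAD"] = lib
    return env
```

The result can be passed as the `env` argument of `subprocess.Popen` when starting the server, so the same command line runs under each allocator in turn.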

The one idea to hold onto. Freeing memory in your code does not mean giving it back to the OS. The allocator decides that, on its own schedule, by its own policy. Different allocators make that call differently — and that difference is what we're about to measure.

2 · Why memory climbs: fragmentation

If the allocator just kept a flat pool and reused it, footprint would plateau and we'd have no story. The reason it climbs upward instead is fragmentation.

Picture the allocator's pool as a long shelf, filled left to right. Some buffers are short-lived (a parsed request, a serialized response); some stick around for the whole process lifetime (caches, router tables, connection state). When the short-lived ones are freed, they leave gaps — but those gaps sit between long-lived buffers that are still in place.

The allocator usually can only hand a whole region back to the OS when everything in that returnable unit is free.

A shelf the allocator can't give back

Schematic. The freed holes are real free memory, but scattered between live buffers. The allocator can't return a region with even one survivor in it, so the holes stay mapped — counted against your RSS.
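The shelf picture can be turned into a toy simulation. This is a sketch under simplifying assumptions, not a model of any real allocator: the shelf is split into equal regions of equal-size slots, each slot is independently long-lived with a fixed probability, and a region counts as returnable only when it holds no survivor. All names and default values are illustrative.

```python
import random


def simulate_shelf(num_regions=256, slots_per_region=16, long_lived_share=0.1, seed=0):
    """Fill a shelf of regions, free every short-lived slot, count what is returnable.

    Returns (free_slots, returnable_slots): the slots that are free, and the
    subset of them that sit in regions with no survivor left.
    """
    rng = random.Random(seed)
    free_slots = 0
    returnable_slots = 0
    for _ in range(num_regions):
        # True marks a long-lived buffer that survives; False marks one that is freed.
        survivors = [rng.random() < long_lived_share for _ in range(slots_per_region)]
        freed = survivors.count(False)
        free_slots += freed
        if not any(survivors):
            # Only a region with zero survivors can go back to the OS.
            returnable_slots += freed
    return free_slots, returnable_slots
```

With a 10% long-lived share and 16 slots per region, about 90% of slots are expected to be free, but only about 0.9^16 ≈ 18.5% of regions are expected to be survivor-free. The rest of the free space stays mapped.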

This is the RSS staircase: each burst of traffic carves new holes that get pinned under survivors, and total resident memory ratchets up step by step — even though much of it is technically free. The workload in this study is built to provoke exactly this: a /churn endpoint that keeps a large, bounded set of mixed-size (4–64 KB) buffers alive and frees a random subset on each call, so freed holes are reliably trapped beneath survivors. That's the fragmentation engine the allocators have to cope with.
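A rough sketch of a churn pattern like the one described above: a bounded live set of 4–64 KB buffers, with a random subset freed and replaced on each call. The class name, the live-set bound, and the freed fraction are assumptions for illustration, not the study's actual parameters.

```python
import random


class ChurnSet:
    """Bounded live set of mixed-size buffers; each call frees a random subset."""

    def __init__(self, max_live=1000, min_kb=4, max_kb=64, seed=0):
        self.max_live = max_live
        self.min_kb = min_kb
        self.max_kb = max_kb
        self.rng = random.Random(seed)
        self.live = []

    def churn(self, free_fraction=0.25):
        # Free a random subset of survivors, leaving holes between the rest.
        self.rng.shuffle(self.live)
        keep = int(len(self.live) * (1 - free_fraction))
        del self.live[keep:]
        # Refill up to the bound with fresh mixed-size buffers.
        while len(self.live) < self.max_live:
            size = self.rng.randint(self.min_kb, self.max_kb) * 1024
            self.live.append(bytearray(size))
        return len(self.live)
```

Because the new buffers rarely match the sizes of the holes left behind, repeated calls tend to scatter survivors across many regions, which is the condition the shelf picture describes.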

Anonymous memory over time — stock defaults · the RSS staircase

Series: mimalloc-bun, mimalloc-v3, glibc, tcmalloc, jemalloc, mimalloc-v2

Every series is one allocator under the identical 200 req/s load, sampled once a second. They all climb the same staircase as fragmentation traps freed holes under survivors, then diverge: mimalloc-v2 (top) keeps the most, the modern mimalloc forks (bottom) the least. The shaded IDLE band on the right is where background return timers fire — watch jemalloc and mimalloc-v2 shed their bulge while glibc barely moves.

Questions we ask:

- Given a workload that fragments, how much does each allocator's restock-or-refund policy actually cost in resident memory?

- Does buying lower memory cost you anything in speed?

3 · The experiment

One synthetic FastAPI web service, four uvicorn workers, no per-request recycling (recycling would mask the very fragmentation we're measuring). It's hit with a constant 200 req/s of mixed traffic for ten minutes, then left to idle. The only thing that changes between runs is the allocator.

The six allocators

- glibc ptmalloc2
- jemalloc 5.3
- tcmalloc (gperftools)
- mimalloc v2
- mimalloc v3
- mimalloc v3 bun fork

What's measured

Anonymous memory from memory.stat — the only component the allocator governs. (memory.current mixes in page cache and would mask the effect.)
p50/p95/p99 latency and throughput from vegeta.
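For reference, a small sketch of how the anonymous-memory counter could be read. It assumes cgroup v2, where memory.stat is a text file of `key value` lines and the `anon` line is in bytes; the function name is illustrative.

```python
def anon_bytes(memory_stat_text):
    """Return the `anon` counter (bytes) from the text of a cgroup v2 memory.stat."""
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "anon":  # exact match, so anon_thp and similar keys are skipped
            return int(value)
    raise KeyError("no 'anon' line in memory.stat")
```

The input would be the contents of the container's memory.stat file, whose path depends on how the cgroup is mounted and named.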

Peak vs. settled

Peak = high-water mark during load.
Settled = anonymous memory after a 120 s idle phase, once the allocator's background return timers have had a chance to fire.
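The two definitions can be written down as a short sketch over a once-per-second series of readings. Taking the last idle sample as "settled" is an assumption made here for illustration; the function name and signature are hypothetical.

```python
def peak_and_settled(samples, load_seconds, idle_seconds=120):
    """samples: one anonymous-memory reading (MB) per second, load phase then idle phase."""
    load = samples[:load_seconds]
    idle = samples[load_seconds:load_seconds + idle_seconds]
    if not load or not idle:
        raise ValueError("need samples from both the load and the idle phase")
    peak = max(load)    # high-water mark during load
    settled = idle[-1]  # last reading of the idle phase
    return peak, settled
```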

Confounds pinned

The big memory confounds are held constant:

- swap off
- transparent huge pages off
- CPU governor fixed
- load generator pinned to separate cores
- page cache dropped between runs

A note on "settled." We also fire a forced memory.reclaim at the end, but with swap off it returns essentially nothing — the kernel can't evict dirty anonymous pages with nowhere to put them. So the real normalizer is the idle phase: memory is returned before the reclaim call, either during load (by proactive allocators) or during idle (by decay timers). "Settled" means "after the allocator's own return mechanisms have run," not "after we forced a reclaim." That distinction matters, and getting it wrong is an easy methodology trap.

4 · The headline: six allocators ranked

Here is the result with each allocator on its stock defaults, sorted leanest-settled first. Faint bar = peak during load; solid bar = settled after idle.

Anonymous memory — stock defaults · peak → settled (MB)

Rank	Allocator	Settled MB	Peak MB
1	mimalloc-bun	1099	1282
2	mimalloc-v3	1101	1382
3	glibc	1209	1387
4	tcmalloc	1213	1216
5	jemalloc	1514	1897
6	mimalloc-v2	1711	2306

Solid = settled (after 120 s idle); faint = peak (during load). Bars scaled to the largest value (mimalloc-v2 peak, 2306 MB). The settled spread runs from 1099 to 1711 MB — a 612 MB difference produced by nothing but the choice of allocator.

Interpretation:

1. The leanest settled allocators are the modern mimalloc forks (bun and v3, ~1100 MB), with the default glibc surprisingly competitive (1209).
2. The bars tell a richer story than a single ranking: look at how differently each one moves from peak to settled. tcmalloc's two bars are almost identical (1216 → 1213) — it barely grabs more than it keeps.
3. jemalloc and mimalloc-v2 have a wide gap — they spike high under load, then claw a lot back during idle.

- the modern mimalloc forks (v3, bun) are the leanest
- tcmalloc is nearly as lean and barely peaks under this load
- mimalloc v2 retains far more than the others here, consistent with the large v2 → v3 memory improvement

5 · Retain vs. return: why they differ

Every allocator faces a version of the choice from §1: when memory is freed, keep it nearby for reuse, or make some of it reclaimable by the OS? Returning memory sounds like the tidy default. It isn't free:

It usually requires a syscall, such as madvise to discard physical backing from an existing mapping, or munmap to remove a mapping entirely.
If that memory is touched again after being discarded, the process pays for a page fault, and the kernel may need to provide a freshly zeroed page.
Changing mappings can require TLB invalidation on CPUs using the same address space — a real cross-core cost on a busy box.
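The first two costs can be observed from Python on Linux. The sketch below is not how any allocator is implemented; it only shows the kernel behavior the list describes. It assumes Python 3.8+ (for `mmap.madvise`) and a private anonymous mapping, and it returns None on platforms where the demonstration does not apply.

```python
import mmap
import sys


def discard_and_retouch(size=1 << 20):
    """Write to a private anonymous mapping, discard it, then read it back.

    Returns the first byte after the discard, or None where the demo does not apply.
    """
    if not sys.platform.startswith("linux") or not hasattr(mmap, "MADV_DONTNEED"):
        return None
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xff" * size          # touch every page so it is resident
        buf.madvise(mmap.MADV_DONTNEED)  # drop the physical backing, keep the mapping
        return buf[0]                    # page fault; Linux supplies a zero-filled page
    finally:
        buf.close()
```

The mapping stays valid after the discard, but the old contents are gone: the read triggers a page fault and sees a zero byte instead of 0xff.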

Retained memory avoids most of that: already mapped, already faulted in, ready for reuse. So retention is a speed bet. Allocators such as jemalloc and tcmalloc were designed for allocation-heavy server workloads where malloc and free can sit on hot paths.

The cleanest way to think about retained memory is: it's a cache. With RAM to spare, it's useful headroom: memory the allocator can reuse without going back through the kernel. Under a hard container limit or on a packed machine, that same cache can become a liability: pages kept for future speed still count against your RSS until the allocator returns them.

The peak-to-settled gap is each allocator's fingerprint

That single policy explains the shapes in the chart above. Reading the gap between each allocator's two bars:

tcmalloc	Tight and proactive. Returns pages aggressively during load, so peak ≈ settled (1216 → 1213) — a flat plateau and the lowest peak of all six. It never builds the bulge in the first place.
jemalloc	The classic retain-then-decay. Highest non-mimalloc peak (1897), but its background decay timers return a big chunk during idle, settling to 1514. The bulge is real but temporary.
glibc	Returns only ~13% (1387 → 1209). Its `trim` mechanism can only shrink the top of the heap; the mid-heap holes the churn workload creates are structurally unreturnable by any glibc setting — an architectural limit, not something a knob can fix.
mimalloc	Purges (decommits) freed pages on a timer by default; the delay is version-dependent — 10 ms in v2, 1000 ms in v3. The lineage story is below.
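The gap can be put on one scale by expressing it as a share of the peak. The sketch below uses the peak and settled figures from the stock-defaults results; treating the whole difference as "returned" is a simplification, and the names are illustrative.

```python
# Peak and settled anonymous memory (MB) from the stock-defaults results.
RESULTS = {
    "mimalloc-bun": (1282, 1099),
    "mimalloc-v3": (1382, 1101),
    "glibc": (1387, 1209),
    "tcmalloc": (1216, 1213),
    "jemalloc": (1897, 1514),
    "mimalloc-v2": (2306, 1711),
}


def returned_share(peak, settled):
    """Fraction of the peak that was handed back by the end of the idle phase."""
    return (peak - settled) / peak


for name, (peak, settled) in RESULTS.items():
    share = returned_share(peak, settled)
    print(f"{name:13s} {peak:5d} -> {settled:5d}  returned {share:5.1%}")
```

On this measure tcmalloc gives back well under 1% because it never built a bulge, glibc gives back about 13%, and jemalloc and mimalloc-v2 give back a fifth to a quarter of their peaks.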

The mimalloc lineage is a real, large improvement

The three mimalloc builds are the same allocator family at three points in its evolution, and the progression is unambiguous:

mimalloc-v2 · worst tier

2306 peak / 1711 settled. The original v2 design fragments hardest of all six here.

mimalloc-v3 · the rewrite

1382 peak / 1101 settled. Microsoft's v3 rewrite cuts the peak by about 40% (2306 → 1382) and brings settled into the leading tier.

mimalloc-bun · leanest

1282 peak / 1099 settled. Bun's fork of v3 is the leanest config in the whole study.

In this workload, the peak-to-settled shape is the clearest view of each allocator's return behavior. tcmalloc stayed flat, jemalloc retained memory during load and released a large share during idle, and glibc returned less because many freed holes stayed pinned inside heap regions.

The mimalloc v2 → v3 → bun progression shows a large memory improvement: peak drops by about 40% from v2 to v3 (2306 → 1382), and v3/bun settle in the leanest tier. Retained memory is best treated as a cache: useful with headroom, risky under hard limits.

6 · Why request latency barely moved

Across all six allocators, p99 request latency sat in a flat band of roughly 8–11 ms. That is the important negative result: allocator choice moved resident memory much more than it moved request latency in this service.

The likely reason is that allocator speed is not the dominant cost in this workload. Three layers explain why:

Process-level concurrency. The service uses multiple uvicorn workers, but each worker runs Python bytecode under the GIL, so allocator-heavy multithreaded contention inside a single process is limited.
pymalloc. CPython has its own small-object allocator that intercepts everything up to 512 bytes, so many small allocations never reach the allocator swapped in with LD_PRELOAD.
Endpoint work. The handlers spend enough time parsing and processing JSON/XML that allocator speed is not a visible part of p99 request latency here.
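To get a feel for the pymalloc point, the rough check below compares an object's reported size with the 512-byte limit mentioned above. It is an approximation that assumes CPython: `sys.getsizeof` covers only the object itself, and some containers keep their payload in a separate allocation. The helper name is hypothetical.

```python
import sys

# pymalloc's small-object limit in CPython, as stated above.
PYMALLOC_MAX_BYTES = 512


def likely_served_by_pymalloc(obj):
    """Rough check: is the object's own allocation small enough for pymalloc?"""
    return sys.getsizeof(obj) <= PYMALLOC_MAX_BYTES
```

A small integer or a short string passes the check, while a 4–64 KB buffer like those in the churn workload does not, so those larger allocations are the ones that reach the preloaded allocator.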

That makes this study useful for comparing allocator-driven resident memory under this workload, but not a general allocator throughput benchmark. To expose raw malloc / free speed differences, allocations need to sit directly on the hot path.

7 · C++ control: touched churn throughput

To separate the allocator from the Python runtime, we reran the same six allocators in a bare C++ program: no GIL, no pymalloc, one process, four threads, and the same churn pattern sized to match the service's live set. This is not a pure malloc/free loop. Each allocation is touched page by page, one random survivor is freed, and the live set stays bounded. The metric is churn-with-page-touch throughput.

Memory collapses into a tight band. All six land between 718 and 760 MB — a 42 MB spread, versus the 600+ MB spread the service showed. The dumbbell below puts the two side by side: the hollow dot is the bare-C++ footprint, the filled dot is the service footprint, and the amber gap is mostly the added footprint from the runtime and mixed request workload.

Memory: bare C++ ○ → CPython service ● · settled MB · the gap is mostly runtime + mixed workload

Allocator	Bare C++ settled MB	Service settled MB	Gap MB
glibc	718	1209	+491
tcmalloc	728	1213	+485
mimalloc-bun	745	1099	+354
mimalloc-v3	745	1101	+356
jemalloc	760	1514	+754
mimalloc-v2	742	1711	+969

○ hollow = bare C++ settled · ● filled = CPython service settled · amber gap = runtime + mixed workload footprint

In bare C++ the six allocators cluster at 718–760 MB; the same six in the service fan out across 1099–1711 MB. Most of the service's allocator-to-allocator memory divergence is created by the runtime and mixed workload, not by the bare churn allocator. (Caveat: the C++ control is churn-only — it omits the JSON/XML transients — so it isolates the churn allocator, not the full workload.)

In the C++ control, throughput spreads out. Three repeats kept the ordering stable, with max-min spreads under 2% for every allocator:

Bare C++ churn throughput · million allocations per second

Rank	Allocator	M alloc/s
1	tcmalloc	11.57
2	glibc	9.83
3	mimalloc-v2	9.44
4	mimalloc-v3	7.51
5	mimalloc-bun	7.40
6	jemalloc	4.96

Open-loop C++ control, mean of K=3 repeats. Each operation allocates mixed-size buffers, touches every page, frees one survivor, and keeps a bounded live set. tcmalloc runs at 11.57 M alloc/s and jemalloc at 4.96, a 2.3× spread. The mimalloc lineage is stable across reps: v2 9.44 → v3 7.51 → bun 7.40. That ordering is a property of this workload, not a claim about intrinsic allocator speed.

In the CPython service, allocator choice moved memory far more than p99 request latency.
In the bare C++ touched-churn control, allocator choice moved throughput far more than settled memory.
The practical lesson is workload scope: this benchmark measures a touched, fragmenting allocation loop, not pure allocator speed.

8 · The full data table

The six allocators on their stock defaults — every library running its own design philosophy, which is the only fair stock-vs-stock comparison. Sorted by settled, leanest first.

Allocator	Peak MB	Settled MB	p99 ms	C++ M/s
mimalloc-bun	1282	1099	9	7.40
mimalloc-v3	1382	1101	8	7.51
glibc	1387	1209	8	9.83
tcmalloc	1216	1213	8	11.57
jemalloc	1897	1514	11	4.96
mimalloc-v2	2306	1711	10	9.44

"C++ M/s" is the bare-C++ churn throughput from §7 (million allocations per second). The service columns are single-run, so read small gaps as ties, not rankings: the 2 MB between mimalloc-bun and v3 settled, or 8 vs 9 ms p99 (a single step in whole-millisecond figures).

Why defaults only. Each allocator exposes a knob to return memory to the OS more aggressively — but they are not the same knob: the lever, its magnitude, and even its direction differ. glibc and jemalloc retain by default, so their knob lowers RSS; mimalloc purges by default, so its comparable knob — the purge delay — runs the other way: lengthening it makes mimalloc retain more. Comparing the adjusted numbers would rank moves that point different ways, so we leave per-allocator knob-twiddling out and report each allocator exactly as it ships.

9 · What this means for you

Picking an allocator: choose by your constraint

There's no single winner, because peak and settled answer different questions:

If your limit is sized for the busy moment (a container memory cap, an autoscaler threshold), peak is what kills you. Favor a proactive returner — tcmalloc had the lowest peak (1216) and barely bulges.
If you care about steady footprint between bursts (density, cost per idle instance), settled is the number. The mimalloc v3/bun forks lead (~1100 MB).
If you can't change the allocator at all, the default glibc is more competitive than its reputation (1209 settled) — its weakness is that it structurally can't return mid-heap holes, not that it's wasteful by default.

Final words:

Remember that retained memory is a cache that flips to a liability when constrained. Most deployments are constrained because RAM is expensive. Long-running processes and workers with no built-in recycling are especially exposed to fragmentation, and choosing the right allocator and configuration can help.

This is why container deployments tune memory down — capping glibc's arena count with MALLOC_ARENA_MAX, lowering its MALLOC_TRIM_THRESHOLD_, and shortening jemalloc/mimalloc decay timers: they trade a slice of allocation speed — which did not show up in p99 for this service — to keep the footprint under the limit.
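A sketch of what such tuning can look like as environment variables. The variable names are the ones each library reads; the values are illustrative assumptions only and were not measured in this study. The `tuned_env` helper is hypothetical.

```python
# Illustrative values only: none of these settings were measured in this study.
TUNING_ENV = {
    "glibc": {
        "MALLOC_ARENA_MAX": "2",            # cap the number of malloc arenas
        "MALLOC_TRIM_THRESHOLD_": "65536",  # trim the heap top once 64 KiB is free
    },
    "jemalloc": {
        # Decay dirty pages after about 1 s (the jemalloc default is 10 s).
        "MALLOC_CONF": "dirty_decay_ms:1000",
    },
    "mimalloc": {
        "MIMALLOC_PURGE_DELAY": "0",        # purge freed pages without delay
    },
}


def tuned_env(allocator, base_env):
    """Return a copy of base_env with the illustrative knobs for `allocator` added."""
    env = dict(base_env)
    env.update(TUNING_ENV.get(allocator, {}))
    return env
```

Any such setting is worth checking against both peak and settled memory and against p99 under the real workload, since the knobs differ in lever, magnitude, and direction from one allocator to the next.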