This post explains how the memory allocator — a library you've probably never configured — can silently inflate your memory footprint and your bills.
Your application layer frees almost all the memory it allocates, but you still see a staircase in your RAM monitoring graph. That's because the freed memory is not immediately returned to the operating system: it can stay readily available to your application code for longer than you may expect, which is what makes the allocator choice matter here. To show this, we ran one identical Python web application with 6 different allocators. The leanest settled at 1099 MB; the heaviest peaked at 2306 MB — more than double, while latency stayed similar across all.
1 · What an allocator actually does
Your program constantly needs scratch memory: a list grows, a
request gets parsed, a buffer fills. When it needs memory it
calls malloc; when it's done it calls
free. But malloc and
free don't talk to the operating system on every
call, because the overhead would be too much.
Instead, a library called the
allocator sits between your program and
the OS. It grabs memory from the kernel in big chunks, then
hands out small pieces from those chunks as your code calls
malloc, taking them back into its own pool on
free.
Here's the part that surprises people who never deal with
allocators directly:
when application code frees memory, the allocator can still
keep it.
In that case, it holds the freed space to satisfy your next
malloc instantly, rather than handing it back to
the kernel. So a long-running service can free nearly everything
and still look "fat" to the OS — the memory is sitting idle
inside the allocator's pool, reserved for reuse.
Every program already has an allocator whether you chose one or
not. On a typical Linux Python image it's
glibc's built-in ptmalloc2. You
can swap it for another — jemalloc, tcmalloc, mimalloc — by
setting one environment variable (LD_PRELOAD) at
startup. No code changes. That's what makes this measurable:
same binary, same traffic, only the allocator changes.
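As a minimal sketch of that swap, the helper below builds an environment with `LD_PRELOAD` set and launches the same interpreter under it. The library path is a hypothetical example (it varies by distro and package), and `env_with_allocator` is an illustrative name, not part of the study's tooling.

```python
import os
import subprocess
import sys


def env_with_allocator(lib_path, base_env=None):
    """Return a copy of the environment with LD_PRELOAD pointing at lib_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["LD_PRELOAD"] = lib_path
    return env


# Hypothetical path: the real location depends on your distro and package.
JEMALLOC = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"

if __name__ == "__main__":
    if os.path.exists(JEMALLOC):
        # Same interpreter, same code; only the preloaded allocator differs.
        subprocess.run(
            [sys.executable, "-c", "print('started with a preloaded allocator')"],
            env=env_with_allocator(JEMALLOC),
            check=True,
        )
    else:
        print("library not found; adjust JEMALLOC for your system")
```

The variable has to be in place before the process starts, which is why the sketch sets it on the child's environment rather than changing `os.environ` in a running interpreter.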
2 · Why memory climbs: fragmentation
If the allocator just kept a flat pool and reused it, footprint would plateau and we'd have no story. The reason it climbs upward instead is fragmentation.
Picture the allocator's pool as a long shelf, filled left to right. Some buffers are short-lived (a parsed request, a serialized response); some stick around for the whole process lifetime (caches, router tables, connection state). When the short-lived ones are freed, they leave gaps — but those gaps sit between long-lived buffers that are still in place.
The allocator can usually hand a region back to the OS only when everything in that returnable unit is free.
A shelf the allocator can't give back
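A toy model can make the shelf concrete. The sketch below is not how any real allocator lays out memory. It assumes a pool of fixed-size pages with equal-size slots, marks a random fraction of slots as long-lived, frees everything else, and compares how much is free with how much is returnable. All parameters are illustrative assumptions.

```python
import random


def simulate(pages=1000, slots_per_page=16, long_lived_ratio=0.1, seed=0):
    """Return (free slot fraction, fully free page fraction) after a burst."""
    rng = random.Random(seed)
    # True marks a slot holding a long-lived buffer that survives the burst.
    pool = [
        [rng.random() < long_lived_ratio for _ in range(slots_per_page)]
        for _ in range(pages)
    ]
    total_slots = pages * slots_per_page
    free_slots = sum(row.count(False) for row in pool)
    # A page is returnable only if no survivor sits anywhere on it.
    returnable_pages = sum(1 for row in pool if not any(row))
    return free_slots / total_slots, returnable_pages / pages


if __name__ == "__main__":
    free_fraction, returnable_fraction = simulate()
    print(f"free: {free_fraction:.0%}, returnable: {returnable_fraction:.0%}")
```

With 10% survivors spread over 16 slots per page, about 90% of the pool is free, yet the chance that a given page has no survivor is only about 0.9 ** 16, roughly 19%. The rest is free but pinned.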
This is the RSS staircase: each burst of
traffic carves new holes that get pinned under survivors, and
total resident memory ratchets up step by step — even though
much of it is technically free. The workload in this study is
built to provoke exactly this: a /churn endpoint
that keeps a large, bounded set of mixed-size (4–64 KB)
buffers alive and frees a random subset on each call, so freed
holes are reliably trapped beneath survivors. That's the
fragmentation engine the allocators have to cope with.
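The churn pattern described above can be sketched in plain Python. This is an illustration under assumptions, not the study's endpoint: the class name, the live-set bound, and the freed fraction are all hypothetical, and only the 4–64 KB size range comes from the text.

```python
import random


class ChurnSet:
    """Bounded set of mixed-size buffers; each call frees a random subset and refills."""

    def __init__(self, max_live=256, min_kb=4, max_kb=64, seed=None):
        self.rng = random.Random(seed)
        self.max_live = max_live  # assumed bound on live buffers
        self.min_kb = min_kb
        self.max_kb = max_kb
        self.live = []

    def _new_buffer(self):
        size = self.rng.randint(self.min_kb, self.max_kb) * 1024
        return bytearray(size)

    def churn(self, free_fraction=0.3):
        # Drop a random subset; the survivors keep their place in the heap.
        self.live = [b for b in self.live if self.rng.random() >= free_fraction]
        # Refill up to the bound with freshly sized buffers.
        while len(self.live) < self.max_live:
            self.live.append(self._new_buffer())
        return len(self.live)
```

Because sizes are mixed, a freed hole often cannot be reused exactly by the next buffer, which is what keeps new allocations spilling into fresh memory while old holes stay pinned.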
Anonymous memory over time — stock defaults · the RSS staircase
Questions we ask:
- Given a workload that fragments, how much does each allocator's restock-or-refund policy actually cost in resident memory?
- Does buying lower memory cost you anything in speed?
3 · The experiment
One synthetic FastAPI web service, four
uvicorn workers, no per-request recycling
(recycling would mask the very fragmentation we're measuring).
It's hit with a constant 200 req/s of mixed traffic for ten
minutes, then left to idle. The only thing that changes between
runs is the allocator.
The six allocators
- ptmalloc2 (glibc's default)
- jemalloc 5.3
- tcmalloc (gperftools)
- mimalloc v2
- mimalloc v3
- mimalloc v3 bun fork
What's measured
Anonymous memory from the cgroup's memory.stat, the
only component the allocator governs. (memory.current
mixes in page cache and would mask the effect.)
p50/p95/p99 latency and throughput from
vegeta.
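For reference, reading the anonymous-memory counter from a cgroup v2 `memory.stat` file could look like the sketch below. The cgroup directory and the MiB conversion are assumptions; the study's actual sampling script and unit convention are not shown in the post.

```python
from pathlib import Path


def parse_anon_bytes(memory_stat_text):
    """Return the 'anon' counter (bytes) from cgroup v2 memory.stat text."""
    for line in memory_stat_text.splitlines():
        key, _, value = line.partition(" ")
        # Exact match, so related keys such as 'anon_thp' are skipped.
        if key == "anon":
            return int(value)
    raise KeyError("no 'anon' line in memory.stat")


def read_anon_mb(cgroup_dir="/sys/fs/cgroup"):
    """Read anonymous memory in MiB; cgroup_dir is an assumed location."""
    text = Path(cgroup_dir, "memory.stat").read_text()
    return parse_anon_bytes(text) / (1024 * 1024)
```

Sampling this value on a fixed interval through the load and idle phases is enough to derive both a peak and a settled figure.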
Peak vs. settled
Peak = the highest anonymous memory seen during the load phase. Settled = anonymous memory after a 120 s idle phase, once the allocator's background return timers have had a chance to fire.
Confounds pinned
- transparent huge pages off
- CPU governor fixed
- load generator pinned to separate cores
- page cache dropped between runs.
The big memory confounds are held constant
A note on "settled." We also fire a forced
memory.reclaim at the end, but with swap off it
returns essentially nothing — the kernel can't evict dirty
anonymous pages with nowhere to put them. So the real normalizer
is the idle phase: memory is returned
before the reclaim call, either during load (by
proactive allocators) or during idle (by decay timers).
"Settled" means "after the allocator's own return mechanisms
have run," not "after we forced a reclaim." That distinction
matters, and getting it wrong is an easy methodology trap.
4 · The headline: six allocators ranked
Here is the result with each allocator on its stock defaults.
Faint bar = peak during load; solid bar = settled after idle.
Sorted leanest-settled first.
Anonymous memory — stock defaults · peak → settled (MB)
Interpretation:
1. The leanest settled allocators are the modern mimalloc forks
(bun and v3, ~1100 MB),
with the default glibc surprisingly competitive (1209).
2. The bars tell a richer story than a single ranking: look at
how differently each one moves from peak to settled. tcmalloc's
two bars are almost identical (1216 → 1213) — it barely grabs
more than it keeps.
3. jemalloc and mimalloc-v2 have a wide gap — they spike high
under load, then claw a lot back during idle.
- the modern mimalloc forks (v3, bun) are the leanest
- tcmalloc is nearly as lean and barely peaks under this
load
- mimalloc v2 retains far more than the others here,
consistent with the large v2 → v3 memory improvement
5 · Retain vs. return: why they differ
Every allocator faces a version of the choice from §1: when memory is freed, keep it nearby for reuse, or make some of it reclaimable by the OS? Returning memory sounds like the tidy default. It isn't free:
- It usually requires a syscall, such as madvise to discard physical backing from an existing mapping, or munmap to remove a mapping entirely.
- If that memory is touched again after being discarded, the process pays for a page fault, and the kernel may need to provide a freshly zeroed page.
- Changing mappings can require TLB invalidation on CPUs using the same address space, a real cross-core cost on a busy box.
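The discard-then-refault cycle in the list above can be observed from Python on Linux. The sketch below is an illustration, not allocator code: it maps private anonymous memory, dirties it, calls `madvise(MADV_DONTNEED)`, and reads a byte back. The function name is hypothetical, and the zero-fill behavior is Linux-specific, so the function returns `None` elsewhere.

```python
import mmap
import sys


def discard_and_refault(length=4 * mmap.PAGESIZE):
    """Dirty a private anonymous mapping, discard it, and read it back.

    Returns the first byte after the discard, or None where the
    Linux-specific behavior is not available.
    """
    if not sys.platform.startswith("linux") or not hasattr(mmap, "MADV_DONTNEED"):
        return None
    # MAP_PRIVATE matters: Python's default for anonymous maps is MAP_SHARED.
    m = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        m[:] = b"\x01" * length  # fault every page in and dirty it
        m.madvise(mmap.MADV_DONTNEED)  # drop the physical backing, keep the mapping
        return m[0]  # this read faults in a fresh zero-filled page
    finally:
        m.close()


if __name__ == "__main__":
    print(discard_and_refault())
```

On Linux the byte that was written as 1 reads back as 0: the old page is gone, and the read was served by a page fault and a zeroed page, which is the cost retention avoids.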
Retained memory avoids most of that: already mapped, already
faulted in, ready for reuse. So
retention is a speed bet. Allocators such as
jemalloc and tcmalloc were designed for allocation-heavy server
workloads where malloc and free can
sit on hot paths.
The cleanest way to think about retained memory is: it's a cache. With RAM to spare, it's useful headroom: memory the allocator can reuse without going back through the kernel. Under a hard container limit or on a packed machine, that same cache can become a liability: pages kept for future speed still count against your RSS until the allocator returns them.
The peak-to-settled gap is each allocator's fingerprint
That single policy explains the shapes in the chart above. Reading the gap between each allocator's two bars:
| Allocator | Peak → settled behavior |
|---|---|
| tcmalloc | Tight and proactive. Returns pages aggressively during load, so peak ≈ settled (1216 → 1213): a flat plateau and the lowest peak of all six. It never builds the bulge in the first place. |
| jemalloc | The classic retain-then-decay. Highest non-mimalloc peak (1897), but its background decay timers return a big chunk during idle, settling to 1514. The bulge is real but temporary. |
| glibc | Returns only ~13% (1387 → 1209). Its automatic trim mechanism can only shrink the top of the heap; the mid-heap holes the churn workload creates are structurally unreturnable by any glibc setting. That is an architectural limit, not something a knob can fix. |
| mimalloc | Purges (decommits) freed pages on a timer by default; the delay is version-dependent: 10 ms in v2, 1000 ms in v3. The lineage story is below. |
The mimalloc lineage is a real, large improvement
The three mimalloc builds are the same allocator family at three points in its evolution, and the progression is unambiguous:
mimalloc-v2 · worst tier
mimalloc-v3 · the rewrite
mimalloc-bun · leanest
The mimalloc v2 → v3 → bun progression shows a large memory improvement: peak drops by about 40% from v2 to v3 (2306 → 1382 MB), and v3/bun settle in the leanest tier. Retained memory is best treated as a cache: useful with headroom, risky under hard limits.
6 · Why request latency barely moved
Across all six allocators, p99 request latency sat in a flat band of roughly 8–11 ms. That is the important negative result: allocator choice moved resident memory much more than it moved request latency in this service.
The likely reason is that allocator speed is not the dominant cost in this workload. Three layers explain why:
- Process-level concurrency. The service uses multiple uvicorn workers, but each worker runs Python bytecode under the GIL, so allocator-heavy multithreaded contention inside a single process is limited.
- pymalloc. CPython has its own small-object allocator that intercepts everything up to 512 bytes, so many small allocations never reach the allocator swapped in with LD_PRELOAD.
- Endpoint work. The handlers spend enough time parsing and processing JSON/XML that allocator speed is not a visible part of p99 request latency here.
That makes this study useful for comparing allocator-driven
resident memory under this workload, but not a general allocator
throughput benchmark. To expose raw malloc /
free speed differences, allocations need to sit
directly on the hot path.
7 · C++ control: touched churn throughput
To separate the allocator from the Python runtime, we reran the
same six allocators in a bare C++ program: no GIL, no pymalloc,
one process, four threads, and the same churn pattern sized to
match the service's live set. This is not a pure
malloc/free loop. Each allocation is
touched page by page, one random survivor is freed, and the live
set stays bounded. The metric is
churn-with-page-touch throughput.
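The loop shape described here can be sketched in Python to show what "touched churn" means. This is not the study's C++ program: it is single-threaded, runs under the interpreter, and its numbers are not comparable to the results in this post. The page size, live-set bound, and iteration count are assumptions.

```python
import random
import time

PAGE = 4096  # assumed page size in bytes


def touched_churn(iterations=20000, max_live=512, min_kb=4, max_kb=64, seed=0):
    """Allocate, touch each page, free one random survivor; return (allocs/s, live count)."""
    rng = random.Random(seed)
    live = []
    start = time.perf_counter()
    for _ in range(iterations):
        buf = bytearray(rng.randint(min_kb, max_kb) * 1024)
        # Write one byte per page so every page of the buffer is really backed.
        for offset in range(0, len(buf), PAGE):
            buf[offset] = 1
        if len(live) >= max_live:
            # Free one random survivor: swap it with the last entry, then pop.
            victim = rng.randrange(len(live))
            live[victim] = live[-1]
            live.pop()
        live.append(buf)
    elapsed = time.perf_counter() - start
    return iterations / elapsed, len(live)


if __name__ == "__main__":
    rate, live_count = touched_churn()
    print(f"{rate:,.0f} allocations/s with {live_count} live buffers")
```

Touching each page is what separates this from a pure malloc/free loop: the measured rate includes page-fault and zeroing work, not just allocator bookkeeping.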
Memory collapses into a tight band. All six land between 718 and 760 MB — a 42 MB spread, versus the 600+ MB spread the service showed. The dumbbell below puts the two side by side: the hollow dot is the bare-C++ footprint, the filled dot is the service footprint, and the amber gap is mostly the added footprint from the runtime and mixed request workload.
Memory: bare C++ ○ → CPython service ● · settled MB · the gap is mostly runtime + mixed workload
In the C++ control, throughput spreads out. Three repeats kept the ordering stable, with max-min spreads under 2% for every allocator:
Bare C++ churn throughput · million allocations per second
In the bare C++ touched-churn control, allocator choice moved throughput far more than settled memory.
The practical lesson is workload scope: this benchmark measures a touched, fragmenting allocation loop, not pure allocator speed.
8 · The full data table
The six allocators on their stock defaults — every library running its own design philosophy, which is the only fair stock-vs-stock comparison. Sorted by settled, leanest first.
| Allocator | Peak MB | Settled MB | p99 ms | C++ M/s |
|---|---|---|---|---|
| mimalloc-bun | 1282 | 1099 | 9 | 7.40 |
| mimalloc-v3 | 1382 | 1101 | 8 | 7.51 |
| glibc | 1387 | 1209 | 8 | 9.83 |
| tcmalloc | 1216 | 1213 | 8 | 11.57 |
| jemalloc | 1897 | 1514 | 11 | 4.96 |
| mimalloc-v2 | 2306 | 1711 | 10 | 9.44 |
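The peak and settled columns can be turned into a "share returned" figure with a few lines of arithmetic. The sketch below uses only the numbers from the table above; the function and variable names are illustrative.

```python
# (peak MB, settled MB) copied from the table above.
RESULTS = {
    "mimalloc-bun": (1282, 1099),
    "mimalloc-v3": (1382, 1101),
    "glibc": (1387, 1209),
    "tcmalloc": (1216, 1213),
    "jemalloc": (1897, 1514),
    "mimalloc-v2": (2306, 1711),
}


def returned_share(peak_mb, settled_mb):
    """Fraction of peak anonymous memory given back by the time the service settles."""
    return (peak_mb - settled_mb) / peak_mb


for name, (peak, settled) in RESULTS.items():
    share = returned_share(peak, settled)
    print(f"{name:13s} {peak:5d} -> {settled:5d}  returned {share:5.1%}")

lowest_peak = min(RESULTS, key=lambda n: RESULTS[n][0])
lowest_settled = min(RESULTS, key=lambda n: RESULTS[n][1])
print("lowest peak:", lowest_peak, "| lowest settled:", lowest_settled)
```

For glibc this gives (1387 - 1209) / 1387, about 12.8%, which is the ~13% quoted earlier. The last line shows why there is no single ranking: the lowest peak belongs to tcmalloc and the lowest settled to mimalloc-bun.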
9 · What this means for you
Picking an allocator: choose by your constraint
There's no single winner, because peak and settled answer different questions:
- If your limit is sized for the busy moment (a container memory cap, an autoscaler threshold), peak is what kills you. Favor a proactive returner — tcmalloc had the lowest peak (1216) and barely bulges.
- If you care about steady footprint between bursts (density, cost per idle instance), settled is the number. The mimalloc v3/bun forks lead (~1100 MB).
- If you can't change the allocator at all, the default glibc is more competitive than its reputation (1209 settled) — its weakness is that it structurally can't return mid-heap holes, not that it's wasteful by default.
Final words:
Remember that retained memory is a cache that flips to a liability when constrained. Most deployments are constrained because RAM is expensive. Long-running processes and workers with no built-in recycling suffer most from fragmentation, and choosing the right allocator and configuration can help.
This is why container deployments tune memory down — capping
glibc's arena count with MALLOC_ARENA_MAX,
lowering its MALLOC_TRIM_THRESHOLD_, and shortening
jemalloc/mimalloc decay timers: they trade a slice of
allocation speed — which did not show up in p99 for this
service — to keep the footprint under the limit.
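As a rough sketch of what such tuning looks like in practice, the helper below returns environment overrides per allocator. The variable names are the ones each library reads, as far as I know, but every value is an illustrative assumption and none of these settings were measured in this study. `tuned_env` is a hypothetical helper name.

```python
def tuned_env(allocator):
    """Return illustrative environment overrides for a memory-lean setup.

    All values are assumptions for demonstration, not recommendations.
    """
    presets = {
        # glibc: cap the arena count and set a fixed, lower trim threshold (bytes).
        "glibc": {"MALLOC_ARENA_MAX": "2", "MALLOC_TRIM_THRESHOLD_": "65536"},
        # jemalloc: shorten decay so dirty pages are returned sooner (milliseconds).
        "jemalloc": {"MALLOC_CONF": "dirty_decay_ms:1000,muzzy_decay_ms:0"},
        # mimalloc: purge freed pages without delay (milliseconds).
        "mimalloc": {"MIMALLOC_PURGE_DELAY": "0"},
    }
    return dict(presets[allocator])


if __name__ == "__main__":
    for name in ("glibc", "jemalloc", "mimalloc"):
        print(name, tuned_env(name))
```

These overrides belong in the process environment at startup, alongside `LD_PRELOAD` for the non-glibc allocators. Option names can differ between library versions, so check them against the version you actually deploy.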