Various ways memory misbehaves

This post explores the ways memory misbehaves in production. It’s an examination of possibilities, grounded in how things actually work.

I originally wanted to write a Q&A post about practical cases of memory behavior in production, but walking through scenarios turned out to be the better choice. This is that post.

The allocator keeps the pages - in use bytes

The application calls free(). glibc marks the space as available internally but doesn’t return the page to the kernel. The kernel still sees a resident, dirty, anonymous page. Rss doesn’t move.

What would make glibc return the page to the kernel?

I found two ways: malloc_trim() and munmap(). The first works only on the top of the heap, since free pages in the middle cannot be trimmed. munmap() deletes the mappings for a specific address range. Pretty direct.

man 3 malloc_trim. man 2 mmap (search for munmap).

This is normal behavior. glibc expects the application to allocate again soon. Returning pages to the OS via madvise() or munmap() and then requesting them again is expensive - it means page faults, TLB flushes, zeroing.

What happens when a process reads or writes memory

The MMU (a separate piece of hardware) translates virtual addresses to physical ones by walking the page table. If the translation is cached in the TLB, it’s fast - the kernel isn’t involved at all.

If the MMU finds no PTE for that address, it raises a page fault, then the kernel handles it: finds which VMA owns the address, allocates a physical frame from the buddy allocator, fills it (from disk if file-backed, with zeros if anonymous), creates the PTE, and returns. The CPU retries. Now that page is resident.

This is “demand paging”. We have address ranges in VMAs, but physical frames aren’t allocated until the process actually touches them. So a 100 MB mmap adds 100 MB to the process’s virtual size, but not to its resident memory, which grows only on first access.

Returning pages is expensive because when a page is returned the kernel removes its PTEs and invalidates TLB entries. When the application allocates again, every access to those addresses faults one more time - the kernel allocates a frame, zeroes it, creates the PTE, all over again. glibc tries to avoid this by keeping pages mapped and reusing them internally.

But this creates a gap. After a traffic spike the application goes idle, all objects are freed, and Rss is still at the spike level. You can find the gap in /proc/pid/smaps: Private_Dirty pages with no live objects in them. But the kernel has no idea what’s being used and what isn’t. So, from the kernel’s side, we are blind here.

We need userspace to tell us what’s really happening, because malloc() and free() are libc’s responsibility. malloc_stats() shows it - “system bytes” is what the kernel gave, “in use bytes” is what the application is actually using. The difference is the pages glibc is holding.

I ran python3 -m http.server again and called malloc_stats() through gdb:

root@debian:~# gdb --batch --pid 1161684 -ex 'call (void)malloc_stats()'
0x00007f1f94435687 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
Arena 0:
system bytes     =    2543616
in use bytes     =    2215952
Total (incl. mmap):
system bytes     =    3362816
in use bytes     =    3035152
max mmap regions =          4
max mmap bytes   =    1028096
[Inferior 1 (process 1161684) detached]

The allocator keeps the pages - brk()

The heap (brk region) makes this worse. brk() grows the heap upward but can only shrink it from the top. One live allocation near the top pins everything below it. mmap-backed allocations don’t have this problem - each one is an independent mapping that goes away on munmap().

glibc switches between brk() and mmap() based on allocation size. The threshold defaults to 128 KB but adjusts dynamically. Small allocations go to the heap. Large ones go to mmap. If your application alternates between small and large allocations, the behavior is harder to predict.

If you suspect the heap is pinned, compare the heap VMA size with what malloc_stats() reports for Arena 0. Arena 0 is the main arena, which uses brk(). If Arena 0 shows “system bytes” = 2.5 MB but “in use bytes” = 500 KB, there’s 2 MB of free space inside the heap that glibc can’t return because something near the top is pinning it. As for the exact location of the pinning allocation, you’d need to walk glibc’s internal data structures with gdb to find it. Not something you do in production.

Arena fragmentation

glibc creates separate memory pools (arenas) for different threads to avoid lock contention on malloc(). The default limit is 8 x number of CPU cores. On a 4-core machine, that’s up to 32 arenas.

Each arena grabs memory from the OS in large chunks. When a thread frees memory, glibc reclaims it inside that arena but doesn’t necessarily return it to the kernel. If allocations are spread across 32 arenas and most of the live data ends up in 3 of them, the other 29 are mostly empty but still mapped.

We hit this with WAHA. The fix was MALLOC_ARENA_MAX=2. Memory usage dropped 40%. The same pattern shows up in Java (glibc arenas underneath the JVM), Ruby (the GVL makes multiple arenas pointless), and Python.

You can see it in /proc/pid/smaps. Count the anonymous rw-p mappings. If there are hundreds or thousands of similarly-sized regions, that’s arenas. malloc_info() (the XML version of malloc_stats()) shows per-arena breakdown.

Reclaim has three layers

We have three layers: the kernel, the allocator (libc), and the runtime.

The kernel reclaims pages when free memory drops below a watermark. kswapd wakes up and starts evicting: file-backed clean pages get dropped (they can be reloaded from disk), file-backed dirty pages get flushed then dropped, anonymous pages go to swap if swap exists. Gorman explains zones, watermarks and kswapd very well.

The allocator (glibc) sits between the kernel and the application. It gets pages from the kernel via brk() and mmap(), carves them into chunks, and hands them to the application. When the application calls free(), glibc reclaims the chunk internally but usually keeps the page. malloc_trim() can return free pages at the top of arenas to the kernel, but nobody calls it automatically. glibc has an internal trim threshold - after free(), if free space at the top of the heap exceeds 128 KB (configurable via M_TRIM_THRESHOLD), it trims. But this only works for the top. Fragmented free space in the middle stays.

The runtime (V8, JVM, Go) manages its own heap on top of the allocator. It mmap()s large regions, subdivides them internally, and runs garbage collection to free dead objects. When GC runs, it frees space inside the runtime’s heap. Whether those pages go back to the OS depends on the runtime and its mood.

These layers don’t talk to each other. The kernel doesn’t tell glibc “I need pages back.” glibc doesn’t tell V8 “you’re holding too much.” V8 doesn’t ask the kernel “are you under pressure.” The only signal that flows downward is destruction - the kernel swaps pages out or kills the process. There’s no polite negotiation.

cgroups v2 introduced memory.pressure (PSI - Pressure Stall Information), which lets a process monitor memory pressure and react before the OOM killer arrives. Almost no runtime uses it by default. Some JVM builds and systemd have experimental support. For now, most applications are blind to how close they are to being killed.

In containers, this matters more. The cgroup memory limit brings the OOM killer closer. The kernel sees 4 GB available on the node, but your pod has a 512 MiB limit. All three layers are holding memory because none of them see pressure - and then the cgroup OOM killer fires.


GC delay

Garbage-collected runtimes allocate memory from the OS and manage it internally. The GC frees dead objects but doesn’t necessarily return pages to the OS. And it doesn’t always run when you’d expect.

V8 (Node.js) has a default heap limit in the gigabytes. If heapUsed is 120 MiB, V8 has no reason to collect aggressively. Dead objects accumulate. The pages stay mapped. This is what we saw in the WAHA incident - 252 MiB of collectible garbage that V8 didn’t bother collecting because it was nowhere near its limit.

process.memoryUsage().heapUsed tells you live objects. It doesn’t tell you how much memory V8 has grabbed from the OS. Those are different numbers. heapUsed can be 120 MiB while the OS sees 400 MiB of anonymous pages belonging to V8.

Java is similar. The JVM rarely returns heap memory to the OS. -Xmx sets the ceiling, and the JVM grows toward it and stays there. G1GC added uncommit support, but it’s conservative. Go is better about this - it returns memory via madvise(MADV_FREE) - but the timing is unpredictable.

The fix depends on the runtime. For Node.js, --max-old-space-size puts pressure on the GC. For Java, tuning -XX:MaxHeapFreeRatio or switching to Shenandoah/ZGC. For Go, runtime.GC() exists but you usually don’t need it.

Page cache confusion

The kernel uses free RAM to cache file contents. After a process reads a large file, the pages stay in the page cache. free(1) shows low “free” memory and people think the system is out of RAM. It’s not. The “available” column is what matters - the kernel will drop cache pages the moment a process needs them.

This is the most common false alarm. The system is working correctly. The cache is using memory that would otherwise be wasted.

But there are real problems in this area too. A process that mmap()s a large file and accesses it randomly will pull pages into the cache aggressively. If the file is bigger than RAM, it can evict pages from other processes. This is the page cache thrashing problem. The kernel’s LRU approximation doesn’t always handle this well. posix_fadvise(POSIX_FADV_DONTNEED) tells the kernel to drop specific pages. In practice, people solve this with cgroups memory limits or by switching from mmap to read() with controlled buffer sizes.

Shared memory accounting

We showed that Shared_Clean pages in smaps_rollup are shared library code. Ten processes using libc share one copy in RAM, but all ten count it in their Rss. This makes per-process memory reporting misleading.

In containers, this matters. kubectl top shows container Rss. If you have 5 pods on a node, each reporting 300 MiB, you might think you’re using 1.5 GB. But 50 MiB of each pod is shared library code - the same physical pages counted five times. Actual usage is closer to 1.3 GB. The gap grows with the number of pods.

Pss (Proportional Set Size) divides shared pages across the processes that use them. It’s a better number for capacity planning. smaps_rollup has it. Most monitoring tools don’t use it.

The kernel holds memory too

Not all memory growth is the application. The kernel allocates memory for its own data structures - slab caches for dentries, inodes, network buffers, file descriptors. These grow under load and shrink slowly.

A process that opens and closes millions of files will grow the dentry cache. The kernel will reclaim it under memory pressure, but not eagerly. slabtop shows the biggest kernel caches. /proc/meminfo’s SReclaimable vs SUnreclaim tells you how much the kernel can give back if needed.

File descriptor leaks are a variant of this. Every open socket has kernel-side buffers. Leak descriptors and you leak kernel memory. ls /proc/pid/fd | wc -l is the quick check.

tmpfs eating RAM

tmpfs is backed by memory, not disk. Anything written to /dev/shm, /tmp (if tmpfs), or /run stays in RAM (or swap). It doesn’t show up in any process’s Rss because it’s not mapped into a process - it’s in the page cache, attributed to the filesystem.

A cron job that writes temp files to /dev/shm and doesn’t clean them up is a slow memory leak that won’t show up in any per-process monitoring. df -h /dev/shm is the check.

MADV_FREE vs MADV_DONTNEED

A process can tell the kernel “I don’t need these pages anymore” without unmapping them. Two ways to do it, and they behave differently.

madvise(MADV_DONTNEED): the kernel reclaims the pages immediately. Next access causes a page fault and gets a fresh zero page. Rss drops.

madvise(MADV_FREE) (Linux 4.5+): the kernel marks the pages as lazily reclaimable. They stay in Rss until the kernel actually needs the memory. If the process writes to the page before the kernel reclaims it, the page is reused - no fault, no zeroing. This is faster for allocators that free and reallocate frequently.

Go switched to MADV_FREE in 1.12, then back to MADV_DONTNEED in 1.16 after users complained that Rss didn’t drop after GC. The memory was available to the system, but monitoring tools showed it as used. Go 1.16+ uses MADV_DONTNEED by default again.

This is purely an observability problem. The actual memory pressure is the same. But when your alerting fires on Rss, the distinction matters.

Each of these could be its own investigation. For now, this is the map.