Why can a single small allocation prevent hundreds of megabytes of freed heap memory from being returned to the kernel?
If your heap grows via brk(), the kernel only knows one number: where the program break is. It's a high-water mark. The only way to give heap memory back is to lower that mark, which releases pages from the top down.

So if you free a 500MB chunk in the middle of the heap but there's a 64-byte allocation sitting above it, the break can't move. Those pages stay mapped. Rss stays high.

That tells you where to look: if Rss won't drop after freeing, check whether a small surviving allocation near the top of the heap is pinning everything below. The allocator can't lower the break past it.
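
A toy model can make the pinning concrete. The sketch below is not how glibc is implemented: the Chunk class, the lowest_possible_break() helper and the 4096-byte page size are all made up for illustration, and real allocators also apply padding and a trim threshold before lowering the break.

```python
from dataclasses import dataclass

PAGE = 4096  # assumed page size for the toy model
MB = 1024 * 1024


@dataclass
class Chunk:
    start: int   # offset from the heap base, in bytes
    size: int    # chunk size in bytes
    live: bool   # True while the application still holds the chunk


def lowest_possible_break(chunks):
    """Return the lowest offset the break could be lowered to.

    The break can only drop to just past the highest live chunk,
    rounded up to a page boundary. Free chunks below that stay mapped.
    """
    live_ends = [c.start + c.size for c in chunks if c.live]
    top = max(live_ends, default=0)
    return -(-top // PAGE) * PAGE  # round up to a whole page


heap = [
    Chunk(start=0, size=1 * MB, live=True),
    Chunk(start=1 * MB, size=500 * MB, live=False),  # the big freed chunk
    Chunk(start=501 * MB, size=64, live=True),       # small survivor on top
]

pinned = lowest_possible_break(heap)    # survivor holds the break up

heap[2].live = False                    # free the 64-byte survivor
released = lowest_possible_break(heap)  # now the break can fall to 1 MB
```

In this model the 500MB hole contributes nothing until the 64-byte chunk above it is gone. Only then can the break fall to the end of the first live chunk.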

Why does Rss stay high after an application "frees" memory?
free() normally just returns memory to the allocator's internal freelist - no syscall, no kernel involvement. The kernel only sees pages. It has no visibility into which bytes within a page are "free" in userspace. A page with one live byte is fully resident.

The allocator got those pages via brk() or mmap(). For mmap'd chunks (large allocations, above glibc's mmap threshold), free() calls munmap() directly - the kernel reclaims the frames, Rss drops. For heap-managed chunks, the pages stay mapped. brk() can only shrink from the top, so a single surviving allocation at the top of the heap pins everything below it. Even if 99% of the heap is logically free, the break can't move.

glibc can call madvise(MADV_DONTNEED) to release physical frames without unmapping - but only when conditions align (enough contiguous free space, trim threshold exceeded). The virtual mapping stays, Rss drops, next touch faults in a fresh zeroed page.
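
You can watch the "mapping stays, frames go" behaviour from userspace. This is a minimal sketch, assuming Linux and Python 3.8+ (for mmap.madvise). Python's mmap module stands in for the allocator here, and dontneed_demo() is a made-up helper name. It uses a private anonymous mapping, because that is the kind of memory where MADV_DONTNEED discards the contents.

```python
import mmap
import sys

PAGE = mmap.PAGESIZE
SIZE = 64 * PAGE


def dontneed_demo():
    """Touch a private anonymous mapping, drop its frames, read it back.

    Returns (byte_before, byte_after), or None when not on Linux or when
    this Python build does not expose MADV_DONTNEED.
    """
    if not sys.platform.startswith("linux") or not hasattr(mmap, "MADV_DONTNEED"):
        return None
    # fileno=-1 asks for anonymous memory; MAP_PRIVATE makes it private.
    region = mmap.mmap(-1, SIZE, flags=mmap.MAP_PRIVATE)
    try:
        region[:] = b"\xab" * SIZE          # touch every page so frames exist
        before = region[0]
        region.madvise(mmap.MADV_DONTNEED)  # frames released, mapping kept
        after = region[0]                   # read faults in a fresh zero page
        return before, after
    finally:
        region.close()                      # munmap: the mapping itself goes away


result = dontneed_demo()
```

The address range stays valid after the madvise() call, which is why the second read succeeds instead of faulting. The old contents are gone, though, so an allocator should only do this to pages that hold no live data.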

The gap between "allocator says it's free" and "kernel knows it's free" is where Rss inflation lives.

If two processes map the same shared library, what is shared and what is not?
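What is shared is the physical page frames backing the clean, file-backed pages. Two processes can both map libc.so, each at different virtual addresses if needed, while both PTE sets reference the same underlying frames. What is not shared is each process's page tables and any page it has written to: a write to a private mapping (the library's .data, for instance) triggers copy-on-write and gives that process its own frame. Rss counts every shared resident page in full in each process, as if that process owned it exclusively. Pss corrects for that by dividing each shared resident page by the number of processes mapping it.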
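
The Rss/Pss difference is just arithmetic, so a worked example may help. The helper name rss_and_pss_kib() and the page counts below are invented for illustration, not measurements from a real process.

```python
def rss_and_pss_kib(pages):
    """pages: list of (size_kib, mappers) for one process's resident pages.

    Rss counts every resident page in full. Pss charges each page
    size / mappers, where mappers is how many processes map that page.
    """
    rss = sum(size for size, _ in pages)
    pss = sum(size / mappers for size, mappers in pages)
    return rss, pss


# Hypothetical process: 400 resident pages of a shared library that
# 4 processes map, plus 100 private pages, all 4 KiB each.
pages = [(4, 4)] * 400 + [(4, 1)] * 100
rss, pss = rss_and_pss_kib(pages)
```

Here Rss is 2000 KiB but Pss is only 800 KiB, because each process is charged a quarter of the library. Summing Pss over all four processes counts the library once, while summing Rss would count it four times.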