clone, unshare, and setns

The Linux kernel exposes three system calls for working with namespaces: clone, unshare, and setns. clone creates a new process and lets you specify, through CLONE_NEW* flags, which namespaces the child shares with its parent and which are created fresh - this is what happens under the hood when you run a container. setns moves the calling process into an existing namespace, which is essentially what docker exec does. unshare moves the calling process into new namespaces without creating a new process, and because util-linux ships a command-line wrapper of the same name, it is also the easiest of the three to play with from the shell.
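From the shell, util-linux wraps two of these calls: unshare(1) around unshare(2) and nsenter(1) around setns(2). The sketch below is only an illustration, not how any container runtime does it. It parks a sleep process in a new UTS (hostname) namespace and then joins that namespace from outside. The function name uts_demo and the hostname demo-ns are made up for the example, and it assumes root (CAP_SYS_ADMIN); without that it prints a skip message instead.

```shell
#!/bin/sh
# Sketch: unshare(1) creates a namespace, nsenter(1) joins it via setns(2).
# uts_demo and "demo-ns" are hypothetical names chosen for this example.

uts_demo() {
    # Creating a UTS namespace needs CAP_SYS_ADMIN; probe before doing real work.
    if ! unshare --uts true 2>/dev/null; then
        echo "skipped: cannot create a UTS namespace (try as root)"
        return 0
    fi

    # Park a process in a new UTS namespace and change the hostname only in there.
    # Without --fork, unshare execs the command, so $! is the namespaced process.
    unshare --uts sh -c 'hostname demo-ns; exec sleep 30' >/dev/null 2>&1 &
    holder=$!
    sleep 1   # crude: give the child time to set its hostname

    # Join the holder's UTS namespace and print the hostname seen from inside.
    nsenter --target "$holder" --uts hostname

    # Clean up the parked process.
    kill "$holder" 2>/dev/null
}
```

Run as root, uts_demo should print demo-ns, while a plain hostname on the host still reports the original name.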

root@debian:~# unshare --pid --mount-proc --fork /bin/bash
root@debian:~# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   7196  3928 pts/1    S    00:31   0:00 /bin/bash
root           2  0.0  0.0  11084  4388 pts/1    R+   00:31   0:00 ps aux

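This runs bash as a child process in a new PID namespace, so process IDs start again at 1. There is no separate proc namespace: --mount-proc also creates a new mount namespace and mounts a fresh /proc inside it, which is why ps sees only the two processes in the namespace. --fork is needed because a new PID namespace applies only to children of the process that calls unshare, so the tool has to fork before running bash. If we compare namespace IDs between parent and child, we can see what’s shared and what isn’t: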

# Parent process
mnt -> 'mnt:[4026531841]'
net -> 'net:[4026531840]'
pid -> 'pid:[4026531836]'

# Child process
mnt -> 'mnt:[4026532895]'
net -> 'net:[4026531840]'
pid -> 'pid:[4026532897]'

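Of the three namespaces shown, only net has the same ID in both processes. mnt and pid differ because those are the two namespaces unshare was asked to create; everything else stays shared with the parent. But seeing different IDs in /proc is abstract. I wanted to test what isolation actually means in practice.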
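The ID comparison itself is easy to script, because each entry under /proc/<pid>/ns is a symlink whose target names the namespace. This is a minimal sketch: compare_ns is a hypothetical helper name, not an existing tool, and reading the links of a process you do not own generally requires root.

```shell
#!/bin/sh
# Sketch: compare the namespaces of two processes via /proc/<pid>/ns.
# compare_ns is a hypothetical helper name, not an existing tool.

compare_ns() {
    pid_a=$1
    pid_b=$2
    for link in /proc/"$pid_a"/ns/*; do
        ns=${link##*/}                      # entry name, e.g. "mnt", "net", "pid"
        id_a=$(readlink "$link" 2>/dev/null)
        id_b=$(readlink "/proc/$pid_b/ns/$ns" 2>/dev/null)
        if [ -z "$id_a" ] || [ -z "$id_b" ]; then
            echo "$ns unknown"              # no permission, or no such process
        elif [ "$id_a" = "$id_b" ]; then
            echo "$ns shared $id_a"
        else
            echo "$ns isolated $id_a $id_b"
        fi
    done
}
```

Called as compare_ns with the PID of the host shell and the PID of the bash started by unshare, it prints one line per namespace type, marking each as shared or isolated.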

I created a network namespace with ip netns, entered it, and started a server:

root@debian:~# ip netns add fakens
root@debian:~# ip netns exec fakens bash
root@debian:~# ip link set lo up
root@debian:~# nc -l 80

From the host:

➜  ~ nc localhost 80
localhost [127.0.0.1] 80 (http) : Connection refused

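Connection refused. The host and the namespace have completely separate network stacks, each with its own interfaces, routing tables, iptables rules, and socket listings. That includes loopback: the host’s connection went to the host’s own 127.0.0.1, where nothing is listening on port 80, while the server is bound inside fakens. The namespace didn’t just hide the host’s interfaces; it gave the process inside it an entirely new protocol stack.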
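One way to see that a namespace starts from an empty stack, rather than a filtered view of the host’s, is to list what a brand-new one contains. This is a minimal sketch assuming iproute2 and root. The function name netns_demo and the demo-<pid> namespace name are made up for the example, and it prints a skip message when it cannot create or enter a namespace.

```shell
#!/bin/sh
# Sketch: show what a brand-new network namespace contains.
# netns_demo and the "demo-<pid>" namespace name are hypothetical.

netns_demo() {
    ns="demo-$$"
    # ip netns add needs root; also probe that commands can run inside it.
    if ! ip netns add "$ns" 2>/dev/null; then
        echo "skipped: cannot create a network namespace (try as root)"
        return 0
    fi
    if ! ip netns exec "$ns" true 2>/dev/null; then
        ip netns delete "$ns" 2>/dev/null
        echo "skipped: cannot enter the network namespace"
        return 0
    fi

    echo "interfaces:"
    ip netns exec "$ns" ip -o link show    # typically just lo, in state DOWN
    echo "routes:"
    ip netns exec "$ns" ip route show      # main routing table, empty at first

    # Remove the temporary namespace again.
    ip netns delete "$ns"
}
```

None of the host’s interfaces or routes appear in the output, which is why lo has to be brought up by hand inside fakens before the server there can accept local connections.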

This is the experiment that made namespaces click for me. PID namespaces are easy to demonstrate but hard to feel - you see a different process tree, but so what? With network namespaces you can actually prove the isolation by trying to connect and failing. The process inside the namespace is running on the same machine, yet it is genuinely unreachable from the host over the network, separated only by a kernel abstraction.