frn.sh

They died on SIGTERM

Load average hit 12 on a 2 vCPU machine during a production incident. My first thought: CPU is the bottleneck. 12 is 6x the core count.

It wasn’t.

Linux load average counts three things: processes running on a CPU, processes waiting in the run queue, and processes in uninterruptible sleep. D state. From the kernel source:

The global load average is an exponentially decaying average of
nr_running + nr_uninterruptible.

Regular sleep doesn’t count. So if load average says 12, those 12 were either running, runnable, or in D. A system doing heavy disk I/O can show high load average with near-zero CPU utilization.
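Both halves of that are visible from userspace. A quick check on any Linux box with /proc (state letters per proc(5); this is a sketch, not what I ran during the incident):

```shell
# The three decaying averages live in /proc/loadavg; the fourth
# field is running-threads/total-threads at this instant.
cat /proc/loadavg

# Count the tasks that actually feed the number: R (runnable) and
# D (uninterruptible). Everything in S is invisible to load average.
awk '$1 == "State:" && ($2 == "R" || $2 == "D") { c++ } END { print c + 0 }' \
  /proc/[0-9]*/status
```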

Fine. My processes were waiting on disk. Disk waits put processes in D state. D state counts toward load average. Load average was 12. Makes sense.

Then I killed them.

I sent SIGTERM. They died immediately.

That shouldn’t work. If a process is in D state, signals get queued. The process doesn’t respond until the kernel wakes it up. That’s the whole point of uninterruptible sleep - the kernel is protecting an operation that can’t be safely interrupted. Not even SIGKILL works.

I tested it. Wrote a kernel module that forces a process into D state:

/* inside the module: put the current task into D state, then
   sleep for 60 seconds with signals locked out the whole time */
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(60 * HZ);

Loaded it. Confirmed the process was in D. Sent SIGKILL.

root@debian:/proc/1506781# kill -9 1506781
root@debian:/proc/1506781# cat status | grep -i SigPnd
SigPnd: 0000000000000100

Signal pending. Process still alive. Still in D. Bit 8 set - signals are numbered from 1, so bit 8 is signal 9, SIGKILL. It sat there, unkillable, for 60 seconds until the timeout expired and the kernel woke it up. Only then did it process the signal and die.
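Decoding the mask by hand: SigPnd is a hex bitmask where bit n-1 set means signal n is pending. A throwaway sketch:

```shell
# SigPnd from /proc/PID/status: bit (n-1) set means signal n pending.
mask=$((0x0000000000000100))

# Scan for the lowest set bit.
sig=1
while [ $(( (mask >> (sig - 1)) & 1 )) -eq 0 ]; do
  sig=$((sig + 1))
done
echo "pending signal: $sig"   # bit 8 -> signal 9 -> SIGKILL
```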

Even strace got stuck. It printed "attached", but the tracee never stopped for it - ptrace waits on the same wakeup mechanism. Nothing gets through.

root@debian:~# strace -c -p 1506781
strace: Process 1506781 attached
^C^C
^C^C^C^C

So here’s the contradiction:

Load average was 12. Processes were waiting on disk. I killed them with SIGTERM and they died instantly. But processes in D state don’t die on signals.

If they died on SIGTERM, they were probably in S state. But S state doesn’t count toward load average.

A few possible explanations.

Processes don’t stay in one state. A process doing disk I/O goes D → S → D → S as individual reads complete and new ones start. Load average samples every 5 seconds. It could catch them in D at sampling time even if they spend most of their time in S. The ones I killed might have been in an S window when the signal arrived.
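That flip is observable, assuming /proc: field 3 of /proc/PID/stat is the one-letter state, so polling it catches whichever side of the oscillation the process is on when you look. A sketch, using this shell's own PID as a stand-in:

```shell
# Field 3 of /proc/PID/stat is the state letter: R, S, D, Z, T, ...
# (valid as long as the comm field has no spaces). Poll it to catch
# a process mid-flip between D and S.
pid=$$   # stand-in; point this at a real worker PID
for _ in 1 2 3; do
  awk '{ print $3 }' "/proc/$pid/stat"
  sleep 0.2
done
```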

Or run queue pressure. 77 processes on 2 cores. Even if most are sleeping, the ones that wake up create queue depth. Some of those 12 could have been runnable processes waiting for a core.
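Queue depth is also directly visible, assuming /proc: /proc/stat carries instantaneous counters alongside the decayed averages.

```shell
# Instantaneous counts, not averages: procs_running is run-queue
# occupancy across all CPUs; procs_blocked is tasks currently
# blocked waiting for I/O.
grep -E '^procs_(running|blocked)' /proc/stat
```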

Or TASK_KILLABLE. Since 2.6.25, there’s a third sleep state: TASK_UNINTERRUPTIBLE | TASK_WAKEKILL. It shows up as D in /proc. It counts toward load average. But it dies on fatal signals. Filesystem code paths opt into it through helpers like wait_event_killable() and mutex_lock_killable(). If the I/O waits were hitting TASK_KILLABLE paths, that would explain everything. Processes appear in D, contribute to load average, die on signal.
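One way to narrow it down next time, assuming /proc: /proc/PID/wchan names the kernel function a sleeping task is blocked in ("0" for a running task). For a D-state process that symbol identifies the actual wait path, which is the quickest hint as to whether it went through a killable wait or a plain uninterruptible one.

```shell
# wchan: the kernel symbol a sleeping task is blocked in. For a real
# stuck process, a name like rpc_wait_bit_killable suggests a
# TASK_KILLABLE path; a block-layer symbol suggests plain D.
pid=$$   # stand-in; use the stuck process's PID
cat "/proc/$pid/wchan"; echo
```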

That last one is the most satisfying answer. It resolves the contradiction completely.

I haven’t traced the kernel code path to confirm it. So I’m not going to say that’s what happened.