frn.sh

Sigterm a D state process

Load average hit 12 on a 2 vCPU machine during a production incident. My first thought was that CPU must be the bottleneck - 12 is 6x the core count.

It wasn’t.

Linux load average counts three things: processes running on a CPU, processes waiting in the run queue, and processes in uninterruptible sleep - D state. From the kernel source (kernel/sched/loadavg.c):

The global load average is an exponentially decaying average of
nr_running + nr_uninterruptible.

Regular sleep doesn’t count, so if load average says 12, those 12 were either running, runnable, or in D. A system doing heavy disk I/O can show high load average with near-zero CPU utilization.
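A rough sketch of that arithmetic, to make "exponentially decaying average" concrete. The constants are assumed from the kernel's include/linux/sched/loadavg.h (11-bit fixed point, one sample every 5 seconds, 1-minute window); the function names are mine, not the kernel's. It shows that a steady 12 tasks in R or D drag the figure toward 12 no matter how idle the CPUs are.

```shell
#!/bin/sh
# Sketch of the kernel's fixed-point load average update (1-minute window).
# Constants assumed from include/linux/sched/loadavg.h.
FSHIFT=11
FIXED_1=$((1 << FSHIFT))   # 1.0 in fixed point (2048)
EXP_1=1884                 # 1/exp(5sec/1min) in fixed point

# calc_load_step LOAD ACTIVE
# LOAD is the current fixed-point load, ACTIVE is a plain task count
# (nr_running + nr_uninterruptible). Prints the new fixed-point load.
calc_load_step() {
    cl_load=$1
    cl_active=$(( $2 * FIXED_1 ))
    cl_new=$(( cl_load * EXP_1 + cl_active * (FIXED_1 - EXP_1) ))
    # Round up while the load is rising, as the kernel does.
    if [ "$cl_active" -ge "$cl_load" ]; then
        cl_new=$(( cl_new + FIXED_1 - 1 ))
    fi
    echo $(( cl_new / FIXED_1 ))
}

# fmt_load FIXED: print a fixed-point load with two decimal places.
fmt_load() {
    printf '%d.%02d\n' $(( $1 >> FSHIFT )) $(( (($1 & (FIXED_1 - 1)) * 100) >> FSHIFT ))
}

# simulate ACTIVE SAMPLES: start from zero load and feed a constant task
# count for SAMPLES updates (one update = 5 seconds).
simulate() {
    sim_load=0
    sim_i=0
    while [ "$sim_i" -lt "$2" ]; do
        sim_load=$(calc_load_step "$sim_load" "$1")
        sim_i=$((sim_i + 1))
    done
    fmt_load "$sim_load"
}

simulate 12 12     # one minute of samples: roughly 63% of the way to 12
simulate 12 120    # ten minutes: effectively 12
```

Nothing in the update looks at CPU time. The only input is a count of tasks, which is why disk-bound processes in D can produce the same number as CPU-bound ones.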

Fine. My processes were waiting on disk, disk waits put processes in D state, D state counts toward load average, and load average was 12. Makes sense.

Then I killed them. I sent SIGTERM and they died immediately.

That shouldn’t work. If a process is in D state, signals get queued - the process doesn’t respond until the kernel wakes it up. That’s the whole point of uninterruptible sleep: the kernel is protecting an operation that can’t be safely interrupted. Not even SIGKILL works.

I tested it. Wrote a kernel module that forces a process into D state:

/* Mark the current task uninterruptible, then sleep for up to 60 seconds. */
set_current_state(TASK_UNINTERRUPTIBLE);
schedule_timeout(60 * HZ);

Loaded it, confirmed the process was in D, and sent SIGKILL.

root@debian:/proc/1506781# kill -9 1506781
root@debian:/proc/1506781# cat status | grep -i SigPnd
SigPnd: 0000000000000100

Signal pending, process still alive, still in D. Bit 8 (counting from zero) is set - that’s signal 9, SIGKILL. It sat there, unkillable, for 60 seconds until the timeout expired and the kernel woke it up. Only then did it process the signal and die.
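For reading those masks without counting hex digits, here is a small sketch. In the SigPnd field, bit n (counting from zero) stands for signal n+1. The helper name decode_sigmask is made up for this post, and it assumes the mask fits in the shell's signed 64-bit integer, which holds unless signal 64 is pending.

```shell
#!/bin/sh
# decode_sigmask HEX
# Print the signal numbers set in a mask from /proc/<pid>/status
# (SigPnd, ShdPnd, SigBlk, ...). Bit n, counting from zero, is signal n+1.
decode_sigmask() {
    ds_mask=$(( 0x$1 ))
    ds_sig=1
    ds_out=""
    while [ "$ds_sig" -le 63 ]; do
        if [ $(( (ds_mask >> (ds_sig - 1)) & 1 )) -eq 1 ]; then
            ds_out="$ds_out $ds_sig"
        fi
        ds_sig=$((ds_sig + 1))
    done
    # Unquoted on purpose: drops the leading space.
    echo $ds_out
}

decode_sigmask 0000000000000100    # prints 9
```

Passing the result to kill -l turns the number into a signal name.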

Even strace was no help. It reported the process as attached and then hung, ignoring Ctrl-C - ptrace uses the same mechanism, and nothing gets through (traces of despair):

root@debian:~# strace -c -p 1506781
strace: Process 1506781 attached
^C^C
^C^C^C^C

So here’s the contradiction. Load average was 12, processes were waiting on disk, I killed them with SIGTERM and they died instantly. But processes in D state don’t die on signals. If they died on SIGTERM, they were probably in S state - but S state doesn’t count toward load average.

A few possible explanations.

Processes don’t stay in one state. A process doing disk I/O goes D → S → D → S as individual reads complete and new ones start. Load average samples every 5 seconds, so it could catch them in D at sampling time even if they spend most of their time in S. The ones I killed might have been in an S window when the signal arrived.
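A sketch of how the flapping could be observed: poll the state field of /proc/<pid>/stat and tally what comes back. The helper names are mine. It assumes a Linux /proc and a sleep that accepts fractional seconds (GNU coreutils does). Polling from userspace is far coarser than the kernel's own view, so treat the tally as a hint, not a measurement.

```shell
#!/bin/sh
# state_of_stat_line LINE
# Extract the one-letter state from a /proc/<pid>/stat line. The command
# name sits in parentheses and may itself contain spaces or parentheses,
# so cut at the last ") " instead of splitting on whitespace.
state_of_stat_line() {
    ss_rest=${1##*") "}
    echo "${ss_rest%% *}"
}

# sample_states PID COUNT INTERVAL
# Read the state COUNT times, INTERVAL seconds apart, and print a tally
# such as "  37 D" and "  63 S".
sample_states() {
    sp_n=0
    while [ "$sp_n" -lt "$2" ]; do
        # The process may exit between samples; stop quietly if it does.
        read -r sp_line 2>/dev/null < "/proc/$1/stat" || break
        state_of_stat_line "$sp_line"
        sp_n=$((sp_n + 1))
        sleep "$3"
    done | sort | uniq -c
}

# Usage (hypothetical PID):
#   sample_states 1506781 100 0.01
```

A process that shows a mix of D and S here would be consistent with this explanation: in D often enough to be counted at the 5-second sample, in S often enough to take a signal.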

Or run queue pressure - there were 77 processes on 2 cores. Even if most are sleeping, the ones that wake up create queue depth, and some of those 12 could have been runnable processes waiting for a core.
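To tell these apart during an incident, a snapshot of how many tasks are in R versus D is a reasonable first check. This is a minimal sketch assuming a procps-style ps; count_load_contributors is a name I made up.

```shell
#!/bin/sh
# count_load_contributors
# Read one process state per line on stdin and count the two states that
# feed load average: R (running or runnable) and D (uninterruptible sleep).
count_load_contributors() {
    awk '{ s = substr($1, 1, 1) }
         s == "R" { r++ }
         s == "D" { d++ }
         END { printf "R=%d D=%d\n", r, d }'
}

# Usage, assuming procps ps:
#   ps -e -o state= | count_load_contributors
```

The ps process counts itself as R, so subtract one. A single snapshot can also miss short-lived states, so it is worth running a few times.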

Or TASK_KILLABLE. Since 2.6.25 there’s a third sleep state: TASK_UNINTERRUPTIBLE | TASK_WAKEKILL. It shows up as D in /proc and counts toward load average, but it dies on fatal signals. A SIGTERM that the process doesn’t handle counts as fatal, since its default action is to terminate. Some filesystem code paths use it. If the I/O waits were hitting TASK_KILLABLE paths, that would explain everything - processes appear in D, contribute to load average, but die on signal.
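/proc shows D for both kinds of sleep, so the state letter alone can't settle this. A possible starting point is to list where each D-state task is sleeping, using the wchan column. This is a sketch assuming a procps-style ps; the helper name d_state_wchan is hypothetical, and some kernels report wchan as "-" or "0". The function name it prints doesn't say whether the sleep is killable. It only says which kernel function to go and read.

```shell
#!/bin/sh
# d_state_wchan
# Read "pid state wchan" lines on stdin and print the pid and wait channel
# of every task currently in D state.
d_state_wchan() {
    awk '$2 == "D" { print $1, $3 }'
}

# Usage, assuming procps ps (separate -o options keep the headers empty):
#   ps -e -o pid= -o state= -o wchan= | d_state_wchan
```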

That last one is the most satisfying answer, and it would resolve the contradiction completely.

I haven’t traced the kernel code path to confirm it. So I’m not going to say that’s what happened.