frn.sh

Load average lied to me

During a production incident I watched load average hit 12 on a 2 vCPU machine. That’s 6x the core count. My first instinct: CPU is the bottleneck, processes are fighting for time on the cores. But they weren’t.

What load average actually counts

Linux load average is not a CPU metric. It counts three things: processes running on a CPU, processes waiting in the run queue, and processes in uninterruptible sleep (D state). From the kernel source:

 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.

Processes in regular sleep (S state) don’t count. So if load average says 12, those 12 processes were either running, runnable, or in D.

The D state is what makes Linux load average different from most other Unixes. A system doing heavy disk I/O can show high load average with near-zero CPU utilization. The number stops meaning “CPU pressure” and starts meaning “system demand” in a broader sense.

So far so good. My processes were waiting on disk. Disk waits often put processes in D state. D state counts toward load average. Load average was 12. Things are making sense.

Maybe.

The signal test

During the incident I killed the waiting processes. Sent them SIGTERM and they died immediately.

That’s the problem. If a process is in D state, signals get queued. The process doesn’t respond until the kernel wakes it up. That’s the entire point of uninterruptible sleep - the kernel is protecting an operation that can’t be safely interrupted. Not even SIGKILL works.

Damn it. Ok. I tested it.

Proving D state blocks signals

I wrote a kernel module that forces a process into D state:

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/delay.h>

static int __init dstate_init(void) {
    /* Mark the task uninterruptible, then sleep for 60 seconds.
     * insmod blocks inside module init, so it sits in D the whole time. */
    set_current_state(TASK_UNINTERRUPTIBLE);
    schedule_timeout(60 * HZ);
    return 0;
}

static void __exit dstate_exit(void) {}

module_init(dstate_init);
module_exit(dstate_exit);
MODULE_LICENSE("GPL");

set_current_state(TASK_UNINTERRUPTIBLE) marks the task as uninterruptible. schedule_timeout(60 * HZ) puts it to sleep for 60 seconds and yields the CPU. The process will sit in D state until the timeout expires.

Built it and loaded it:

make
insmod dstate.ko &
[1] 1506781

Confirmed the state:

root@debian:/proc/1506781# cat status
Name:   insmod
Umask:  0022
State:  D (disk sleep)
SigPnd: 0000000000000000

No pending signals. Now send SIGKILL:

root@debian:/proc/1506781# kill -9 1506781
root@debian:/proc/1506781# cat status | grep -i SigPnd
SigPnd: 0000000000000100

The signal is pending. The process didn’t die. It’s still in D. The hex value 0000000000000100 means bit 8 is set - that’s SIGKILL (signal 9, zero-indexed at bit 8).

The process sat there, unkillable, until the 60-second timeout expired and the kernel woke it up. Only then did it process the pending signal and die.

Even strace couldn’t attach (traces of despair):

root@debian:~# strace -c -p 1506781
strace: Process 1506781 attached
^C^C
^C^C^C^C

Nothing. strace uses ptrace(2), which also can’t interrupt D state.

The contradiction

So here’s what I know:

  1. Load average was 12 on a 2 vCPU machine.
  2. Most processes were waiting on disk I/O.
  3. I killed them with SIGTERM and they died immediately.
  4. Processes in D state don’t die on signals.

If they died on SIGTERM, they were probably in S state (interruptible sleep). But S state doesn’t count toward load average. So where did the load come from?

Possible explanations

State transitions. Processes don’t stay in one state permanently. A process doing disk I/O might bounce D→S→D→S as individual read requests complete and new ones start. The kernel folds per-CPU run queue counts every tick (calc_global_load_tick) and updates the global load average every 5 seconds, so it could catch processes in D at sampling time even if they spend most of their time in S. The ones I killed might have been in an S window when I sent the signal.

Run queue pressure. 77 processes on 2 cores. Even if most are sleeping, the ones that wake up to do work create run queue depth. Context switching overhead is real. Some of those 12 could have been runnable processes waiting for a core, not sleeping processes at all.

Mixed population. Some processes in D, some in S, some runnable. The ones I killed happened to be interruptible at that moment. The load average reflected the aggregate, not any single process.

TASK_KILLABLE. Since Linux 2.6.25, there’s a third sleep state: TASK_KILLABLE (technically TASK_UNINTERRUPTIBLE | TASK_WAKEKILL). It shows up as D in /proc/PID/status and counts toward load average, but the process can be killed by fatal signals. Some filesystem code paths use it. If the I/O waits were hitting TASK_KILLABLE code paths internally, that would explain everything: processes appear in D, contribute to load average, but die on signal. I haven’t confirmed whether the specific code paths involved use TASK_KILLABLE or plain TASK_UNINTERRUPTIBLE.
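If I wanted to test that hypothesis, the earlier module could be modified to sleep killably instead. This is a sketch I haven’t run, and kstate_init is a hypothetical name:

```c
/* Untested variant of dstate_init(): still shows as D in ps and still
 * counts toward load average, but a fatal signal ends the sleep early. */
static int __init kstate_init(void) {
    /* schedule_timeout_killable() sets TASK_KILLABLE
     * (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE) before sleeping. */
    schedule_timeout_killable(60 * HZ);
    return 0;
}
```

If this version dies on kill -9 while the original sits for the full 60 seconds, that would demonstrate the D-state-but-killable behavior directly.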

What I still don’t know

I don’t have a clean answer. I know the processes died on SIGTERM. I know load average was 12. I don’t fully understand the relationship between those two facts yet.

The TASK_KILLABLE explanation is the most satisfying. It resolves the contradiction completely. But I haven’t traced the kernel code path to confirm it, so I’m not going to pretend I have.

What I do know: load average on Linux is weirder than it looks. It’s not CPU utilization. It’s not even “demand for CPU.” It includes disk I/O waiters, and the boundary between interruptible and uninterruptible sleep is blurrier than the textbook version suggests.