What counts as CPU utilization?

What is really CPU utilization? Processes have multiple states: interruptible sleeping, uninterruptible sleeping, runnable, running. Which of them count as CPU load? In other words, which of them actually switch transistors and force the CPU act at the physical level (the fetch-decode-execute cycle)?

At the physical level, we have transistors making noise, only running instructions matter. At the software level, we have the kernel scheduler making decisions.

scheduler

This is how the kernel documentation describe the fairness of CFS:

In CFS the virtual runtime is expressed and tracked via the per-task p->se.vruntime (nanosec-unit) value.  This way, it's possible to accurately timestamp and measure the "expected CPU time" a task should have gotten.

Linux’s scheduler is interested if a task is entitled to CPU time. That’s what matters to CFS. The tasks that matter to the CFS are stored in the cfs_rq struct (sched/sched.h).

The Kernel checks the vruntime to see how much a task has run, and then decides if it should run now or later, compared to other tasks in the cfs_rq.

A task that is not running can still matter a lot to the scheduler¹. A task that runs briefly and blocks again may leave a visible footprint² in accounting long after it stopped executing instructions. Utilization, in this model, is related to scheduling pressure.

This is why loadavg³ considers running tasks + sleeping uninterruptible tasks when computing CPU load:

 * The global load average is an exponentially decaying average of nr_running +
 * nr_uninterruptible.

process states

Process states are not a description of what the CPU is doing. They describe what’s the internal state of a task in the kernel’s eyes.

A running task is executing instructions on a core. This is the only state that has a direct physical manifestation.

A runnable task is not running, but the kernel may choose it as the next running task. It sits in the runqueue and must be considered by the scheduler. Runnable does not mean “about to execute” - it means “ready to run” and “cannot be ignored by the Kernel” at the same time.

Blocked tasks split into two very different categoriess.

Interruptible sleeping tasks are effectively invisible to the scheduler. They do not compete for CPU time and do not contribute to scheduling pressure until something wakes them up.

Uninterruptible sleeping tasks are different. They are not executing instructions, but the scheduler must still account for them (the kernel uses the nr_uninterruptible counter to track them). They represent work that cannot be deferred or ignored, even though no instructions are currently running on their behalf.

A great representation of a uninterruptible task is this:

/**
 * wait_for_completion: - waits for completion of a task
 * @x:  holds the state of this particular completion
 *
 * This waits to be signaled for completion of a specific task. It is NOT
 * interruptible and there is no timeout.
 *
 * See also similar routines (i.e. wait_for_completion_timeout()) with timeout
 * and interrupt capability. Also see complete().
 */
void __sched wait_for_completion(struct completion *x)
{
	wait_for_common(x, MAX_SCHEDULE_TIMEOUT, TASK_UNINTERRUPTIBLE);
}
EXPORT_SYMBOL(wait_for_completion);

The kernel calls dequeue_task to sleep a task:

When a task is no longer runnable, this function is called to keep the corresponding scheduling entity out of the red-black tree. It decrements the nr_running variable.

When a task is sleeping, it pops out of the cfs_rq and is added to the a wait queue.

Stopped and interruptible sleeping tasks stay out of the runqueue because there’s nothing for the scheduler to do with them. While they’re sleeping, they don’t compete for CPU time, don’t add pressure, and don’t show up in loadavg. The interesting part is the wake-up. When many of them wake up at once, they all reenter the system as runnable, and that transition alone is enough to blow up the runqueue, scheduler utilization, and load average.

what’s actually running

At this point, it should be clear that there are two very different questions being asked:

which tasks are executing instructions right now?
how busy does the scheduler believe the CPU is?

The first question has a physical answer while the second does not. From the scheduler’s point of view, utilization is an estimate. It mixess what is happening now with what happened recently, and with what is likely to happen soon. This is explicit in this code commentary:

 * CPU utilization is the sum of running time of runnable tasks plus the
 * recent utilization of currently non-runnable tasks on that CPU.
 * It represents the amount of CPU capacity currently used by CFS tasks in
 * the range [0..max CPU capacity] with max CPU capacity being the CPU
 * capacity at f_max.
 *
 * The estimated CPU utilization is defined as the maximum between CPU
 * utilization and sum of the estimated utilization of the currently
 * runnable tasks on that CPU. It preserves a utilization "snapshot" of
 * previously-executed tasks, which helps better deduce how busy a CPU will
 * be when a long-sleeping task wakes up. The contribution to CPU utilization
 * of such a task would be significantly decayed at this point of time.

A sleeping task does not execute instructions. It does not sit in the runqueue. But its recent CPU history still exists as accounting data. That history fades with time, even if the task remains asleep, but it does not disappear immediately.

These two views coexisst:

the runqueue answers who can run now
utilization estimation answers how expensive tasks tend to be

This is why scheduler utilization can remain high even when few instructions are actually executing. The CPU may be mostly idle at the physical level, while the scheduler still carries memory of recent pressure.

The CPU is almost simple: it executes one thing after the other. The scheduler is a complex software construct, it’s an abstraction, and, like all abstractions, it creates new limitations.

https://docs.kernel.org/scheduler/schedutil.html ↩︎
It’s the util_avg in the sched_avg struct (linux/sched.h) ↩︎
If you are interested in loadavg, I loved reading this article by Brendan Gregg: https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html ↩︎