Process Scheduling on the CPU
How does scheduling work on Linux?
Well, it depends upon both the scheduler and the scheduling policy. A quick
googling of Ingo Molnar and Con Kolivas will pull up a tale of scheduler
development (and more), from the O(n) scheduler in 2.4 to the O(1) that was
introduced in 2.6; from Con's RSDL, SD and BFS schedulers in 2.6 to Ingo's CFS.
Assuming we're talking about the 2.6 kernel, there are two (process) schedulers that
are in the mainstream. The first is the O(1) scheduler, named that because it
has a fixed maximum time to find the next process to execute. The other one,
available from 2.6.23 on, is the Completely Fair Scheduler [CFS].
Before we go into scheduler implementation specifics, a level set of basic
Linux scheduling is in order. Unless otherwise specified, this next section
applies to the O(n) scheduler.
In olden times, jobs were scheduled solely on the basis of a priority. A
user would start a job and it would start with an initial priority. Some
jobs naturally block for I/O and yield the CPU, but others might be CPU bound
and not need to block for extended durations. As the scheduler needed some
way to implement time sharing (TS) and thus kick certain jobs off the CPU, the
idea of a time quanta and decay was introduced.
The decay of the process is regulated by time quanta. Over each quantum of
time, each process gets a fixed amount of CPU ( < 100% ). Once the amount
of CPU time is reached, the process is kicked off the CPU by decay (lowering
of its priority), again and again, until all other runnable processes get
their allotted CPU time for that quantum of time. Then, the next quantum
of time begins. In this way, jobs all run on the CPU and can share time.
But, all jobs running on the system shouldn't necessarily be given equal time.
One way to resolve that was by assigning users a default job priority and
allowing them to decrease the priority of some of their jobs, thereby
increasing the priority of their other jobs. To keep things simple, user
jobs start with a priority of 0 (see `ulimit -a`). Users can then
manipulate their jobs (using the nice and renice commands) within
a range of 20 different priorities, ranging from 0 (the highest) to 19 (the
lowest). Thus, all typical user jobs run within this 20 priority range.
However, the TS class doesn't just consist of these 20 priority levels; it
actually consists of 40 (in "nice" terms the range would be -20 to 19).
Only root can either start a process at, or renice an existing process to, a
lower value, thereby increasing the priority of the process. Each process has a
predetermined priority level to run at - the argument to nice/renice is an
offset to that predetermined priority.
So, thus far we have 40 priority levels that jobs run in. Well, some tasks,
like the kernel scheduling algorithm, are clearly more important than those
"run of the mill" jobs; thus a higher level of priority for those jobs is
needed. In practice, there are not just one, but 100 priority levels above the
40 that are used for typical jobs. Thus, the kernel is actually aware of
priorities 0 - 139. What else could be needed, right?
Well, what else could be needed was the ability to have some tasks scheduled
using one algorithm and a different set of tasks by another algorithm. To
accomplish this, different scheduling policies were instituted. When a job
was started, it could request (using sched_setscheduler) the desired
scheduling policy. Jobs not written to take advantage of this feature could
be started using a wrapper to have the same effect. By default, jobs are
started in the time sharing scheduling class SCHED_OTHER; these jobs run at
the priorities between 100 and 139. Jobs that needed to be run in a near
real time fashion could be scheduled with the SCHED_RR policy. Tasks running
within the SCHED_RR policy run in the top 100 and share time using a round
robin approach. The other class that runs in the top 100 is SCHED_FIFO.
As the name implies, SCHED_FIFO is a FIFO policy that does not participate
in round robin scheduling. In addition, tasks running in SCHED_FIFO are not
restricted by time slices, as they are in SCHED_RR. A FIFO task will be
preempted by a higher priority task, but other than that it doesn't give up
the CPU until it blocks, sleeps or exits. Another scheduling policy was
added in 2.6.16, SCHED_BATCH. Tasks running in SCHED_BATCH policy run in
the 40-range of priorities; but they always run with a set priority - 0. In
addition, tasks running in SCHED_BATCH do not decay as do jobs running
in SCHED_OTHER. Thus, "batch" jobs are not interrupted as much and finish
in a more dependable amount of time.
A quick summary of the priority levels and scheduling classes is shown below:
Priority 0-99 SCHED_FIFO (FIFO!) and SCHED_RR (real time round robin)
Priority 100-139 SCHED_OTHER (time sharing)
Priority 100-139 SCHED_BATCH *** - introduced in kernel 2.6.16
For more, see: `man sched_setscheduler`
When a user runs a TS job, it will start in the priority range 120-139.
TS jobs can be elevated to the range 100-119, but only by root.
Jobs in RR class run in the priority range 0-99 and obey time quanta.
Jobs in FIFO class run in the priority range 0-99 for as long as they want,
( or until they block, get interrupted by an IRQ, ... )
Scheduling classes presented have been:
First In First Out (SCHED_FIFO)
System Priority Range: 0 through 99
Real Time Round Robin (SCHED_RR)
System Priority Range: 0 through 99
Time Sharing (SCHED_OTHER)
User Priority (nice) Range: -20 through 19
System Priority Range: 100 through 139
Batch (SCHED_BATCH)
User Priority (nice) Range: 0 through 0
System Priority Range: 120 through 120
Now to our schedulers of interest:
The Linux 2.4 scheduler implemented a single runqueue for all tasks that
were ready to run. The runqueue was not a FIFO, but instead needed to be
parsed each time a selection was to be made. This meant that the longer
the runqueue was, the longer the scheduler would take to make a decision.
As the time to determine the next task was a function of the total number (n)
of tasks, this scheduler earned the name O(n).
The single runqueue was good for load balancing, as any task could be
scheduled on any CPU. However, on SMP machines this resulted in poor
utilization of the L2 cache, as tasks that ran on CPU0 used a different
cache when they ran on CPU2.
One goal of the O(1) scheduler was to cut down on the time to select the
next task to run, regardless of how many tasks were ready to run.
Another goal was to make better use of the now prevalent SMP machines.
The first goal was accomplished by creating multiple runqueues; one for each
of the 140 priorities (0 - 139). When it was time for the scheduler to find
the next task, the scheduler simply had to look at the "active" runqueue for
each priority, in descending order, and select the first task that it found.
In reality, the O(1) actually created *two* runqueues for each priority; an
active runqueue and an expired runqueue. When a task is ready to run, it is
added to the end of that priority's runqueue. As is typical, each task has
a time slice that determines how much time it is permitted to execute. When
a task on the active runqueue uses all of its time slice, it is moved to the
expired runqueue. When no tasks remain on any of the active runqueues,
the pointers to the active and expired runqueues are swapped,
thus making the expired runqueue the active one.
The second goal was accomplished by creating not just one set of active and
expired runqueues, but instead a set for each CPU. As tasks are initially
distributed evenly between CPUs, tasks would maintain a natural affinity for
one processor and thus better utilize the L2 cache.
** Note: This natural affinity did not prevent jobs from migrating from one
CPU to another if dictated by load conditions.
This excerpt from the release notes suggests, as was the case with the O(n),
that this scheduler would support a tuneable value to adjust between a
desktop and a server install:
Added a new kernel parameter: /proc/sys/kernel/sched_interactivity. This
parameter allows you to tune the CPU scheduler's interactivity estimator. The
interactivity estimator allows interactive processes to acquire more CPU time
without causing CPU starvation in other processes.
To configure this parameter, use:
echo [interactivity_level] > /proc/sys/kernel/sched_interactivity
where [interactivity_level] can be any of the following:
* 2 - interactivity estimator is fully activated.
* 1 - provides a weaker affinity to interactive processes than 2, but avoids
CPU starvation under certain scheduling patterns.
* 0 - any bias or affinity towards interactive processes is disabled.
**Note: The doc points to: /proc/sys/kernel/sched_interactivity
But it seems to be here: /proc/sys/kernel/sched_interactive
Recall from above that SCHED_BATCH was added in the O(1).
For more, see:
Understanding the Linux Kernel, version 3
Completely Fair Scheduler [CFS]:
The goal of CFS is to schedule tasks more "fairly," once again
improving the interactivity on desktops. Thus, tasks in SCHED_OTHER
and SCHED_BATCH no longer are scheduled as in the previous schedulers.
In order to accomplish this, there are no longer priorities or run queues
in the time sharing and batch classes. Instead, there is a red-black tree
of tasks waiting to run. Each task is "graded" on how unfair the scheduler
has been to it (meaning, how much time it has *not* gotten on the CPU).
When a task has been deemed the most unfairly treated, that task gets its
turn to run.
Unfairness values are updated every millisecond, so there certainly would
be the potential for lots of context switching if there were not some way
to prevent it. To offset the context switching, there is a single tuneable
for these classes of jobs:
This can be used to tune the scheduler from 'desktop' (low latencies)
to 'server' (good batching) workloads. The default appears to be a setting
suitable for desktop workloads.
Reports indicate that the kernel must be built with CONFIG_SCHED_DEBUG
enabled in order to have the aforementioned tunable. Why? It doesn't
seem like a debug feature but a necessary tuneable. Perhaps that has
changed as of this writing.
You can read more on CFS here:
Books.google.com: Linux Device Drivers, Sreekrishnan Venkateswaran, p.555
This site contains many of my notes from research into different aspects of
the Linux kernel as well as some of the software provided by GNU and others.
Though these notes are not fully comprehensive or even completely accurate,
they are part of my on-going attempt to better understand this complex field.
And, they are yours to use.
Should you wish to report any errors or suggestions, please let me know.
Should you wish to make a donation for anything you may have learned here,
please direct that donation to the
with my sincere thanks.
The code for this site, which is just a few CGI scripts, may be found on GitHub
"We left all that stuff out. If there's an error, we have this
routine called 'panic', and when it's called, the machine crashes,
and you holler down the hall, 'Hey, reboot it.'"
- Dennis Ritchie on Unix (vs Multics)