
    Linux Scheduling
    Brett Lee
    ====================================
    
    
    
    Process Scheduling on the CPU
    *********************************************************************
    
    
    How does scheduling work on Linux?
    ---------------------------------------
    
    Well, it depends upon both the scheduler and the scheduling policy.  A
    quick googling of Ingo Molnar and Con Kolivas will pull up a tale of
    scheduler development (and more): from the O(n) scheduler in 2.4 to the
    O(1) scheduler introduced in 2.6, and from Con's RSDL, SD and BFS
    schedulers to Ingo's CFS scheduler.
    
    Assuming we're talking about the 2.6 kernel, there are two (process)
    schedulers in the mainline kernel.  The first is the O(1) scheduler, so
    named because it takes a fixed maximum time to find the next process to
    execute.  The other, available from 2.6.23 on, is the Completely Fair
    Scheduler [CFS].
    
    
    
    Level Set
    ----------------
    
    Before we go into scheduler implementation specifics, a level set of basic
    Linux scheduling is in order.  Unless otherwise specified, this next section
    applies to the O(n) scheduler.
    
      In olden times, jobs were scheduled solely on the basis of a priority.  A
      user would start a job and it would run with an initial priority.  Some
      jobs naturally block for I/O and yield the CPU, but others might be CPU
      bound and not need to block for extended durations.  As the scheduler
      needed some way to implement time sharing (TS) and thus kick certain
      jobs off the CPU, the ideas of a time quantum and decay were introduced.

      The decay of a process is regulated by time quanta.  Within each quantum
      of time, each process gets a fixed share of the CPU ( < 100% ).  Once a
      process has used its share, it is kicked off the CPU by decay (a
      lowering of its priority), again and again, until all other runnable
      processes get their allotted CPU time for that quantum.  Then, the next
      quantum of time begins.  In this way, all jobs run on the CPU and share
      time.
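
      To make the quantum-and-decay idea concrete, here is a toy sketch in C.
      This is not kernel code; the two jobs and the one-step-per-quantum
      decay rule are invented for illustration.  Two CPU-bound jobs start at
      the same priority, and whichever job runs gets decayed, so the two end
      up alternating on the CPU:

        #include <stdio.h>

        /* Toy illustration of decay-based time sharing (not kernel code).
         * As with nice values, a lower number means a higher priority. */
        struct job { const char *name; int prio; };

        int main(void)
        {
            struct job jobs[] = { { "A", 0 }, { "B", 0 } };

            for (int quantum = 0; quantum < 6; quantum++) {
                /* Pick the highest-priority (lowest-value) runnable job. */
                struct job *run = (jobs[0].prio <= jobs[1].prio) ? &jobs[0]
                                                                 : &jobs[1];
                printf("quantum %d: %s runs (prio %d)\n",
                       quantum, run->name, run->prio);
                run->prio++;       /* decay: using the CPU lowers priority */
            }
            return 0;
        }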
    
      But, all jobs running on the system shouldn't necessarily be given equal
      time.  One way to resolve that was to assign users a default job
      priority and allow them to decrease the priority of some of their jobs,
      thereby increasing the relative priority of their other jobs.  To keep
      things simple, user jobs start with a priority of 0 (see `ulimit -a`).
      Users can then manipulate their jobs (using the nice and renice
      commands) within a range of 20 different priorities, from 0 (the
      highest) to 19 (the lowest).  Thus, all typical user jobs run within
      this 20-priority range.  However, the TS class doesn't consist of just
      these 20 levels; it actually consists of 40 (in "nice" terms the range
      would be -20 to 19).  Only root can either start a process at, or
      renice an existing process to, a negative value, thereby increasing the
      priority of the process.  Each process has a predetermined priority
      level to run at - the argument to nice/renice is an offset to that
      predetermined priority.
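
      As a minimal sketch of working with that offset from a program (the
      nice and renice commands do the same from the shell), the getpriority()
      and setpriority() calls read and write a process's nice value.
      Lowering priority works as an ordinary user; raising it (a negative
      nice value) requires root:

        #include <stdio.h>
        #include <sys/resource.h>

        int main(void)
        {
            /* Read our current nice value (-20 .. 19; 0 by default). */
            int prio = getpriority(PRIO_PROCESS, 0);
            printf("starting nice value: %d\n", prio);

            /* Drop ourselves to nice 10; any user may do this. */
            if (setpriority(PRIO_PROCESS, 0, 10) != 0)
                perror("setpriority");

            printf("new nice value: %d\n", getpriority(PRIO_PROCESS, 0));
            return 0;
        }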
    
      So, thus far we have 40 priority levels that jobs run in.  Well, some
      tasks, such as the kernel's own threads and other real time work, are
      clearly more important than those "run of the mill" jobs; thus a higher
      level of priority for those tasks is needed.  In practice, there is not
      just 1, but 100 priority levels above the 40 that are used for typical
      jobs.  Thus, the kernel is actually aware of priorities 0 - 139.  What
      else could be needed, right?
    
      Well, what else could be needed was the ability to have some tasks
      scheduled using one algorithm and a different set of tasks by another.
      To accomplish this, different scheduling policies were instituted.  When
      a job was started, it could request (using sched_setscheduler) the
      desired scheduling policy.  Jobs not written to take advantage of this
      feature could be started using a wrapper to the same effect.  By
      default, jobs are started in the time sharing scheduling class
      SCHED_OTHER; these jobs run at priorities between 100 and 139.  Jobs
      that need to run in a near real time fashion can be scheduled with the
      SCHED_RR policy.  Tasks running within the SCHED_RR policy run in the
      top 100 priorities (0 - 99) and share time using a round robin
      approach.  The other class that runs in the top 100 is SCHED_FIFO.  As
      the name implies, SCHED_FIFO is a first-in, first-out policy that does
      not participate in round robin scheduling.  In addition, tasks running
      in SCHED_FIFO are not restricted by time slices, as they are in
      SCHED_RR.  A FIFO task will be preempted by a higher priority task, but
      other than that it doesn't give up the CPU until it blocks, sleeps or
      exits.  Another scheduling policy, SCHED_BATCH, was added in 2.6.16.
      Tasks running under the SCHED_BATCH policy run in the 40-priority
      range, but always at static priority 0.  In addition, tasks running in
      SCHED_BATCH do not decay as jobs running in SCHED_OTHER do.  Thus,
      "batch" jobs are not interrupted as much and finish in a more
      dependable amount of time.
      
      A quick summary of the priority levels and scheduling classes is shown
      below:

      Priority 0-99        SCHED_FIFO (FIFO!) and SCHED_RR (real time round robin)
      Priority 100-139     SCHED_OTHER (time sharing)
      Priority 100-139     SCHED_BATCH (introduced in kernel 2.6.16)
    
      For more, see:  `man sched_setscheduler`
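
      A minimal sketch of requesting a policy with sched_setscheduler()
      follows.  The real time priority of 10 is an arbitrary example, and
      setting a real time policy normally requires root.  (The chrt(1)
      utility is one such wrapper for jobs not written this way, e.g.
      `chrt --rr 10 somejob`.)

        #include <stdio.h>
        #include <string.h>
        #include <sched.h>

        int main(void)
        {
            struct sched_param sp;

            memset(&sp, 0, sizeof(sp));
            sp.sched_priority = 10;      /* RT priority; arbitrary example */

            /* A pid of 0 means the calling process; requires root. */
            if (sched_setscheduler(0, SCHED_RR, &sp) != 0) {
                perror("sched_setscheduler");
                return 1;
            }
            printf("now running under SCHED_RR\n");
            return 0;
        }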
    
    
      SUMMARY:
      ===========

      When a user runs a TS job, it will start in the priority range 120-139.
      TS jobs can be elevated to the range 100-119, but only by root.
      Jobs in the RR class run in the priority range 0-99 and obey time quanta.
      Jobs in the FIFO class run in the priority range 0-99 for as long as
        they want (or until they block, get interrupted by an IRQ, ...).
    
    
      Scheduling classes presented have been:

        First In First Out (SCHED_FIFO)
            System Priority Range: 0 through 99

        Real Time Round Robin (SCHED_RR)
            System Priority Range: 0 through 99

        Time Sharing (SCHED_OTHER)
            User Priority (nice) Range: -20 through 19
            System Priority Range: 100 through 139

        Batch (SCHED_BATCH)
            User Priority (nice) Range: 0 through 0
            System Priority Range: 120 through 120
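
      One quick way to check these ranges on a given system is to ask the
      kernel itself.  A small sketch (note that Linux reports the real time
      range as 1 - 99, with 0 reserved for the non-real-time policies):

        #define _GNU_SOURCE              /* for SCHED_BATCH */
        #include <stdio.h>
        #include <sched.h>

        int main(void)
        {
            struct { const char *name; int policy; } p[] = {
                { "SCHED_FIFO",  SCHED_FIFO  },
                { "SCHED_RR",    SCHED_RR    },
                { "SCHED_OTHER", SCHED_OTHER },
        #ifdef SCHED_BATCH
                { "SCHED_BATCH", SCHED_BATCH },
        #endif
            };

            for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++)
                printf("%-12s  min %d  max %d\n", p[i].name,
                       sched_get_priority_min(p[i].policy),
                       sched_get_priority_max(p[i].policy));
            return 0;
        }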
    
    
    
    
    Now to our schedulers of interest:
    -------------------------------------
    
    O(n):
    
      The Linux 2.4 scheduler implemented a single runqueue for all tasks that
      were ready to run.  The runqueue was not a FIFO, but instead had to be
      scanned each time a selection was to be made.  This meant that the
      longer the runqueue was, the longer the scheduler would take to make a
      decision.  As the time to determine the next task was a function of the
      total number (n) of runnable tasks, this scheduler earned the name O(n).

      The single runqueue was good for load balancing, as any task could be
      scheduled on any CPU.  However, on SMP machines this resulted in poor
      utilization of the L2 cache, as a task that had run on CPU0 found a
      cold cache when it was next scheduled on, say, CPU2.
    
    
    O(1):
    
      One goal of the O(1) scheduler was to cut down on the time to select the
      next task to run, regardless of how many tasks were ready to run.

      Another goal was to make better use of the now prevalent SMP machines.

      The first goal was accomplished by creating multiple runqueues; one for
      each of the 140 priorities (0 - 139).  When it was time for the
      scheduler to find the next task, the scheduler simply had to look at
      the "active" runqueue for each priority, in descending order of
      priority, and select the first task that it found.

      In reality, the O(1) scheduler actually created *two* runqueues for
      each priority: an active runqueue and an expired runqueue.  When a task
      is ready to run, it is added to the end of that priority's active
      runqueue.  As is typical, each task has a time slice that determines
      how much time it is permitted to execute.  When a task on the active
      runqueue uses up its time slice, it is moved to the expired runqueue.
      When no runnable tasks remain on the active runqueues, the pointers to
      the active and expired runqueues are swapped, thus making the expired
      runqueues the active ones.
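
      The "look at each priority in descending order" step is what makes the
      constant-time claim work: the kernel keeps a bitmap of which priority
      runqueues are non-empty and uses a find-first-set operation to locate
      the best one.  Below is a toy sketch of that idea (the names here are
      invented; the real code lives in kernel/sched.c):

        #include <stdio.h>
        #include <strings.h>                 /* ffs() */

        #define NUM_PRIOS 140
        #define WORDS ((NUM_PRIOS + 31) / 32)

        static unsigned int bitmap[WORDS];   /* bit set => queue non-empty */

        static void mark_runnable(int prio)
        {
            bitmap[prio / 32] |= 1u << (prio % 32);
        }

        static void mark_empty(int prio)
        {
            bitmap[prio / 32] &= ~(1u << (prio % 32));
        }

        /* Constant work: at most WORDS (here, 5) words are examined, no
         * matter how many tasks are runnable.  A lower value means a
         * higher priority, so the first set bit is the queue to serve. */
        static int next_priority(void)
        {
            for (int w = 0; w < WORDS; w++)
                if (bitmap[w])
                    return w * 32 + ffs((int)bitmap[w]) - 1;
            return -1;                       /* nothing is runnable */
        }

        int main(void)
        {
            mark_runnable(120);              /* a nice-0 SCHED_OTHER task */
            mark_runnable(10);               /* a real time task */
            printf("serve priority %d\n", next_priority());   /* 10  */
            mark_empty(10);
            printf("serve priority %d\n", next_priority());   /* 120 */
            return 0;
        }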
    
      The second goal was met by creating not just one set of active and
      expired runqueues, but instead a set for each CPU.  As tasks are
      initially distributed evenly between CPUs, tasks maintain a natural
      affinity for one processor and thus make better use of the L2 cache.

      ** Note:  This natural affinity did not prevent jobs from migrating
                from one CPU to another if dictated by load conditions.
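
      When the natural affinity is not enough, a task can be pinned outright.
      A minimal sketch using sched_setaffinity(), pinning the caller to
      CPU 0 (the taskset(1) utility does the same from the shell):

        #define _GNU_SOURCE              /* for cpu_set_t and CPU_SET() */
        #include <stdio.h>
        #include <sched.h>

        int main(void)
        {
            cpu_set_t mask;

            CPU_ZERO(&mask);
            CPU_SET(0, &mask);           /* allow CPU 0 only */

            /* A pid of 0 means the calling process. */
            if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
                perror("sched_setaffinity");
                return 1;
            }
            printf("pinned to CPU 0\n");
            return 0;
        }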
    
    
      This excerpt from the release notes suggests, as was the case with the
      O(n), that this scheduler would support a tunable value to adjust
      between a desktop and a server install:
    
        Added a new kernel parameter: /proc/sys/kernel/sched_interactivity. This
        parameter allows you to tune the CPU scheduler's interactivity estimator. The
        interactivity estimator allows interactive processes to acquire more CPU time
        without causing CPU starvation in other processes.
    
        To configure this parameter, use:
          echo [interactivity_level] > /proc/sys/kernel/sched_interactivity
        where [interactivity_level] can be any of the following:
    
          * 2 - the interactivity estimator is fully activated.
          * 1 - provides a weaker affinity to interactive processes than 2, but
                avoids CPU starvation under certain scheduling patterns.
          * 0 - any bias or affinity towards interactive processes is disabled.
    
    
      **Note:  The doc points to:        /proc/sys/kernel/sched_interactivity
               But it seems to be here:  /proc/sys/kernel/sched_interactive
    
      Recall from above that SCHED_BATCH was added under the O(1) scheduler
      (in kernel 2.6.16).
    
      For more, see:
        Understanding the Linux Kernel, 3rd Edition
        http://www.fer.hr/_download/repository/scheduling_O(1).html
    
    
    
    Completely Fair Scheduler [CFS]:
    
      The goal of CFS is to schedule tasks more "fairly", to once again
      improve interactivity on the desktop.  Thus, tasks in SCHED_OTHER and
      SCHED_BATCH are no longer scheduled as in the previous schedulers.

      To accomplish this, there are no longer priority arrays or per-priority
      runqueues for the time sharing and batch classes.  Instead, there is a
      red-black tree of tasks waiting to run.  Each task is "graded" on how
      unfair the scheduler has been to it (meaning, how much time it has
      *not* gotten on the CPU).  When a task is deemed the one treated most
      unfairly, it gets its turn to run.
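
      Below is a toy sketch of that grading in C.  The task names, weights
      and tick loop are invented for illustration, and the kernel keeps tasks
      in a red-black tree rather than scanning an array; but the core idea
      holds: the task with the smallest "virtual runtime" is the one treated
      most unfairly, and heavier (lower nice) tasks accrue virtual runtime
      more slowly, so they receive proportionally more CPU.

        #include <stdio.h>

        struct task {
            const char   *name;
            unsigned long vruntime;   /* weighted CPU time received, in ns */
            unsigned long weight;     /* derived from the nice value */
        };

        /* The least fairly treated task has the minimum vruntime; the
         * kernel finds it at the leftmost node of its red-black tree. */
        static struct task *pick_next(struct task *t, int n)
        {
            struct task *min = &t[0];
            for (int i = 1; i < n; i++)
                if (t[i].vruntime < min->vruntime)
                    min = &t[i];
            return min;
        }

        static void account(struct task *t, unsigned long ran_ns)
        {
            /* 1024 is the weight of a nice-0 task in the kernel's tables;
             * heavier tasks accrue vruntime more slowly. */
            t->vruntime += ran_ns * 1024 / t->weight;
        }

        int main(void)
        {
            struct task tasks[] = {
                { "editor", 0, 1024 },    /* nice 0 */
                { "batch",  0,  110 },    /* roughly nice 10 */
            };

            for (int tick = 0; tick < 4; tick++) {
                struct task *next = pick_next(tasks, 2);
                printf("tick %d: run %s\n", tick, next->name);
                account(next, 1000000);   /* pretend it ran for 1 ms */
            }
            return 0;
        }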
    
      Unfairness values are updated every millisecond, so there would
      certainly be potential for lots of context switching if there were not
      some way to prevent it.  To offset the context switching, there is a
      single tunable for these classes of jobs:

             /proc/sys/kernel/sched_granularity_ns

        See: http://people.redhat.com/mingo/cfs-scheduler/sched-design-CFS.txt
    
      This can be used to tune the scheduler from 'desktop' (low latencies)
      to 'server' (good batching) workloads. The default appears to be a setting
      suitable for desktop workloads.
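
      A hedged sketch of reading the tunable from a program, assuming the
      path above exists on the running kernel (it may require
      CONFIG_SCHED_DEBUG, as noted below):

        #include <stdio.h>

        int main(void)
        {
            /* Path per the CFS design doc; it may be absent unless the
             * kernel was built with CONFIG_SCHED_DEBUG (see below). */
            const char *path = "/proc/sys/kernel/sched_granularity_ns";
            unsigned long ns;
            FILE *f = fopen(path, "r");

            if (!f) { perror(path); return 1; }
            if (fscanf(f, "%lu", &ns) == 1)
                printf("sched_granularity_ns = %lu ns\n", ns);
            fclose(f);
            return 0;
        }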
    
      Reports indicate that the kernel must be built with CONFIG_SCHED_DEBUG
      enabled in order to have the aforementioned tunable.  Why?  It doesn't
      seem like a debug feature, but a necessary tunable.  Perhaps that has
      changed as of this writing.
    
    
      You can read more on CFS here:
    
        http://oreilly.com/catalog/linuxkernel/chapter/ch10.html
        http://www.kernel.org/doc/#Process_Scheduler
        http://lwn.net/Articles/240474/
        http://www.ibm.com/developerworks/linux/library/l-completely-fair-scheduler/index.html
        http://www.ibm.com/developerworks/linux/library/l-scheduler/
        http://www.cs.unm.edu/~eschulte/data/bfs-v-cfs_groves-knockel-schulte.pdf
        http://kernelnewbies.org/Linux_2_6_23
        http://kernelnewbies.org/Documents/SchedulingInUNIXAndLinux
        http://www.xmailserver.org/linux-patches/lnxsched.html
        Books.google.com: Essential Linux Device Drivers, Sreekrishnan Venkateswaran, p. 555
    


