Everything Penguin

Focusing on Linux-based Operating Systems
htDig Search:

Operating Systems
  • /pub/OS/Linux

  • Storage
  • File Systems
  • HPC
  • /pub/Storage

  • Networking
  • /pub/Networking

  • Network Services
  • /pub/NetworkServices

  • Security
  • /pub/Security
  • Keytool/OpenSSL

  • Clustering
  • HA
  • DRM

  • Development
  • Design
  • C/C++
  • Java
  • Perl
  • Python
  • Shell
  • Web / J2EE

  • Not Linux ?
  • BSD
  • HP-UX
  • Mac
  • Solaris
  • VM
  • Windows
  • /pub/OS

  • Other
  • /pub
  • /pub/3rdParty
  •  Parent Directory

    Congestion Avoidance on the Internet (IPv4 related)
    Brett Lee
    ================================================================================
    
    
    
    History:
    ---------------
    
    October '86 - first of a series of "Internet meltdowns" happened
    (technically called "congestion collapse").  Everyone trying
    to send got out of control, so needed a method to regulate traffic
    on the Internet.
    
    The first (of many) solution was by Jacobson/Karels.  They modified
    BSD 4.3 with:  
    i) round-trip-time variance estimation,
    ii) exponential retransmit timer backoff, 
    iii) slow start, 
    iv) more aggressive receiver ACK policy, 
    v) dynamic window sizing on congestion, 
    vi) retransmit backoff, and 
    vii) fast retransmits.
    
    And so it began...
    
    
    
    RFC 2001
    ---------------
    
    RFC 2001 discusses some methods for congestion avoidance that are
    implemented by not well documented.  Mostly it discusses "endpoint"
    congestion avoidance mechanisms of edge routers and has references
    to the historical earlier works.
    
     - Slow start algorithm:
       Receiver has a advertised window (how much to send it w/o ACK)
       Sender has a congestion window (cwnd), initially set to one segment.
       With successful transmits (ACK received), cwnd doubles.  A
       second window, slow start threshold, remains at the previous value.
    
     - Congestion avoidance:
       Doubling is great for awhile, but it quickly grows too much
       Congestion avoidance introduces another variable, "sshthresh", which
         acts as a slow start threshold (where the doubling stops and 
         linear increases begin).  Once sshthresh is reached OR a segment
         is NOT acknowledged, increases in cwnd are linear (1 segment).
    
     - Fast retransmit:
       Rather than waiting for the retransmit timer to expire, a TCP 
       implementation can retransmit after 3 or more duplicate ACK's
       (ACK's with the next expected sequence number set to the "lost" segment)
       are received in a row.  In this case, TCP thinks the packet really is 
       lost and sends another right away (rather than waiting for the segment
       timer to expire).
    
     - Fast recovery:
       A way to get transmission going at a high rate realizing that
       there are still packets flowing but for some reason one got lost.
       In slow start, increases are exponential (doubling).  Once a segment
       is not acknowledged, the cwnd resets NOT to one segment but to the
       current value of the slow start threshold.  
    
    
    
    RFC 2309
    ---------------
    
    Up to now we see that TCP typically increases the windowing until 
    congestion is detected, where dropped packets indicate congestion.
    While this is OK with some protocols, it is NOT ok with others.
    Need a method to implement Random Early Detection (RED).
    
    Routers have a queue of frames to be processed.  Typically, congestion
    was managed by setting a maximal queue size.  If the queue grew too
    large, packets would be discarded.  Rather than simply dropping
    packets once a threshold was passed, it would be nice to throttle
    down the source(s) of the incoming packets.  One method to 
    implement RED is with Active Queue Management (AQM).
    
    * Enter:  http://www.faqs.org/rfcs/rfc2309.html
      Recommendations on Queue Management and Congestion Avoidance
    
    2309 makes some suggestions on the topic, recommending that Internet
    routers are built with some sort of RED that better manages queue
    lengths so that packets do not have to be dropped but instead the
    flow of packets is managed better.
    
    
    
    RFC 2873/2475/2474/1349/793/791
    --------------------------------------
    
    As mentioned in 2309, there are (at least) two classes of router
    algorithms: "queue management" and "scheduling".  Queue management
    relates to managing the length of the queue by dropping packets, whereas
    scheduling relates to the order in which packets in the queue are 
    delivered.  Ignoring queue management (what we really want to talk
    about) for a moment and focusing instead on scheduling (which goes
    hand in hand), we first need to jump into the Way Back machine to
    circa 1981 and the precedence field in the IP header Type of Service.
    
    
    Per RFC 791:
    
      September 1981                                       Internet Protocol
                                                               Specification
      Type of Service:  8 bits
    
        The Type of Service provides an indication of the abstract
        parameters of the quality of service desired.  These parameters are
        to be used to guide the selection of the actual service parameters
        when transmitting a datagram through a particular network.  Several
        networks offer service precedence, which somehow treats high
        precedence traffic as more important than other traffic (generally
        by accepting only traffic above a certain precedence at time of high
        load).  The major choice is a three way tradeoff between low-delay,
        high-reliability, and high-throughput.
    
          Bits 0-2:  Precedence.
          Bit    3:  0 = Normal Delay,       1 = Low Delay.
          Bits   4:  0 = Normal Throughput,  1 = High Throughput.
          Bits   5:  0 = Normal Reliability, 1 = High Reliability.
          Bit  6-7:  Reserved for Future Use.
    
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |                 |     |     |     |     |     |
          |   PRECEDENCE    |  D  |  T  |  R  |  0  |  0  |
          |                 |     |     |     |     |     |
          +-----+-----+-----+-----+-----+-----+-----+-----+
    
            Precedence
    
              111 - Network Control
              110 - Internetwork Control
              101 - CRITIC/ECP
              100 - Flash Override
              011 - Flash
              010 - Immediate
              001 - Priority
              000 - Routine
    
        The use of the Delay, Throughput, and Reliability indications may
        increase the cost (in some sense) of the service.  In many networks
        better performance for one of these parameters is coupled with worse
        performance on another.  Except for very unusual cases at most two
        of these three indications should be set.
    
        The type of service is used to specify the treatment of the datagram
        during its transmission through the internet system.  Example
        mappings of the internet type of service to the actual service
        provided on networks such as AUTODIN II, ARPANET, SATNET, and PRNET
        is given in "Service Mappings".
    
        The Network Control precedence designation is intended to be used
        within a network only.  The actual use and control of that
        designation is up to each network. The Internetwork Control
        designation is intended for use by gateway control originators only.
        If the actual use of these precedence designations is of concern to
        a particular network, it is the responsibility of that network to
        control the access to, and use of, those precedence designations.
    
    
    
    
    Similarly, per RFC 793, TCP also has a precedence value.
    
      September 1981                           Transmission Control Protocol
                                                    Functional Specification
    
      There are also some variables used frequently in the discussion that
      take their values from the fields of the current segment.
    
        Current Segment Variables
    
          SEG.SEQ - segment sequence number
          SEG.ACK - segment acknowledgment number
          SEG.LEN - segment length
          SEG.WND - segment window
          SEG.UP  - segment urgent pointer
          SEG.PRC - segment precedence value
    
    
    
    The TCP precedence value is later obsoleted in both SYN and SYN/ACK 
    packets by RFC 2873 and interop with DiffServ.
    
      RFC 2873         TCP and the IPv4 Precedence Field           June 2000
    
      Proposed Modification to TCP
    
      The proposed modification to TCP is that TCP must ignore the
      precedence of all received segments. More specifically:
    
      (1) In TCP's synchronization process, the TCP modules at both ends
      must ignore the precedence fields of the SYN and SYN ACK packets. The
      TCP connection will be established if all the conditions specified by
      RFC 793 are satisfied except the precedence of the connection.
    
      (2) After a connection is established, each end sends segments with
      its desired precedence. The precedence picked by one end of the TCP
      connection may be the same or may be different from the precedence
      picked by the other end (because precedence is ignored during
      connection setup time). The precedence fields may be changed by the
      intermediate nodes too. In either case, the precedence of the
      received packets will be ignored by the other end. The TCP connection
      will not be reset in either case.
    
    
    
    
    Anyway, discussed earlier in RFC 791 was the precedence bits of the TOS
    field. In RFC 1349, the IP remaining four bits in the TOS field were
    clarified and made mandatory:
    
       RFC 1349                 Type of Service                    July 1992
    
       Specification of the TOS Field
    
       As was stated just above, this memo redefines the TOS field as a four
       bit field.  Also contrary to RFC-791, this memo defines the TOS field
       as a single enumerated value rather than as a set of bits (where each
       bit has its own meaning).  This memo defines the semantics of the
       following TOS field values (expressed as binary numbers):
    
                        1000   --   minimize delay
                        0100   --   maximize throughput
                        0010   --   maximize reliability
                        0001   --   minimize monetary cost
                        0000   --   normal service
    
    
    
    So now we have all 8 bits being used to better prioritize IP packets over
    the Internet (last bit must be Zero).  And what happens next, a better way
    to prioritize IP packets over the Internet:  Diffserv !!!
    
    OK, so if you're like me, you're starting to think that this one little
    byte is gettina a lot of attention.  You're right, it keeps evolving over
    the years.  Lets take a quick review so that Diffserv can be put into
    perspective
    
    
       RFC 791                                                September 1981
    
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |   PRECEDENCE    |  D  |  T  |  R  |  -  |  -  |
          +-----+-----+-----+-----+-----+-----+-----+-----+
    
          Bits 0-2:  Precedence.
          Bit    3:  0 = Normal Delay,       1 = Low Delay.
          Bit    4:  0 = Normal Throughput,  1 = High Throughput.
          Bit    5:  0 = Normal Reliability, 1 = High Reliability.
          Bits 6-7:  Reserved for Future Use.
    
    
    
       RFC 1349                 Type of Service                    July 1992
    
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |   PRECEDENCE    |  D  |  T  |  R  |  C  |  0  |
          +-----+-----+-----+-----+-----+-----+-----+-----+
    
          Bits 0-2:  Precedence.
          Bit    3:  Minimize delay
          Bit    4:  Maximize throughput
          Bit    5:  Maximize reliability
          Bit    5:  Minimize monetary cost
          Bit    7:  Must be zero (MBZ)
    
    
    
    The next two, Assured Forwarding and Expedited Forwarding are Diffserv.  As
    you can see, the Differentiated Services code point (DSCP) uses the first six
    bits.  The last two remain unused:
    
    
    
    RFC 2597              Assured Forwarding PHB Group             June 1999
    RFC 2598              An Expedited Forwarding PHB              June 1999
    
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |         DSCP ( AF or EF )         |  -  |  -  |
          +-----+-----+-----+-----+-----+-----+-----+-----+
    
    
    
         AF defines four (4) different classes of traffic and three (3) 
         different drop precedences for each class:
    
                            Class 1    Class 2    Class 3    Class 4
                          +----------+----------+----------+----------+
         Low Drop Prec    |  001010  |  010010  |  011010  |  100010  |
         Medium Drop Prec |  001100  |  010100  |  011100  |  100100  |
         High Drop Prec   |  001110  |  010110  |  011110  |  100110  |
                          +----------+----------+----------+----------+
    
    
    
    
         In EF, it is much simpler.  If EF is set, it equals 101110.  There is 
         only one value.  Packets with the 46 (101110) codepoint are guaranteed
         preferential treatment by all diffserv routers en route to the packets'
         destination (which differs from AF, which can vary from Diffserv cloud
         to Diffserv cloud).
    
         This value really indicates how long to keep the packet in the queue.  
         Looking back to the Application Precedence (first 3 bits in RFC 791 - 
         remember way back then?), the highest value an application could be set 
         was 5 (101).  As it turns out, the EF value of "101110" is backward 
         compatible with TOS and the IP precedence values.  In fact, AF is also
         as its values do not surpass 5 (101).
         
    
      * Definitions of Differentiated Services (DiffServ) - RFC 2474.
      * An Architecture for Differentiated Services (DiffServ) - RFC 2475.
      * Differentiated Services - Assured Forwarding (AF) - RFC 2597.
      * Differentiated Services - Expedited Forwarding (EF) - RFC 2598.
        http://www.faqs.org/rfcs/rfc2873.html  <-- repeated for good measure
    
    
    
    
    RFC 2481
    ---------------
    
    Remember earlier when we were talking about queue management and
    went on that detour through IP datagram prioritization and services?
    Well, were finally back and talking about Advanced Queue Management (AQM).
    
    Finally...
    
    * Enter:  http://www.faqs.org/rfcs/rfc2481.html
      A proposal to add Explicit Congestion Notification (ECN) to IP
    
       RFC 2481 introduced AQM as a proposal to add experimental Explicit
       Congestion Notification (ECN) between routers when they detect congestion.
       The idea being that instead of simply dropping packets, routers can instead
       set the Congestion Experienced (CE) codepoint in the IP header of packets
       from ECN-capable transports.
    
       Bits 6 and 7 in the IPv4 TOS octet are designated as the ECN field.
       Bit 6 is designated as the ECT bit, and bit 7 is designated as the CE
       bit.  The IPv4 TOS octet corresponds to the Traffic Class octet in
       IPv6.  That means, continuing our drawing from above, that we now
       use the last two unused bits.
    
    
             0     1     2     3     4     5     6     7
          +-----+-----+-----+-----+-----+-----+-----+-----+
          |         DSCP ( AF or EF )         |    ECN    |
          +-----+-----+-----+-----+-----+-----+-----+-----+
    
    
       The ECN-Capable Transport (ECT) bit would be set by the data sender to 
       indicate that the end-points of the transport protocol
       are ECN-capable.  The CE bit would be set by the router to indicate
       congestion to the end nodes.  Routers that have a packet arriving at
       a full queue would drop the packet, just as they do now.
    
    
    
    
    RFC 3168
    ---------------
    
    The ECN "experiment" is a success.  Time to put in on the "standards track".
    
    * Enter:  http://www.faqs.org/rfcs/rfc3168.html
      Addition of Explicit Congestion Notification (ECN) to IP.
    
    This Obsoletes RFC 2309 - protocol is no longer experimental:
    
       This document obsoletes RFC 2481, "A Proposal to add Explicit
       Congestion Notification (ECN) to IP", which defined ECN as an
       Experimental Protocol for the Internet Community.  This document also
       updates RFC 2474, "Definition of the Differentiated Services Field
       (DS Field) in the IPv4 and IPv6 Headers", in defining the ECN field
       in the IP header, RFC 2401, "Security Architecture for the Internet
       Protocol" to change the handling of IPv4 TOS Byte and IPv6 Traffic
       Class Octet in tunnel mode header construction to be compatible with
       the use of ECN, and RFC 793, "Transmission Control Protocol", in
       defining two new flags in the TCP header.
    
    Well, we just couldn't get away with out changing things one more time.
    Here are the latest bit values for the two ECN bits.
    
          +-----+-----+
          | ECN FIELD |
          +-----+-----+
            ECT   CE         [Obsolete] RFC 2481 names for the ECN bits.
             0     0         Not-ECT
             0     1         ECT(1)
             1     0         ECT(0)
             1     1         CE
    
    
    RFC ####
    ---------------
    
    I really don't think this is the end of things... :)
    

    Other Sites

    RFC's
  • FAQ's
  • IETF
  • RFC Sourcebook

  • Linux
  • Linux - Intro
  • Linux Kernel
  • Linux Kernel (LKML)
  • Bash - Intro
  • Bash - Advanced
  • Command Line
  • System Administration
  • Network Administration
  • Man Pages (& more)
  • More Guides
  • Red Hat Manuals
  • HOWTO's

  • Reference/Tutorials
  • C++ @ cppreference
  • C++ @ cplusplus
  • CSS @ echoecho
  • DNS @ Zytrax
  • HTML @ W3 Schools
  • Java @ Sun
  • LDAP @ Zytrax
  • Linux @ YoLinux
  • MySQL
  • NetFilter
  • Network Protocols
  • OpenLDAP
  • Quagga
  • Samba
  • Unix Programming



  • This site contains many of my notes from research into different aspects of the Linux kernel as well as some of the software provided by GNU and others. Thouugh these notes are not fully comprehensive or even completetly accurate, they are part of my on-going attempt to better understand this complex field. And, they are your to use.

    Should you wish to report any errors or suggestions, please let me know.

    Should you wish to make a donation for anything you may have learned here, please direct that donation to the ASPCA, with my sincere thanks.

    Brett Lee
    Everything Penguin

    The code for this site, which is just a few CGI scripts, may be found on GitHub (https://github.com/userbrett/cgindex).

    For both data encryption and password protection, try Personal Data Security (https://www.trustpds.com).


    "We left all that stuff out. If there's an error, we have this routine called 'panic', and when its called, the machine crashes, and you holler down the hall, 'Hey, reboot it.'"

        - Dennis Ritchie on Unix (vs Multics)


    Google
    [ Powered by Red Hat Linux ] [ Powered by Apache Server] [ Powered by MySQL ]

    [ Statistics by AWStats ]