Identifying Modified Explicit Congestion Notification (ECN) Semantics
for Ultra-Low Queuing Delay

Transport Area / Transport Services (tsv) Working Group
Internet-Draft (I-D)

K. De Schepper
Nokia Bell Labs
Antwerp, Belgium
koen.de_schepper@nokia.com
https://www.bell-labs.com/usr/koen.de_schepper

B. Briscoe
CableLabs
UK
ietf@bobbriscoe.net
http://bobbriscoe.net/
This specification defines the identifier to be used on IP packets
for a new network service called low latency, low loss and scalable
throughput (L4S). It is similar to the original (or 'Classic') Explicit
Congestion Notification (ECN). 'Classic' ECN marking was required to be
equivalent to a drop, both when applied in the network and when
responded to by a transport. Unlike 'Classic' ECN marking, for packets
carrying the L4S identifier, the network applies marking more
immediately and more aggressively than drop, and the transport response
to each mark is reduced and smoothed relative to that for drop. The two
changes counterbalance each other so that the throughput of an L4S flow
will be roughly the same as a 'Classic' flow under the same conditions.
However, the much more frequent control signals and the finer responses
to them result in ultra-low queuing delay without compromising link
utilization, even during high load. Examples of new active queue
management (AQM) marking algorithms and examples of new transports
(whether TCP-like or real-time) are specified separately. The new L4S
identifier is the key piece that enables them to interwork and
distinguishes them from 'Classic' traffic. It gives an incremental
migration path so that existing 'Classic' TCP traffic will be no worse
off, but it can be prevented from degrading the ultra-low delay and loss
of the new scalable transports.
This specification defines the identifier to be used on IP packets
for a new network service called low latency, low loss and scalable
throughput (L4S). It is similar to the original (or 'Classic') Explicit
Congestion Notification (ECN). 'Classic' ECN marking was required to be
equivalent to a drop, both when applied in the network and when
responded to by a transport. Unlike 'Classic' ECN marking, the network
applies L4S marking more immediately and more aggressively than drop,
and the transport response to each mark is reduced and smoothed relative
to that for drop. The two changes counterbalance each other so that the
bit-rate of an L4S flow will be roughly the same as a 'Classic' flow
under the same conditions. Nonetheless, the much more frequent control
signals and the finer responses to them result in ultra-low queuing
delay without compromising link utilization, even during high load.
An example of an active queue management (AQM) marking algorithm that
enables the L4S service is the DualQ Coupled AQM, defined in a
complementary specification. An example of a scalable transport that
would enable the L4S service is Data Centre TCP (DCTCP), which until
now has been applicable solely to controlled environments like data
centres, because it is too aggressive to co-exist with existing TCP.
However, AQMs like DualQ Coupled enable scalable transports like
DCTCP to co-exist with existing traffic, each getting roughly the
same flow rate when they compete under similar conditions. Note that
DCTCP will still not be safe to deploy on the Internet until it
satisfies the safety requirements listed later in this specification.
The new L4S identifier is the key piece that enables these two parts
to interwork and distinguishes them from 'Classic' traffic. It gives an
incremental migration path so that existing 'Classic' TCP traffic will
be no worse off, but it can be prevented from degrading the ultra-low
delay and loss of the new scalable transports. The performance
improvement is so great that it is hoped it will motivate initial
deployment of the separate parts of this system.
Latency is becoming the critical performance factor for many (perhaps
most) applications on the public Internet, e.g. Web, voice,
conversational video, gaming, finance apps, remote desktop and
cloud-based applications. In the developed world, further increases in
access network bit-rate offer diminishing returns, whereas latency is
still a multi-faceted problem. In the last decade or so, much has been
done to reduce propagation time by placing caches or servers closer to
users. However, queuing remains a major component of latency.
The Diffserv architecture provides Expedited Forwarding (EF)
[RFC3246], so that low-latency traffic can jump the queue of other
traffic. However, on access links dedicated to individual sites
(homes, small enterprises or mobile devices), often all traffic at
any one time will be latency-sensitive, so Diffserv is of little use.
Instead, we need to remove the causes of any unnecessary delay.
The bufferbloat project has shown that excessively-large buffering
('bufferbloat') has been introducing significantly more delay than the
underlying propagation time. These delays appear only intermittently:
only when a capacity-seeking (e.g. TCP) flow lasts long enough for
the queue to fill the buffer, making every packet in other flows
sharing the buffer sit through the queue.
Active queue management (AQM) was originally developed to solve
this problem (and others). Unlike Diffserv, which gives low latency to
some traffic at the expense of others, AQM controls latency for all traffic in a class. In general, AQMs
introduce an increasing level of discard from the buffer the longer
the queue persists above a shallow threshold. This gives sufficient
signals to capacity-seeking (aka 'greedy') flows to keep the buffer
empty for its intended purpose: absorbing bursts. However,
RED and other algorithms from the 1990s
were sensitive to their configuration and hard to set correctly. So,
AQM was not widely deployed.
More recent state-of-the-art AQMs, e.g. fq_CoDel, PIE and Adaptive
RED, are easier to configure, because they define the queuing
threshold in time not bytes, so it is invariant for different
link rates. However, no matter how good the AQM, the sawtoothing rate
of TCP will either cause queuing delay to vary or cause the link to be
under-utilized. Even with a perfectly tuned AQM, the additional
queuing delay will be of the same order as the underlying
speed-of-light delay across the network. Flow-queuing can isolate one
flow from another, but it cannot isolate a TCP flow from the delay
variations it inflicts on itself, and it has other problems: it
overrides the flow-rate decisions of variable-rate video
applications; it does not recognise the flows within IPsec VPN
tunnels; and it is relatively expensive to implement.
Latency is not our only concern: It was known when TCP was first
developed that it would not scale to high bandwidth-delay products.
Given that regular broadband bit-rates over WAN distances are already
beyond the scaling range of 'Classic' TCP Reno, the 'less unscalable'
Cubic and Compound variants of TCP have been successfully deployed.
However, these are now approaching their
scaling limits. Unfortunately, fully scalable TCPs such as DCTCP cause 'Classic' TCP to starve itself, which is why
they have been confined to private data centres or research testbeds
(until now).
It turns out that a TCP algorithm like DCTCP that solves TCP's
scalability problem also solves the latency problem, because the finer
sawteeth cause very little queuing delay. A supporting paper gives the full explanation of why the design solves
both the latency and the scaling problems, both in plain English and
in more precise mathematical form. The explanation is summarised
without the maths in the L4S architecture document .
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
. In this document, these words will appear
with that interpretation only when in ALL CAPS. Lower case uses of
these words are not to be interpreted as carrying RFC-2119
significance.
Classic service:  The 'Classic' service is intended for all the
   behaviours that currently co-exist with TCP Reno (e.g. TCP Cubic,
   Compound, SCTP, etc.).

L4S service:  The 'L4S' service is intended for traffic from scalable
   TCP algorithms such as Data Centre TCP. But it is also more
   general: it will allow a set of congestion controls with similar
   scaling properties to DCTCP (e.g. Relentless) to evolve.

   Both Classic and L4S services can cope with a proportion of
   unresponsive or less-responsive traffic as well (e.g. DNS, VoIP,
   etc.).

Classic ECN:  The original Explicit Congestion Notification (ECN)
   protocol [RFC3168].
The new L4S identifier defined in this specification is applicable
for IPv4 and IPv6 packets (as for classic ECN [RFC3168]). It is
applicable for the unicast, multicast and anycast forwarding modes.
It is an orthogonal packet classification to Differentiated Services
(Diffserv [RFC2474]); therefore it can be applied to any packet in
any Diffserv traffic class.
However, as with classic ECN, any particular forwarding node might not
implement an active queue management algorithm in all its Diffserv
queues.
This document is intended for experimental status, so it does not
update any standards track RFCs. Therefore it depends on a companion
standards-track update, which:

   o  updates the ECN proposed standard [RFC3168] (in certain
      specified cases including the present document) to relax the
      requirement that an ECN mark must be equivalent to a drop, both
      when applied by the network, and when responded to by the
      sender;

   o  makes the case for changing the status of the experimental ECN
      nonce [RFC3540] to historic;

   o  makes consequent updates to the following proposed standard
      RFCs to reflect the above two bullets:

      *  ECN for RTP [RFC6679];

      *  the congestion control specifications of various DCCP
         congestion control identifier (CCID) profiles [RFC4341],
         [RFC4342], [RFC5622].
This subsection briefly records the process that led to a consensus
choice of L4S identifier, selected from all the alternatives
discussed in Appendix B.
Ideally, the identifier for packets using the Low Latency, Low Loss,
Scalable throughput (L4S) service ought to meet the following
requirements:

   o  it SHOULD survive end-to-end between source and destination
      applications: across the boundary between host and network,
      between interconnected networks, and through middleboxes;

   o  it SHOULD be common to IPv4 and IPv6 and transport-agnostic;

   o  it SHOULD be incrementally deployable;

   o  it SHOULD enable an AQM to classify packets encapsulated by
      outer IP or lower-layer headers;

   o  it SHOULD consume minimal extra codepoints;

   o  it SHOULD NOT lead to some packets of a transport-layer flow
      being served by a different queue from others.
Whether the identifier would be recoverable if the experiment
failed is a factor that could be taken into account. However, this has
not been made a requirement, because that would favour schemes that
would be easier to fail, rather than those more likely to succeed.
It is recognised that the chosen identifier is unlikely to satisfy
all these requirements, particularly given the limited space left in
the IP header. Therefore a compromise will be necessary, which is why
all the requirements are expressed with the word 'SHOULD' not 'MUST'.
Appendix B discusses the pros and cons of the compromises made in
various competing identification schemes against the above
requirements.
On the basis of this analysis, "ECT(1) and CE codepoints" is the
best compromise. Therefore this scheme is defined in detail in the
following sections, while Appendix B records the rationale for this
decision.
The L4S treatment is an alternative packet-marking treatment to the classic ECN treatment [RFC3168]. Like classic ECN, it identifies both network and
host behaviour: it identifies the marking treatment that network nodes
are expected to apply to L4S packets, and it identifies packets that
have been sent from hosts that are expected to comply with a broad
type of behaviour.
For a packet to receive L4S treatment as it is forwarded, the
sender MUST set the ECN field in the IP header (v4 or v6) to the
ECT(1) codepoint.
A network node that implements the L4S service MUST classify
arriving ECT(1) packets for L4S treatment and it SHOULD classify
arriving CE packets for L4S treatment as well. A possible exception
to this latter rule is described below, where flow-aware
classification of CE packets is discussed.
An L4S AQM treatment follows similar codepoint transition rules to
those in RFC 3168. Specifically, the ECT(1) codepoint MUST NOT be
changed to any other codepoint than CE, and CE MUST NOT be changed to
any other codepoint. An ECT(1) packet is classified as ECN-capable
and, if congestion increases, an L4S AQM algorithm will mark the ECN
field as CE for an increasing proportion of packets, otherwise
forwarding packets unchanged as ECT(1). Necessary conditions for an
L4S marking treatment are defined later in this specification.
Under persistent overload an L4S marking treatment SHOULD turn off
ECN marking, using drop as a congestion signal until the overload
episode has subsided, as recommended for all AQMs in [RFC7567]
(Section 4.2.1), which follows similar advice in RFC 3168 (Section
7).
The L4S treatment is the default for ECT(1) packets in all Diffserv
classes.
For backward compatibility in uncontrolled environments, a network
node that implements the L4S treatment MUST also implement a classic
AQM treatment. It MUST classify arriving ECT(0) and Not-ECT packets
for treatment by the Classic AQM. Classic treatment means that the
AQM will mark ECT(0) packets under the same conditions as it would
drop Not-ECT packets [RFC3168].
For a host to send packets with the L4S identifier (ECT(1)), it
SHOULD implement a congestion control behaviour that ensures the
flow rate is inversely proportional to the proportion of bytes in
packets marked with the CE codepoint. This is termed a scalable
congestion control, because the number of control signals (ECN
marks) per round trip remains roughly constant for any flow rate. As
with all transport behaviours, a detailed specification will need to
be defined for each type of transport or application, including the
timescale over which the proportionality is averaged, and control of
burstiness. The inverse proportionality requirement above is worded
as a 'SHOULD' rather than a 'MUST' to allow reasonable flexibility
when defining these specifications.
Data Center TCP (DCTCP) is an example of a scalable congestion
control.
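The scaling property can be illustrated with a short sketch (Python;
the function name is illustrative, not from any specification).
Assuming a DCTCP-like control that adds 1 segment per RTT and reduces
the window by half the marking fraction, the steady-state balance
yields a constant number of marks per RTT at any window:

```python
def steady_state_marking(window_segments):
    # Steady state balances the additive increase of 1 segment per RTT
    # against the multiplicative decrease of window * p / 2 per RTT:
    #     1 = window * p / 2   =>   p = 2 / window
    p = 2.0 / window_segments
    marks_per_rtt = p * window_segments   # constant (2) at any rate
    return p, marks_per_rtt

for w in (10, 100, 1000):
    p, marks = steady_state_marking(w)
    print(f"window={w:5d} segments  p={p:.4f}  marks/RTT={marks:.1f}")
```

For a Reno-like control, by contrast, marks per RTT shrink as the
window grows, so the control signals become sparser as flow rates
scale up.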
Each sender in a session can use a scalable congestion control
independently of the congestion control used by the receiver(s) when
they send data. Therefore there might be ECT(1) packets in one
direction and ECT(0) in the other.
In order to coexist safely with other Internet traffic, a
scalable congestion control MUST NOT identify its packets with the
ECT(1) codepoint unless it complies with the following bulleted
requirements. The specification of a particular scalable congestion
control MUST describe in detail how it satisfies each
requirement:
   o  A scalable congestion control MUST react to packet loss in a
      way that will coexist safely with a TCP Reno congestion control
      (see Appendix A for rationale).

   o  A scalable congestion control MUST react to ECN marking from a
      non-L4S but ECN-capable bottleneck in a way that will coexist
      with a TCP Reno congestion control (see Appendix A for
      rationale).

   o  A scalable congestion control MUST reduce or eliminate RTT bias
      over as wide a range of RTTs as possible, or at least over the
      typical range of RTTs that will interact in the intended
      deployment scenario (see Appendix A for rationale).

   o  A scalable congestion control MUST remain responsive to
      congestion when the RTT is significantly smaller than in the
      current public Internet (see Appendix A for rationale).
In general, a scalable congestion control needs feedback of the
extent of CE marking on the forward path. Due to the history of TCP
development, when ECN was added it reported no more than one CE mark
per round trip. Some transport protocols derived from TCP mimic this
behaviour, while others report the accurate extent of CE marking.
This means that some transport protocols will need to be updated as
a prerequisite for scalable congestion control. The position for a
few well-known transport protocols is given below.
TCP:  Support for accurate ECN feedback (AccECN) by both ends is a
   prerequisite for scalable congestion control. Therefore, the
   presence of ECT(1) in the IP headers even in one direction of a
   TCP connection will imply that both ends support AccECN. However,
   the converse does not apply. So even if both ends support AccECN,
   either of the two ends can choose not to use a scalable congestion
   control, whatever the other end's choice.
SCTP:  An ECN feedback protocol such as that specified in the
   proposed ECN extension for SCTP would be a prerequisite for
   scalable congestion control. That proposal would update the ECN
   feedback protocol sketched out in Appendix A of the standards-track
   specification of SCTP [RFC4960] by adding a field to report the
   number of CE marks.
RTP over UDP:  A prerequisite for scalable congestion control is for
   both (all) ends of one media-level hop to signal ECN support using
   the ecn-capable-rtp attribute [RFC6679]. Therefore, the presence
   of ECT(1) implies that both (all) ends of that hop support ECN.
   However, the converse does not apply, so each end of a media-level
   hop can independently choose not to use a scalable congestion
   control, even if both ends support ECN.
DCCP:  The ACK vector in DCCP [RFC4340] is already sufficient to
   report the extent of CE marking as needed by a scalable congestion
   control.
To implement the L4S treatment, a network node does not need to
identify transport-layer flows. Nonetheless, if an implementer is
willing to identify transport-layer flows at a network node, and if
the most recent ECT packet in the same flow was ECT(0), the node MAY
classify CE packets for classic ECN
treatment. In all other cases, a network node MUST classify CE packets
for L4S treatment. Examples of such other cases are: i) if no ECT
packets have yet been identified in a flow; ii) if it is not desirable
for a network node to identify transport-layer flows; or iii) if the
most recent ECT packet in a flow was ECT(1).
If an implementer uses flow-awareness to classify CE packets, to
determine whether the flow is using ECT(0) or ECT(1) it should use
only the most recent ECT packet of a flow {ToDo: this advice will
need to be verified experimentally}. This is because a sender might
have to switch from sending ECT(1) (L4S) packets to sending ECT(0)
(Classic) packets, or back again, in the middle of a transport-layer
flow. Such a switch-over is likely to be very rare, but it could be
necessary if the path bottleneck moves from a network node that
supports L4S to one that only supports Classic ECN. A host ought to
be able to detect such a change from a change in RTT variation.
The likelihood that an AQM drops a Not-ECT Classic packet (p_C)
MUST be roughly proportional to the square of the likelihood that it
would have marked it if it had been an L4S packet (p_L). That is
p_C ~= (p_L / k)^2
The constant of proportionality (k) does not have to be
standardised for interoperability, but a value of 2 is
RECOMMENDED.
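The effect of the square relationship can be checked numerically (a
sketch; names are illustrative). Substituting p_C = (p_L/k)^2 into a
Reno-like rate of 1/sqrt(p_C) yields k/p_L, the same dependence on
p_L as a scalable control, which is why the two flow rates roughly
balance:

```python
import math

K = 2.0  # RECOMMENDED value of the coupling constant k

def p_classic(p_l4s, k=K):
    # p_C ~= (p_L / k)^2
    return (p_l4s / k) ** 2

for p_l in (0.02, 0.05, 0.1):
    p_c = p_classic(p_l)
    reno_like = 1.0 / math.sqrt(p_c)  # Reno-like rate ~ 1/sqrt(p_C)
    scalable = K / p_l                # scalable rate ~ k/p_L
    print(f"p_L={p_l:.2f}  p_C={p_c:.6f}  "
          f"rates: {reno_like:.1f} vs {scalable:.1f}")
```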
The complementary DualQ Coupled AQM specification defines the
essential aspects of an L4S AQM, as well as recommending other
aspects, and gives example implementations in appendices.
The term 'likelihood' is used above to allow for marking and
dropping to be either probabilistic or deterministic. The example
AQMs in the DualQ Coupled AQM specification drop and mark
probabilistically, so the drop probability is arranged to be the
square of the marking probability. Nonetheless, an alternative AQM
that dropped and marked deterministically would be valid, as long as
the dropping frequency was proportional to the square of the marking
frequency.
Note that, contrary to RFC 3168, an AQM implementing the L4S and
Classic treatments does not mark an ECT(1) packet under the same
conditions that it would have dropped a Not-ECT packet. However, it
does mark an ECT(0) packet under the same conditions that it would
have dropped a Not-ECT packet.
This specification contains no IANA considerations.
Approaches to assure the integrity of signals using the new
identifier are introduced in the L4S architecture document.
Thanks to Richard Scheffenegger, John Leslie, David Täht,
Jonathan Morton, Gorry Fairhurst, Michael Welzl, Mikael Abrahamsson and
Andrew McGregor for the discussions that led to this specification.
Ing-jyh (Inton) Tsang was a contributor to the early drafts of this
document. The appendix listing the TCP Prague Requirements is based
on text authored by Marcelo Bagnulo Braun that was originally an
appendix to the L4S architecture document. That text was in turn
based on the collective output of the attendees listed in the minutes
of a 'bar BoF' on DCTCP Evolution during IETF-94.
The authors' contributions were part-funded by the European Community
under its Seventh Framework Programme through the Reducing Internet
Transport Latency (RITE) project (ICT-317700). Bob Briscoe was also
part-funded by the Research Council of Norway through the TimeIn
project. The views expressed here are solely those of the authors.
Informative References:

   o  "Adaptive RED: An Algorithm for Increasing the Robustness of
      RED's Active Queue Management", ACIRI.

   o  "Relentless Congestion Control", PSC.

   o  "'Data Centre to the Home': Ultra-Low Latency for All", Bell
      Labs, Simula Research Lab and BT (under submission).

   o  "One more bit is enough".

   o  "Up to Speed with Queue View", BT and Uni Karlstad.

   o  "Analysis of DCTCP: Stability, Convergence, and Fairness".

   o  "Notes: DCTCP evolution 'bar BoF': Tue 21 Jul 2015, 17:40,
      Prague", Simula.

   o  "Scaling TCP's Congestion Window for Small Round Trip Times",
      BT and Bell Labs.
This appendix is informative, not normative. It gives a list of
modifications to current scalable transport protocols so that they can
be deployed over the public Internet and coexist safely with existing
traffic. The list complements the normative requirements, stated
earlier in this specification, that a sender has to comply with
before it can set the L4S identifier in packets it sends into the
Internet. As
well as necessary safety improvements (requirements) this appendix also
includes preferable performance improvements (optimizations).
These recommendations have become known as the TCP Prague
Requirements, because they were originally identified at an ad hoc
meeting during IETF-94 in Prague. The wording
has been generalized to apply to all scalable congestion controls, not
just TCP congestion control specifically.
DCTCP is currently the most widely used
scalable transport protocol. In its current form, DCTCP is specified to
be deployable only in controlled environments. Deploying it in the
public Internet would lead to a number of issues, both from the safety
and the performance perspective. The modifications and additional
mechanisms listed in this section will be necessary for its deployment
over the global Internet. Where an example is needed, DCTCP is used as a
base, but it is likely that most of these requirements equally apply to
other scalable transport protocols.
Description: A scalable congestion control needs to distinguish
the packets it sends from those sent by classic congestion
controls.
Motivation: It needs to be possible for a network node to
classify L4S packets without flow state into a queue that applies an
L4S ECN marking behaviour and isolates L4S packets from the queuing
delay of classic packets.
Description: A scalable transport protocol needs to provide
timely, accurate feedback about the extent of ECN marking
experienced by all packets.
Motivation: Classic congestion controls only need feedback about
the existence of a congestion episode within a round trip, not
precisely how many packets were marked with ECN or dropped.
Therefore, in 2001, when ECN feedback was added to TCP [RFC3168], it
could not inform the sender of more than one ECN mark per RTT. Since
then, requirements for more accurate ECN feedback in TCP have been
defined in [RFC7560], and the AccECN proposal specifies an
experimental change to the TCP wire protocol to satisfy these
requirements. Most other transport protocols already satisfy this
requirement.
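For illustration, the kind of feedback AccECN provides can be
sketched as follows, assuming its 3-bit counter of CE-marked packets
(the ACE field), which the receiver echoes modulo 8 (a sketch; names
are illustrative):

```python
ACE_WRAP = 8  # a 3-bit counter of CE-marked packets wraps modulo 8

def new_ce_marks(prev_ace, curr_ace):
    # CE packets newly reported between two ACKs, assuming fewer than
    # 8 marks arrived in between (otherwise the delta is ambiguous
    # and the sender must make a conservative assumption).
    return (curr_ace - prev_ace) % ACE_WRAP
```

For example, a counter that moves from 6 to 1 reports three new
marks, since the field has wrapped.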
Description: A scalable congestion control needs to react to
packet loss in a way that will coexist safely with a TCP Reno
congestion control.
Motivation: Part of the safety conditions for deploying a
scalable congestion control on the public Internet is to make sure
that it behaves properly when it builds a queue at a network
bottleneck that has not been upgraded to support L4S. Packet loss
can have many causes, but it usually has to be conservatively
assumed that it is a sign of congestion. Therefore, on detecting
packet loss, a scalable congestion control will need to fall back to
classic congestion control behaviour. If it does not comply with
this requirement it could starve classic traffic.
A scalable congestion control can be used for different types of
transport, e.g. for real-time media or for reliable bulk transport
like TCP. Therefore, the particular classic congestion control
behaviour to fall back on will need to be part of the congestion
control specification of the relevant transport. In the particular
case of DCTCP, the current DCTCP specification states that "It is
RECOMMENDED that an implementation deal with loss episodes in the
same way as conventional TCP." For safe deployment of a scalable
transport in the public Internet, the above requirement would need
to be defined as a "MUST".
Packet loss might (rarely) occur in the case that the bottleneck
is L4S capable. In this case, the sender may receive a high number
of CE-marked packets and also experience a loss.
Current DCTCP implementations react differently to this situation.
At least one implementation reacts only to the drop signal (e.g. by
halving the CWND) and at least another DCTCP implementation reacts
to both signals (e.g. by halving the CWND due to the drop and also
further reducing the CWND based on the proportion of marked packets).
We believe that further experimentation is needed to understand the
best behaviour for the public Internet, which may or may not be one
of these existing approaches.
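The two observed reactions can be contrasted in a sketch
(illustrative only; as noted above, the best behaviour for the
public Internet is still an open question):

```python
def react_drop_only(cwnd, alpha):
    # Reaction to the loss alone: conventional halving (floor of 2).
    return max(cwnd / 2.0, 2.0)

def react_both(cwnd, alpha):
    # Halve for the loss, then additionally apply the DCTCP-style
    # proportional reduction for the marks (alpha is the smoothed
    # fraction of CE-marked packets).
    halved = cwnd / 2.0
    return max(halved * (1.0 - alpha / 2.0), 2.0)

print(react_drop_only(100.0, 0.4))  # 50.0
print(react_both(100.0, 0.4))       # 40.0
```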
Description: A scalable congestion control needs to react to ECN
marking from a non-L4S but ECN-capable bottleneck in a way that will
coexist with a TCP Reno congestion control .
Motivation: Similarly to the previous requirement, this requirement
is a safety condition to ensure a scalable congestion control behaves
properly
when it builds a queue at a network bottleneck that has not been
upgraded to support L4S. On detecting classic ECN marking (see
below), a scalable congestion control will need to fall back to
classic congestion control behaviour. If it does not comply with
this requirement it could starve classic traffic.
It would take time for endpoints to distinguish classic and L4S
ECN marking. An increase in queuing delay or in delay variation
would be a tell-tale sign, but it is not yet clear where a line
would be drawn between the two behaviours. It might be possible to
cache what was learned about the path to help subsequent attempts to
detect the type of marking.
Description: A scalable congestion control needs to reduce or
eliminate RTT bias over as wide a range of RTTs as possible, or at
least over the typical range of RTTs that will interact in the
intended deployment scenario.
Motivation: Classic TCP's throughput is known to be inversely
proportional to RTT, so one would expect flows over very low RTT
paths to nearly starve flows over larger RTTs. However, Classic TCP
has never allowed a very low RTT path to exist because it induces a
large queue. For instance, consider two paths with base RTT 1ms and
100ms. If Classic TCP induces a 100ms queue, it turns these RTTs
into 101ms and 200ms leading to a throughput ratio of about 2:1.
Whereas if a Scalable TCP induces only a 1ms queue, the ratio is
2:101, leading to a throughput ratio of about 50:1.
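The arithmetic in this example follows from Classic TCP throughput
being roughly inversely proportional to RTT (a sketch; the function
name is illustrative):

```python
def rate_ratio(rtt_a_ms, rtt_b_ms, queue_ms):
    # Two competing Classic TCP flows share their rates in the inverse
    # ratio of their total RTTs (base RTT plus the common queue delay).
    return (rtt_b_ms + queue_ms) / (rtt_a_ms + queue_ms)

print(round(rate_ratio(1, 100, 100), 1))  # Classic 100 ms queue: ~2.0
print(round(rate_ratio(1, 100, 1), 1))    # Scalable 1 ms queue: 50.5
```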
Therefore, with very small queues, long RTT flows will
essentially starve, unless scalable congestion controls comply with
this requirement.
Description: A scalable congestion control needs to remain
responsive to congestion when RTTs are significantly smaller than in
the current public Internet.
Motivation: As currently specified, the minimum required congestion
window of TCP (and its derivatives) is set to 2 maximum segment
sizes (MSS) (see equation (4) of [RFC5681]).
Once the congestion window reaches this minimum, all current TCP
algorithms become unresponsive to congestion signals. No matter how
much drop or ECN marking, the congestion window no longer reduces.
Instead, TCP forces the queue to grow, overriding any AQM and
increasing queuing delay.
L4S mechanisms significantly reduce queuing delay, so, over the same
path, the RTT becomes lower. Then this problem becomes surprisingly
common. This is because, for the same link capacity, a smaller RTT
implies a smaller window.
For instance, consider a residential setting with an upstream
broadband Internet access of 8 Mb/s, assuming a max segment size of
1500 B. Two upstream flows will each have the minimum window of 2
MSS if the RTT is 6ms or less, which is quite common when accessing
a nearby data centre. So, any more than two such parallel TCP flows
will become unresponsive and increase queuing delay.
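The arithmetic behind this example can be checked directly (a
sketch; the function name is illustrative):

```python
MIN_CWND_SEGMENTS = 2  # current minimum TCP congestion window

def per_flow_window(link_bps, rtt_s, mss_bytes, n_flows):
    # Window (in segments) each of n equal flows needs to fill the
    # link: the bandwidth-delay product shared between the flows.
    return link_bps * rtt_s / (8 * mss_bytes) / n_flows

w = per_flow_window(8e6, 0.006, 1500, 2)
print(w)  # each of the two flows needs only ~2 segments
print(w <= MIN_CWND_SEGMENTS)  # already at the minimum window
```

With a third flow the per-flow requirement drops below the minimum
window, so the flows can no longer reduce their rate in response to
congestion signals.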
Unless scalable congestion controls are required to comply with
this requirement from the start, they will frequently become
unresponsive, negating the low latency benefit of L4S, for
themselves and for others. One possible sub-MSS window mechanism is
described in "Scaling TCP's Congestion Window for Small Round Trip
Times", and other approaches are likely to be feasible.
Description: To improve performance, scalable transport protocols
ought to enable ECN at the IP layer in TCP control packets (SYN,
SYN-ACK, pure ACKs, etc.) and in retransmitted packets. The same is
true for derivatives of TCP, e.g. SCTP.
Motivation: RFC 3168 prohibits the use of ECN on these types of
TCP packet, based on a number of arguments. This means these packets
are not protected from congestion loss by ECN, which considerably
harms performance, particularly for short flows. A companion
proposal counters each argument in RFC 3168 in turn, showing it was
over-cautious, and instead proposes experimental use of ECN on all
types of TCP packet.
Description: It would improve performance if scalable congestion
controls did not limit their congestion window increase to the
traditional additive increase of 1 MSS per round trip during congestion avoidance. The same is true for
derivatives of TCP congestion control.
Motivation: As currently defined, DCTCP uses the traditional TCP
Reno additive increase in congestion avoidance phase. When the
available capacity suddenly increases (e.g. when another flow
finishes, or if radio capacity increases) it can take very many
round trips to take advantage of the new capacity. In the steady
state, DCTCP induces about 2 ECN marks per round trip, so it should
be possible to quickly detect when these signals have disappeared
and seek available capacity more rapidly. It will of course be
necessary to minimize the impact on other flows (classic and
scalable).
TCP Cubic was designed to solve this problem, but as flow rates
have continued to increase, the delay accelerating into available
capacity has become prohibitive. For instance, with RTT=20 ms, to
increase flow rate from 100 Mb/s to 200 Mb/s, Cubic takes between 50
and 100 round trips. Every 8x increase in flow rate leads to 2x more
acceleration delay.
Description: Particularly when a flow starts, scalable congestion
controls need to converge (reach their steady-state share of the
capacity) at least as fast as classic TCP and preferably faster.
This affects the flow start behaviour of all transports that are
based on TCP slow start.
Motivation: As an example, a new DCTCP flow takes longer than
classic TCP to obtain its share of the capacity of the bottleneck
when there are already ongoing flows using the bottleneck capacity
(about a factor of 1.5 to 2 longer, according to published analyses
of DCTCP). This is detrimental in general, but
particularly harms the performance of short flows relative to
classic TCP.
This appendix is informative, not normative. It records the pros and
cons of various alternative ways to identify L4S packets, in order
to record the rationale for the choice of ECT(1) as the L4S
identifier. At the end, a summary sets out the distinguishing
features of the leading alternatives. It is intended to supplement,
not replace, the detailed text.
The leading solutions all use the ECN field, sometimes in combination
with the Diffserv field. Both fields have the additional advantage that
they are defined identically in IPv4 and IPv6.
A couple of alternatives that use other fields are mentioned at the end,
but it is quickly explained why they are not serious contenders.
Definition:
Packets with ECT(1), and conditionally packets with CE, would
signify L4S semantics as an alternative to the semantics of
classic ECN, specifically:
The ECT(1) codepoint would signify that the packet was sent
by an L4S-capable sender;
Given the shortage of codepoints, both L4S and classic ECN
sides of an AQM would have to use the same CE codepoint to
indicate that a packet had experienced congestion. If a packet
that had already been marked CE in an upstream buffer arrived
at a subsequent AQM, this AQM would then have to guess whether
to classify CE packets as L4S or classic ECN. Choosing the L4S
treatment would be a safer choice, because then a few classic
packets might arrive early, rather than a few L4S packets
arriving late;
Additional information might be available if the classifier
were transport-aware. Then it could classify a CE packet for
classic ECN treatment if the most recent ECT packet in the
same flow had been marked ECT(0). However, the L4S service
ought not to need transport-layer awareness;
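For illustration only (the names and structure below are not part of the specification), the classification rule just described could be sketched as:

```python
# IP-ECN codepoints per RFC 3168 (two bits in the IP header).
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

def classify(ecn):
    """Sketch of the 'ECT(1) and CE' scheme, without transport awareness."""
    if ecn in (ECT1, CE):
        # CE is ambiguous: it may have been set by an upstream classic
        # AQM.  Guessing L4S is the safer choice: a few classic packets
        # may arrive early rather than a few L4S packets arriving late.
        return "L4S"
    return "Classic"  # Not-ECT and ECT(0)

print(classify(ECT1), classify(CE), classify(ECT0))  # L4S L4S Classic
```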
Cons:
The L4S service is
intended to supersede the service provided by classic ECN,
therefore using ECT(1) to identify L4S packets could ultimately
mean that the ECT(0) codepoint was 'wasted' purely to distinguish
one form of ECN from its successor;
It is not always
possible to support ECN in an AQM acting in a buffer below the IP
layer. In
such cases, the L4S service would have to drop rather than mark
frames even though they might contain an ECN-capable packet.
However, such cases would be unusual.
Having to
classify all CE packets as L4S risks some classic CE packets
arriving early, which is a form of reordering. Reordering can
cause the TCP sender to retransmit spuriously. However, one or two
packets delivered early do not cause any spurious
retransmissions because the subsequent packets continue to move
the cumulative acknowledgement boundary forwards. Anyway, the risk
of reordering would be low, because: i) it is quite unusual to
experience more than one bottleneck queue on a path; ii) even
then, reordering would only occur if there was simultaneous mixing
of classic and L4S traffic, which would be even less likely in an
access link, which is where most bottlenecks are located; iii)
even then, spurious retransmissions would only occur if a
contiguous sequence of three or more classic CE packets from one
bottleneck arrived at the next, which should in itself happen very
rarely with a good AQM. The risk would be completely eliminated in
AQMs that were transport-aware (but they should not need to
be);
The classic ECN
RFCs require a sender to clear the ECN field to Not-ECT for
retransmissions and certain control packets, specifically pure
ACKs, window probes and SYNs. When L4S packets are classified by
the ECN field alone, these control packets would not be classified
into an L4S queue, and could therefore be delayed relative to the
other packets in the flow. This would not cause re-ordering
(because retransmissions are already out of order, and the control
packets carry no data). However, it would make critical control
packets more vulnerable to loss and delay. To address this
problem, an experiment has been proposed in which all TCP control
packets and retransmissions are ECN-capable.
Pros:
The ECN field generally works
end-to-end across the Internet. Unlike the DSCP, the setting of
the ECN field is at least forwarded unchanged by networks that do
not support ECN, and networks rarely clear it to zero;
Unlike Diffserv, ECN is
defined to always work across tunnels. However, tunnels do not
always implement ECN processing as they should do, particularly
because IPsec tunnels were defined differently for a few
years.
If all classic ECN
senders eventually evolve to use the L4S service, the ECT(0)
codepoint could be reused for some future purpose, but only once
use of ECT(0) packets had reduced to zero, or near-zero, which
might never happen.
Definition:
For packets with a defined DSCP, all codepoints of the ECN
field (except Not-ECT) would signify alternative L4S semantics to
those for classic ECN, specifically:
The L4S DSCP would signify that the packet came from an
L4S-capable sender;
ECT(0) and ECT(1) would both signify that the packet was
travelling between transport endpoints that were both
ECN-capable;
CE would signify that the packet had been marked by an AQM
implementing the L4S service.
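As a hypothetical sketch of this scheme (the DSCP value below is invented purely for illustration; no such codepoint is assigned), a classifier would key on the DSCP first and only then interpret the ECN field:

```python
# IP-ECN codepoints per RFC 3168.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11
L4S_DSCP = 0x2A  # invented example value; not an assigned codepoint

def classify(dscp, ecn):
    """The ECN field keeps classic semantics unless the L4S DSCP is set."""
    if dscp == L4S_DSCP and ecn != NOT_ECT:
        return "L4S"
    return "Classic"

print(classify(L4S_DSCP, CE), classify(0x00, CE))  # L4S Classic
```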
Use of a DSCP is the only approach to alternative ECN semantics
that the IETF has previously given as an example. However, it was
perhaps considered more for controlled environments than for new
end-to-end services;
Cons:
A DSCP is obviously not
orthogonal to Diffserv. Therefore, wherever the L4S service is
applied to multiple Diffserv scheduling behaviours, it would be
necessary to replace each DSCP with a pair of DSCPs.
The
resulting increased number of DSCPs might be hard to support for
some lower layer technologies, e.g. 802.1p and MPLS both offer
only 3 bits for a maximum of 8 traffic class identifiers. Although
L4S should reduce and possibly remove the need for some DSCPs
intended for differentiated queuing delay, it will not remove the
need for Diffserv entirely, because Diffserv is also used to
allocate bandwidth, e.g. by prioritising some classes of traffic
over others when traffic exceeds available capacity.
Very few networks
honour a DSCP set by a host. Typically a network will zero
(bleach) the Diffserv field from all hosts. Sometimes networks
will attempt to identify applications by some form of packet
inspection and, based on network policy, they will set the DSCP
considered appropriate for the identified application.
Network-based application identification might use some
combination of protocol ID, port number(s), application layer
protocol headers, IP address(es), VLAN ID(s) and even packet
timing.
Very few networks
honour a DSCP received from a neighbouring network. Typically a
network will zero (bleach) the Diffserv field from all
neighbouring networks at an interconnection point. Sometimes
bilateral arrangements are made between networks, such that the
receiving network remarks some DSCPs to those it uses for roughly
equivalent services. The likelihood that a DSCP will be bleached
or ignored depends on the type of DSCP:
Local-use DSCPs: These tend to be used to
implement application-specific network policies, but a
bilateral arrangement to remark certain DSCPs is often applied
to DSCPs in the local-use range simply because it is easier
not to change all of a network's internal configurations when
a new arrangement is made with a neighbour;
Global-use DSCPs: These do not tend to be
honoured across network interconnections any more than local-use
DSCPs. However, if two networks decide to honour certain of
each other's DSCPs, the reconfiguration is a little easier if
both of their globally recognised services are already
represented by the relevant global-use DSCPs. Note that today a global-use DSCP gives little
more assurance of end-to-end service than a local-use DSCP. In
future the global-use range might give more assurance of
end-to-end service than local-use, but it is unlikely that
either assurance will be high, particularly given the hosts
are included in the end-to-end path.
Diffserv codepoints are often not
propagated to the outer header when a packet is encapsulated by a
tunnel header. DSCPs are propagated to the outer of uniform mode
tunnels, but not pipe mode, and pipe mode
is fairly common.
Because this
approach uses both the Diffserv and ECN fields, an AQM will only
work at a lower layer if both can be supported. If individual
network operators wished to deploy an AQM at a lower layer, they
would usually propagate an IP Diffserv codepoint to the lower
layer, using for example IEEE 802.1p. However, the ECN capability
is harder to propagate down to lower layers because few lower
layers support it.
Pros:
If all usage of classic ECN
migrates to usage of L4S, the DSCP would become redundant, and the
ECN capability alone could eventually identify L4S packets without
the interconnection problems of Diffserv detailed above, and
without having permanently consumed more than one codepoint in the
IP header. Although the DSCP does not generally function as an
end-to-end identifier (see above), it could be used initially by
individual ISPs to introduce the L4S service for their own locally
generated traffic;
Definition:
This approach uses ECN capability alone as the L4S identifier.
It is only feasible if classic ECN is not widely deployed. The
specific definition of codepoints would be:
Any ECN codepoint other than Not-ECT would signify an
L4S-capable sender;
ECN codepoints would not be used for classic ECN, and the classic network service would
only be used for Not-ECT packets.
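Under this scheme the classifier collapses to a single test on the ECN field. A hypothetical sketch (feasible, as the text goes on to note, only if classic ECN is effectively obsolete):

```python
NOT_ECT = 0b00  # the only codepoint NOT treated as L4S under this scheme

def is_l4s(ecn):
    """Any ECN-capable packet (ECT(0), ECT(1) or CE) is treated as L4S."""
    return ecn != NOT_ECT

print(is_l4s(0b10), is_l4s(0b00))  # True False
```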
This approach would only be feasible if
it was generally agreed that there was little chance of any
classic ECN deployment in any network
nodes;
it was generally agreed that there was little chance of any
client devices being deployed with classic TCP-ECN on by default (note that classic
TCP-ECN is already on by default on many servers);
for TCP connections, developers of client OSs would all
have to agree not to encourage further deployment of classic
ECN. Specifically, at the start of a TCP connection classic
ECN could be disabled during negotiation of the ECN
capability:
an L4S-capable host would have to disable ECN if the
corresponding host did not support accurate ECN feedback, which is a prerequisite for the
L4S service;
developers of operating systems for user devices would
only enable ECN by default for TCP once the stack
implemented L4S and accurate ECN feedback, including requesting
accurate ECN feedback by default.
Cons:
The
constraints for deployment above represent a highly unlikely, but
not completely impossible, set of circumstances. If, despite the
above measures, a pair of hosts did negotiate to use classic ECN,
their packets would be classified into the same queue as L4S
traffic, and if they had to compete with a long-running L4S flow
they would get a very small capacity share;
See the same issue
with "ECT(1) and CE codepoints";
See the same
issue with "ECT(1) and CE codepoints".
Pros:
The ECT(1)
codepoint and all spare Diffserv codepoints would remain available
for future use;
As with "ECT(1) and CE codepoints";
As with "ECT(1) and CE codepoints".
It has been suggested that a new ID in the IPv4 Protocol field or
the IPv6 Next Header field could identify L4S packets. However, this
approach is ruled out by numerous problems:
A new protocol ID would need to be paired with the old one for
each transport (TCP, SCTP, UDP, etc.);
In IPv6, there can be a sequence of Next Header fields, and it
would not be obvious which one would be expected to identify a
network service like L4S;
A new protocol ID would rarely provide an end-to-end service,
because it is well known that new protocol IDs are often blocked
by numerous types of middlebox;
The approach is not a solution for AQMs below the IP layer;
Locally, a network operator could arrange for L4S service to be
applied based on source or destination addressing, e.g. packets from
its own data centre and/or CDN hosts, packets to its business
customers, etc. It could use addressing at any layer, e.g. IP
addresses, MAC addresses, VLAN IDs, etc. Although addressing might be
a useful tactical approach for a single ISP, it would not be a
feasible approach to identify an end-to-end service like L4S. Even for
a single ISP, it would require packet classifiers in buffers to be
dependent on changing topology and address allocation decisions
elsewhere in the network. Therefore this approach is not a feasible
solution.
The following table provides a very high-level summary of the pros
and cons detailed against the three schemes described above, for six
issues that set them apart.
+--------------+-------------------+---------+-------------------+
| Issue        |    DSCP + ECN     |   ECN   |    ECT(1) + CE    |
|              | initial  eventual | initial | initial  eventual |
+--------------+-------------------+---------+-------------------+
| end-to-end   |  N . .    . ? .   |  . . Y  |  . . Y    . . Y   |
| tunnels      |  . O .    . O .   |  . . ?  |  . . ?    . . Y   |
| lower layers |  N . .    . ? .   |  . O .  |  . O .    . . ?   |
| codepoints   |  N . .    . . ?   |  . . Y  |  N . .    . . ?   |
| reordering   |  . . Y    . . Y   |  . . Y  |  . O .    . . ?   |
| ctrl pkts    |  . . Y    . . Y   |  . O .  |  . O .    . . ?   |
+--------------+-------------------+---------+-------------------+
|              |                   | Note 1  |                   |
+--------------+-------------------+---------+-------------------+

Note 1: Only feasible if classic ECN is obsolete.
The schemes are scored based on both their capabilities now
('initial') and in the long term ('eventual'). The 'ECN' scheme shares
the 'eventual' scores of the 'ECT(1) + CE' scheme. The scores are one
of 'N, O, Y', meaning 'Poor', 'Ordinary', 'Good' respectively. The
same scores are aligned vertically to aid the eye. A score of "?" in
one of the positions means that this approach might optimistically
become this good, given sufficient effort. The table summarises the
text and is not meant to be understandable without having read the
text.
The ECT(1) codepoint of the ECN field has already been assigned once
for experimental use as the ECN nonce. The ECN field is
probably the only remaining field in the Internet Protocol that is
common to IPv4 and IPv6 and still has potential to work end-to-end, with
tunnels and with lower layers. Therefore, ECT(1) should not be
reassigned to a different experimental use (L4S) without carefully
assessing competing potential uses. These fall into the following
categories:
Receiving hosts can fool a sender into downloading faster by
suppressing feedback of ECN marks (or of losses if retransmissions are
not necessary or available otherwise).
The ECN nonce scheme proposes that a TCP sender could set
either ECT(0) or ECT(1) in each packet of a flow and remember the
sequence it had set, termed the ECN nonce. If any packet is lost or
congestion marked, the receiver will miss that bit of the sequence. An
ECN Nonce receiver has to feed back the least significant bit of the
sum, so it cannot suppress feedback of a loss or mark without a 50-50
chance of guessing the sum incorrectly.
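The parity check can be illustrated with a small sketch (hypothetical code, not from the nonce specification): concealing a mark removes one nonce from the sum, so the concealed feedback is only consistent when the hidden nonce happened to be 0, i.e. half the time on average:

```python
# Hypothetical illustration of the ECN nonce parity check.
sender_nonces = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]  # one random bit per packet

def nonce_sum(nonces, concealed=frozenset()):
    """LSB of the sum of the nonces the receiver admits to having seen."""
    return sum(n for i, n in enumerate(nonces) if i not in concealed) % 2

expected = nonce_sum(sender_nonces)   # parity the sender will check

# An honest receiver reports the full parity; a receiver concealing the
# mark on packet 7 no longer knows nonce 7, so its report only matches
# when that hidden nonce was 0 -- a 50-50 gamble over random nonces.
honest = nonce_sum(sender_nonces)
cheat = nonce_sum(sender_nonces, concealed={7})
print(honest == expected, cheat == expected)  # True False
```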
The ECN Nonce RFC is in the process of
being reclassified as historic. As far as is known, it has never been
deployed, and it was only implemented for a couple of testbed
evaluations. It would be nearly impossible to deploy now, because any
misbehaving receiver can simply opt out, which would be unremarkable
given all receivers currently opt out.
Other ways to protect TCP feedback integrity have since been
developed that do not consume any extra codepoints in the base IP
header. For instance:
the sender can test the integrity of the receiver's feedback by
occasionally setting the IP-ECN field to a value normally only set
by the network. Then it can test whether the receiver's feedback
faithfully reports what it expects. This works for loss, and
it will work for the accurate ECN feedback intended for L4S;
A network can enforce a congestion response to its ECN markings
(or packet losses) by auditing congestion exposure (ConEx). Whether the receiver or a downstream network
is suppressing congestion feedback or the sender is unresponsive
to the feedback, or both, ConEx audit can neutralise any advantage
that any of these three parties would otherwise gain.
The TCP authentication option (TCP-AO)
can be used to detect any tampering with TCP congestion feedback
(whether malicious or accidental). TCP's congestion feedback
fields are immutable end-to-end, so they are amenable to TCP-AO
protection, which covers the main TCP header and TCP options by
default. However, TCP-AO is often too brittle to use on many
end-to-end paths, where middleboxes can make verification fail in
their attempts to improve performance or security, e.g. by
resegmentation or shifting the sequence space.
ECN in RTP is defined so that the receiver
can ask the sender to send all ECT(0); all ECT(1); or both randomly.
It recommends that the receiver asks for ECT(0), which is the default.
The sender can choose to ignore the receiver's request. A rather
complex but optional nonce mechanism was included in early drafts of
RFC 6679, but it was replaced with a statement that a nonce mechanism
is not specified, explaining that misbehaving receivers could opt-out
anyway. RFC 6679 as published gives no rationale for why ECT(1) or
'random' might be needed, but it warns that 'random' would make header
compression highly inefficient. The possibility of using ECT(1) may
have been left in the RFC to allow a nonce mechanism to be added
later.
Therefore, it seems unlikely that anyone has implemented the
optional use of ECT(1) for RTP. Even if they have, it seems even less
likely that any deployment actually uses it. However these assumptions
will need to be verified.
Various researchers have proposed to use ECT(1) as a less severe
congestion notification than CE, particularly to enable flows to fill
available capacity more quickly after an idle period, when another
flow departs or when a flow starts, e.g. VCP and
Queue View (QV).
Before assigning ECT(1) as an identifier for L4S, we must carefully
consider whether it might be better to hold ECT(1) in reserve for
future standardisation of rapid flow acceleration, which is an
important and enduring problem.
Pre-Congestion Notification (PCN) is another scheme that assigns
alternative semantics to the ECN field. It uses ECT(1) to signify a
less severe level of pre-congestion notification than CE. However, the ECN field only takes on the PCN
semantics if packets carry a Diffserv codepoint defined to indicate
PCN marking within a controlled environment. PCN is required to be
applied solely to the outer header of a tunnel across the controlled
region in order not to interfere with any end-to-end use of the ECN
field. Therefore a PCN region on the path would not interfere with any
of the L4S service identifiers proposed in this appendix.