Having fundamental knowledge of what affects TCP, UDP, and ultimately IP itself helps you to better troubleshoot the network when things go wrong. Even though as an industry we are moving toward software-defined everything and automated solutions, there will always be a need to know what’s going on under the hood (doesn’t our industry love the car analogies!).
Sure, you might not be aware of working with many of these basics on a day-to-day basis, but when an application on the network is not performing optimally, and you are the person looked upon for a reason why, knowing these fundamentals will lead you to a quicker solution.
This mega-post covers the following topics:
- Unicast flooding
- Out of order packets
- Asymmetric routing
- The impact of microbursts
- ICMP unreachables and redirects
- IPv4 options and IPv6 extension headers
- IPv4 and IPv6 fragmentation
- IP MTU
- IPv4 and IPv6 path MTU discovery
- TCP latency
- Bandwidth delay product
- Global synchronization
- TCP options
- UDP latency
Unicast flooding occurs when there is no entry for the destination in a device’s Layer 2 table (CAM), and so the incoming unicast frame is flooded to all ports within the VLAN. This is normal when a frame is first received on a device, such as a switch, and the destination address is not yet known. The switch floods the unicast frame out all ports in the VLAN (except the received port), and when the destination address is discovered, an entry is created in the CAM table. Flooding becomes a problem when it persists, for example when a CAM entry ages out while hosts still hold a valid ARP entry and keep sending to that MAC address. Because both ARP and CAM entries age out, one solution to this problem is to make sure ARP entries expire before CAM entries, so that the ARP exchange can re-populate the CAM entry before it expires, which will reduce flooding.
On Cisco IOS, you can block unicast flooding on a port with switchport block unicast which will prevent unknown unicasts from being flooded to ports configured with this command.
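The learn-and-flood behavior described above can be sketched as a toy model. This is a minimal illustration, not how any real switch is implemented; the `LearningSwitch` class, its method names, and the shortened MAC addresses are all hypothetical.

```python
class LearningSwitch:
    """Toy Layer 2 switch: learns source MACs, floods unknown unicast."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.cam = {}  # MAC address -> port where it was last seen

    def receive(self, in_port, src_mac, dst_mac):
        """Return the sorted list of ports the frame is sent out of."""
        self.cam[src_mac] = in_port  # learn (or refresh) the sender's port
        if dst_mac in self.cam:
            return [self.cam[dst_mac]]  # known destination: one port only
        # Unknown destination: flood to every port except the ingress port.
        return sorted(self.ports - {in_port})

    def age_out(self, mac):
        """Simulate the CAM aging timer expiring for one entry."""
        self.cam.pop(mac, None)


switch = LearningSwitch(ports=[1, 2, 3, 4])
print(switch.receive(1, "aa:aa", "bb:bb"))  # bb:bb unknown, so flooded
print(switch.receive(2, "bb:bb", "aa:aa"))  # aa:aa was learned on port 1
print(switch.receive(1, "aa:aa", "bb:bb"))  # bb:bb is now known on port 2
switch.age_out("bb:bb")
print(switch.receive(1, "aa:aa", "bb:bb"))  # entry aged out, flooded again
```

The last call shows why aging matters: once the entry for a silent host expires, traffic toward it is flooded until that host sends a frame again.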
Out of Order Packets:
Also referred to as “out of sequence” packets, these are packets that arrive in a different order from the one in which they were sent. There are many potential causes for this, such as asymmetric routing and packet loss among diversified paths in the network. TCP includes sequence numbers in an attempt to alleviate this issue by allowing the receiver to buffer received TCP segments and reassemble them in the correct order. However, packets that arrive out of order can still reduce network performance dramatically. For example, the TCP receiver sends a duplicate ACK for each segment that arrives ahead of a gap, and enough duplicate ACKs trigger the fast retransmit algorithm. The TCP sender, upon receiving the duplicate ACKs, assumes packets were lost in transit, retransmits, and reduces its congestion window, which reduces the TCP throughput.
Forwarding schemes that implement per-packet load distribution often result in out-of-order packets being received at the destination.
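To make the duplicate-ACK effect concrete, here is a small simulation of a receiver that buffers out-of-order segments and sends a cumulative ACK for every arrival. It is a simplified sketch: the function names and the fixed 100-byte segment size are assumptions, and real TCP stacks add delayed ACKs, SACK, and other refinements.

```python
def receive_segments(arrivals, segment_size=100):
    """Return the cumulative ACK sent after each arriving segment.

    arrivals: starting byte offsets of the segments, in arrival order.
    """
    buffered = set()   # out-of-order segments held for reassembly
    next_expected = 0  # next in-order byte the receiver is waiting for
    acks = []
    for seq in arrivals:
        if seq >= next_expected:
            buffered.add(seq)
        # Deliver any contiguous run that is now complete.
        while next_expected in buffered:
            buffered.remove(next_expected)
            next_expected += segment_size
        acks.append(next_expected)
    return acks


def count_duplicate_acks(acks):
    """Count ACKs that repeat the value of the ACK just before them."""
    return sum(1 for prev, cur in zip(acks, acks[1:]) if cur == prev)


in_order = receive_segments([0, 100, 200, 300, 400])
reordered = receive_segments([0, 200, 300, 400, 100])
print(in_order, count_duplicate_acks(in_order))
print(reordered, count_duplicate_acks(reordered))
```

In the reordered case nothing was lost, yet the receiver emits three duplicate ACKs for byte 100, which is the usual threshold at which a sender performs a fast retransmit.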
Routing is considered asymmetric when the traffic return path is different than the transmit path. “Hot potato” routing, where traffic traversing a network egresses at the nearest exit point, and link redundancy/alternate network paths, are two common causes of asymmetric routing. This can also be caused by NAT and firewall traversal. When the number of routers that traffic must pass through increases, the likelihood of asymmetric routing also increases.
Asymmetric routing is not normally an issue unless the traffic passes through a device that is expecting to see both sending and return traffic from a single IP flow. For example, if traffic is flowing through a firewall, it will generally expect to see traffic returned to it. If the traffic is sent to a firewall different from where the session was initially established, the traffic generally will be dropped. Most redundant network devices account for this by maintaining synchronized state tables.
Impact of Microbursts:
Microbursts are short spikes of traffic in the network and are typically encountered when traffic enters a higher-speed interface and exits a lower-speed interface, or in “fan-in” scenarios where traffic from multiple ports is trying to reach a single port. The impact on network traffic is usually observed as jitter, latency, and packet drops. Packet drops can be mitigated with larger buffers in the network devices at the cost of increasing the packet latency.
The impact of microbursts can also be observed when the networking device is unable to fully process all the traffic at line rate. For example, a software-based router may have gigabit interfaces, but may be unable to handle a sudden burst of traffic at a full gigabit of speed. After the packet buffers have filled, packets will begin to drop, which can be observed on interface statistics as output drops.
ICMP Unreachables & Redirects:
The ICMP Destination Unreachable message acts as a feedback mechanism for a router to let a source device know that it has no method to communicate with the desired destination. However, it is up to the sending device, upon receiving the ICMP Destination Unreachable message, whether to take any particular action. The ICMP Destination Unreachable (Type 3) message carries a code that indicates what type of failure is present. The six lowest-numbered codes are:
- Code 0 “Network Unreachable” means the packet could not be delivered to the network specified
- Code 1 “Host Unreachable” means the packet can be routed to the destination network, but the specified host does not exist on the destination network.
- Code 2 “Protocol Unreachable” is sent when the router receives a nonbroadcast message destined for itself that uses an unknown protocol.
- Code 3 “Port Unreachable”
- Code 4 “Fragmentation needed and the DF bit is set”
- Code 5 “Source Route Failed”
ICMPv6 uses Type 1 messages for Destination Unreachable.
ICMP Redirect messages (ICMP Type 5) are used when there are multiple gateways on the same network segment, and a host sends a packet to one of the gateways, but a different gateway on the same network segment has a better metric to the destination. When the gateway configured on the host receives the packet, but determines via its routing table that a different gateway on the same network segment is closer, it forwards the packet to the better gateway, and sends an ICMP Redirect message back to the host indicating that it should use the better gateway to reach the destination for future packets. ICMPv6 uses Type 137.
IPv4 Options and IPv6 Extension Headers:
IP options provide additional flexibility in how packets are handled. All devices using IPv4 must be able to handle IP packets with options. IP packets may contain zero or more options, which makes the total length of the Options field in the IPv4 header variable.
Each option can be either a single byte, or multiple bytes, depending on how much information the option needs to convey. When multiple options are present, they appear together in the options field, including any necessary padding to make the options field a multiple of 32 bits.
Individual options are generally structured as TLVs (type, length, value), except for those where the option type itself indicates all the required information, in which case the option length and option data subfields are omitted. The option type octet has three subfields:
- Copied flag, where a value of 1 means that the option is copied into all the fragments upon packet fragmentation
- Option class, where 0 indicates the control class, 2 indicates debugging and measurement, and 1 & 3 are reserved
- Option number, which is a 5-bit field to indicate 32 different options (defined by IANA)
Common options are:
- Record Route, which allows the source to create an empty list of IPv4 addresses and requests each router along the path to add its IPv4 address to the list
- Strict Source Route, where the complete path the datagram must follow to its destination is specified
- Loose Source Route, where the path is specified, and all routers in the list must be traversed, but additional routers may also be in the path
- Router Alert, which causes each router along the path to examine the packet, even if the router is not the ultimate destination
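The three subfields of the option type octet described above can be pulled apart with a few bit operations. The sketch below is illustrative only; the function name is hypothetical, and the sample type values (7, 131, 137, 148) are the IANA-assigned option types for the four options just listed.

```python
def decode_option_type(octet):
    """Split an IPv4 option type octet into (copied, class, number)."""
    copied = (octet >> 7) & 0x1        # top bit: copy into all fragments?
    option_class = (octet >> 5) & 0x3  # next two bits: 0 control, 2 debug
    number = octet & 0x1F              # low five bits: option number
    return copied, option_class, number


for name, value in [
    ("Record Route", 7),
    ("Loose Source Route", 131),
    ("Strict Source Route", 137),
    ("Router Alert", 148),
]:
    print(name, decode_option_type(value))
```

Note that the source route and Router Alert options have the copied flag set, while Record Route does not, so Record Route appears only in the first fragment of a fragmented packet.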
For IPv6, most of the options have been removed or altered, and are placed after the main IPv6 header in one or more Extension Headers. By doing this, the main IPv6 packet header remains a fixed size of 40 bytes, which increases the speed of packet processing. Extension headers are not examined or processed by any node along the packet’s delivery path, until the packet reaches the node(s) identified in the Destination Address (DA) field of the IPv6 header, except for “Hop-by-Hop Options”, which must be the very first extension header if it is present, and is examined by every node along the path.
Within the main IPv6 header, the “Next Header” field indicates the type of header to follow, based on the header code. All extension header types include a Next Header field which logically links together all the extension headers, with the final extension header’s Next Header field pointing to the payload itself.
Common Next Header value codes are:
- 0: Hop-by-Hop Options — this special option is examined by all devices along the path, unlike the other options
- 43: Routing — used similarly to the loose source routing option in IPv4
- 44: Fragment — includes fragment offset, identification, and more fragments fields
- 50: Encapsulating Security Payload (ESP)
- 51: Authentication Header (AH)
- 60: Destination Options — options intended to be examined only by the destination node.
When multiple extension headers are present, they should be placed into the following order:
- Hop-By-Hop Options
- Destination Options (for options to be processed by the destination as well as devices specified in a Routing header)
- Routing
- Fragment
- Authentication Header (AH)
- Encapsulating Security Payload (ESP)
- Destination Options (for options to be processed by the final destination only)
Whether in the main IPv6 header, or the final IPv6 Extension Header, the payload itself is referred to by its IANA-assigned protocol number. For example, TCP is protocol number 6, UDP is 17, and so on.
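The way the Next Header fields chain the extension headers together can be sketched as a short walk over raw bytes. This is a simplified illustration with hypothetical names: it handles only the Hop-by-Hop, Routing, Fragment, and Destination Options headers, and treats anything else (including AH and ESP, which have different length rules) as the end of the chain.

```python
EXTENSION_HEADERS = {
    0: "Hop-by-Hop Options",
    43: "Routing",
    44: "Fragment",
    60: "Destination Options",
}
UPPER_LAYER = {6: "TCP", 17: "UDP"}


def walk_header_chain(first_next_header, data):
    """Follow Next Header values through the bytes after the IPv6 header.

    Returns (list of header names, offset where the payload starts).
    """
    chain = []
    next_header = first_next_header
    offset = 0
    while next_header in EXTENSION_HEADERS:
        chain.append(EXTENSION_HEADERS[next_header])
        if next_header == 44:
            length = 8  # the Fragment header is always 8 bytes
        else:
            # Hdr Ext Len counts 8-byte units beyond the first 8 bytes.
            length = (data[offset + 1] + 1) * 8
        next_header = data[offset]  # first byte of every header in the chain
        offset += length
    chain.append(UPPER_LAYER.get(next_header, f"protocol {next_header}"))
    return chain, offset


# Hop-by-Hop (padding only) -> Routing (zeroed body) -> TCP
sample = bytes([43, 0, 1, 4, 0, 0, 0, 0]) + bytes([6, 0, 0, 0, 0, 0, 0, 0])
print(walk_header_chain(0, sample))
```

Here the main header's Next Header value is 0, the Hop-by-Hop header points to 43, and the Routing header points to 6, so the payload is identified as TCP starting 16 bytes after the main header.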
IPv4 and IPv6 Fragmentation:
When an IP packet is larger than the MTU of the outgoing data link, it must be fragmented, unless the DF (Don’t Fragment) bit is set, in which case the packet is dropped. When a packet is dropped because fragmentation is not allowed, an ICMP Destination Unreachable message may be returned.
The destination device must keep track of all the fragments to determine the proper order for reassembly. IP packets have a “More Fragments” (MF) bit and a “Fragment Offset” (FO) field to help keep track of everything. If an IP packet is fragmented, the MF bit is set on all fragments except the final one. The FO field is 13 bits, with each increment representing 8 bytes. For example, if the FO field has a decimal value of 200, it means that this fragment begins 200×8 bytes (1600 bytes) into the payload.
When a packet is fragmented, the header of the original packet is transformed into the header of the first packet fragment, and new headers are created for the additional fragments, each containing the same identification value, but with different FO values.
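The arithmetic behind the Fragment Offset and More Fragments fields can be sketched as follows. This is a minimal illustration, assuming a 20-byte IPv4 header with no options; the function name and the tuple layout are hypothetical.

```python
def fragment(payload_len, mtu, header_len=20):
    """Return (fragment_offset, data_length, more_fragments) per fragment.

    fragment_offset is expressed in 8-byte units, as in the IPv4 header.
    """
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any fragment data")
    fragments = []
    sent = 0
    while sent < payload_len:
        length = min(max_data, payload_len - sent)
        more_fragments = sent + length < payload_len
        fragments.append((sent // 8, length, more_fragments))
        sent += length
    return fragments


for frag in fragment(payload_len=4000, mtu=1500):
    print(frag)
```

A 4000-byte payload crossing a 1500-byte MTU link is carried in three fragments of 1480, 1480, and 1040 data bytes, with offsets 0, 185, and 370, and the MF bit cleared only on the last one.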
A packet may need to be fragmented multiple times during transmission if the MTU decreases multiple times along the path. Routers along the path do not perform fragmentation reassembly, even when a fragment is fragmented again due to an even lower MTU along the path. It is up to the TCP/IP stack in the end device to reassemble the fragments. Fragmentation in the network introduces extra overhead, since only the ultimate destination device can re-assemble the fragments.
Part of the reason for leaving reassembly to the end device is because fragments may take different paths in the network, and therefore the intermediary routers may not see all fragments. The ultimate destination device uses a buffer to collect and reassemble the fragments back into the original payload. However, if any of the fragments are not received when the reassembly timer expires, the entire packet is dropped, and it is up to the higher-layer protocol (such as TCP) to inform the sender that the packet was not received.
Cisco IOS supports a feature called “IP Virtual Fragment Reassembly” where all the fragments are collected and reassembled for further special processing (such as with ACLs), which is needed for some applications like NAT and security-related processing. After the processing is performed, the fragments are forwarded like normal. VFR is disabled by default in IOS, but is enabled automatically when needed, such as when using NAT.
With IPv6, only the source device may fragment IP packets. Intermediary devices, such as routers, do not fragment IPv6 packets. When fragments are present, IPv6 uses a Fragment extension header. Some extension headers, such as Hop-by-Hop Options, are considered unfragmentable and must be present in every fragment. Other extension headers, such as AH and ESP, may be fragmented along with the payload.
TTL and Traceroute:
The 8-bit Time-To-Live field in the IP header was originally used to limit the amount of time in seconds that a packet could exist on the internetwork. Today, it is used as a hop count, where each router decrements the value by 1 as it passes through. When the TTL reaches 0, the packet is dropped, and an ICMP Time Exceeded message may be generated in response.
Since packets are dropped when they reach a TTL of 0, this acts as a method to break logical loops in the network, because IP packets cannot be present in the network indefinitely, unlike Layer 2 Ethernet frames, which have no TTL field. Layer 2 Ethernet frames have the potential to loop around the network forever.
Traceroute uses TTL to determine the path through the network by sending UDP packets with a low TTL, causing routers along the path to drop the packets and send back ICMP Time Exceeded messages. The device performing the traceroute sends a UDP packet with a TTL of 1 toward the destination. The first-hop router decrements the TTL by 1, and since the TTL reaches 0 at that point, a Time Exceeded message is returned.
Then the device performing the traceroute sends a UDP packet with a TTL of 2 toward the destination, which the first-hop router decrements by 1 and passes on to the next hop. The next hop decrements the TTL, where it becomes 0 again, and the router returns a Time Exceeded message. This process repeats with increasing TTL values until the final destination is reached, upon which time the destination will usually report back an ICMP Destination Unreachable message, due to a random UDP port number having been chosen by the device initiating the traceroute for the traceroute probe.
When performing a traceroute from Cisco devices, which send three probes to each hop by default, the second probe in the final hop usually times out. This is due to the default ICMP rate limiting of Cisco IOS. The error messages returned from the intermediate routers are “TTL Exceeded”, whereas the message returned by the ultimate destination is “Destination Unreachable”.
The way traceroute truly works is platform-dependent. For example, some platforms use ICMP Echo messages instead of UDP probes, but the general concept of gradually increasing the TTL of the sent packets remains the same.
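The TTL mechanics can be illustrated without sending any packets. The sketch below only simulates the decrement-and-reply logic for UDP-style probes; the function name and hop labels are hypothetical, and it ignores timeouts, rate limiting, and multiple probes per hop.

```python
def simulate_traceroute(path, max_ttl=30):
    """Simulate probes along a path; the last entry is the destination.

    Returns a list of (ttl, responder, message) tuples.
    """
    results = []
    for ttl in range(1, max_ttl + 1):
        remaining = ttl
        for router in path[:-1]:
            remaining -= 1  # each router decrements the TTL by 1
            if remaining == 0:
                results.append((ttl, router, "Time Exceeded"))
                break
        else:
            # The probe survived every router and reached the destination,
            # which answers the unused UDP port with Destination Unreachable.
            results.append((ttl, path[-1], "Destination Unreachable"))
            return results
    return results


for line in simulate_traceroute(["r1", "r2", "r3", "host"]):
    print(line)
```

Each increase in TTL exposes one more router, and the change in message type on the final line is what tells the tracing device that it has reached the destination.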
The Maximum Transmission Unit (MTU) is the maximum number of bytes that can be carried in the payload of a single transmission at a given layer. While MTU is often associated with the data-link layer, it also applies to other protocol layers. The higher-layer MTU must fit within the lower-layer MTU. With the data link layer, each technology (Ethernet, serial DS3, etc.) has its own frame format and its own supported MTU.
For example, the default Ethernet MTU is 1500 bytes in most implementations. When an IP packet carrying a TCP segment needs to be sent, 20 bytes are used for the IP header, and 20 for the TCP header, which leaves 1460 bytes for the actual data payload. When setting the MTU, some platforms (like Classic IOS) do not count the Layer 2 header, while others (like IOS-XR) do. The default MTU of 1500 for an Ethernet interface on Classic IOS is equivalent to the default Ethernet MTU of 1514 on IOS-XR.
If the data to be sent is larger than the supported MTU on an interface, it must be either fragmented or dropped.
Larger MTU values reduce protocol overhead at the expense of having to re-transmit more data when data is lost or corrupted during transport.
MTU can be an issue for IP when different tunneling protocols are used on top of IP. For example, IP-in-IP adds another 20 bytes of overhead, effectively reducing the MTU of the payload by 20.
IPv4 and IPv6 Path MTU:
To avoid fragmentation when two devices need to communicate over an IPv4/IPv6 network, the packet size must either be no larger than the smallest link MTU along the entire path, or be set to a conservative protocol minimum, which is commonly taken as 576 bytes for IPv4 (the datagram size every IPv4 host must be able to accept), and 1280 bytes for IPv6 (the minimum link MTU).
Path MTU Discovery for IPv4 works by setting the DF bit in the IP header, and any device along the path that cannot support the MTU will drop the packet, but should send back ICMP Type 3 Code 4 (Fragmentation Needed) with its MTU size.
Path MTU Discovery for IPv6 is implemented in the sending device, which starts with the assumption that the path MTU is that of the sending device’s connected link. If a device along the path has a smaller MTU, it will drop the packet and send back an ICMPv6 Type 2 (Packet Too Big) message, containing the MTU.
Most end devices using pMTUd will periodically send new probes to see if the MTU has increased. The default of most implementations, as recommended in RFC 1191, is 10 minutes.
A drawback of pMTUd is that different packets may take different paths in the network, each with their own different MTU sizes.
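The discovery loop can be sketched as a simulation in which the sender shrinks its packet size each time a link reports a smaller MTU. This is a toy model with hypothetical names; it assumes a single fixed path and that every ICMP message makes it back to the sender, which is often not the case in practice.

```python
def discover_path_mtu(link_mtus):
    """Simulate Path MTU Discovery over a fixed path.

    link_mtus: MTU of each link from sender to receiver, in order.
    Returns (discovered path MTU, number of probes sent).
    """
    path_mtu = link_mtus[0]  # start with the sender's own link MTU
    probes = 0
    while True:
        probes += 1
        # The first link too small for the probe drops it and reports
        # its MTU (Fragmentation Needed or Packet Too Big).
        bottleneck = next((m for m in link_mtus if m < path_mtu), None)
        if bottleneck is None:
            return path_mtu, probes  # probe reached the destination
        path_mtu = bottleneck


print(discover_path_mtu([1500, 1400, 1500, 1280]))
```

With link MTUs of 1500, 1400, 1500, and 1280, the sender needs three probes: one dropped at the 1400-byte link, one dropped at the 1280-byte link, and one that gets through.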
The TCP Maximum Segment Size (MSS) represents the largest amount of TCP data a host will accept in a single segment. If the MSS plus protocol overhead is larger than the MTU, the datagram must be fragmented at the IP layer.
The MSS of a host is sent in the TCP SYN. However, hosts do not negotiate the MSS, and will normally use the lowest of the two values. Most hosts will take this a step further and use the outgoing interface’s MTU as part of the MSS calculation. For example, with a typical Ethernet MTU of 1500, the MSS is calculated as 1460 bytes because of the 20 bytes of IP header overhead, and 20 bytes of TCP header overhead. This is done to attempt to avoid fragmentation at the IP layer.
Likewise, when implementing tunneling, the TCP MSS is often adjusted to avoid fragmentation at the IP layer because of the overhead associated with the tunneling protocol(s).
Cisco IOS supports changing the MSS of TCP SYN packets that are sent through the router. This is commonly used with PPPoE, which supports an MTU of 1492 bytes.
The standard TCP default MSS is 536 bytes (576 bytes for the minimum IP MTU, minus 20 bytes for the IP header, minus 20 bytes for the TCP header). However, most implementations using Ethernet-based networks set the default TCP MSS to 1460 bytes.
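The MSS arithmetic used throughout this section is simple enough to check with a few lines. This is a minimal sketch, assuming IPv4 and TCP headers of 20 bytes each with no options; the function name and the `tunnel_overhead` parameter are hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu, tunnel_overhead=0):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - tunnel_overhead - IPV4_HEADER - TCP_HEADER


print(mss_for_mtu(1500))                      # typical Ethernet
print(mss_for_mtu(1492))                      # PPPoE
print(mss_for_mtu(576))                       # standard default
print(mss_for_mtu(1500, tunnel_overhead=20))  # IP-in-IP over Ethernet
```

The four results are 1460, 1452, 536, and 1440 bytes, which is why MSS adjustment on a PPPoE or tunnel-facing router typically lowers the value advertised in the SYN.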
TCP latency is often measured by the Round Trip Time (RTT), which is the length of time it takes to receive a response to a TCP message. For example, establishing a new TCP session involves sending a SYN and expecting to receive a SYN/ACK in response.
Latency begins with the propagation delay, which is no faster than the speed of light. Serialization delay, and intermediary device processing also add to the overall latency.
TCP has an inverse relationship between latency and throughput: when the latency increases, the throughput is decreased. When the latency is increased, the sender may be idle while waiting for acknowledgements. Packet loss, combined with latency, further compounds the effect on the overall throughput.
UDP traffic does not suffer from these same throughput issues in the presence of latency because it is connectionless and is not expecting to receive back acknowledgements.
TCP is a reliable, connection-oriented protocol, which acknowledges the successful receipt of packets. However, if TCP had to send an acknowledgement for each individual packet, the overhead would be increased, and the performance would be decreased, which is why windowing is implemented.
Windowing allows a single acknowledgement to refer to multiple TCP packets. The window size specifies how many bytes may be sent before an ACK is required. TCP uses cumulative acknowledgement, which means a single value is sent to acknowledge a range of data (without making use of the Selective Acknowledgement feature). For example, if the window size is 1000 and bytes 1 – 1000 have been received successfully, an ACK with the value 1001 is sent, indicating the starting point for the next set of data.
The sliding window refers to a reference point within the entire TCP stream. For example, if the window size is 1000, and 900 contiguous bytes were acknowledged, the window referencing the entire TCP stream can be shifted 900 bytes to the right, and 900 more bytes can now be sent.
The window size can be adjusted by both the sender and the recipient based on how much data they are willing to accept. For example, a device will only have so much room in its buffers, which must be cleared, either partially or completely, before more data can be accepted.
The window size is indicated by a 16-bit integer in the TCP header, which limits the window to 64KB. However, through the use of the Window Scale TCP option, this value can be shifted left by up to 14 bits, for a maximum window of roughly 1GB. TCP window scaling is often used as a method to alleviate the symptoms of networks with a large bandwidth delay product (LFNs, Long Fat Networks), such as high-speed high-delay satellite links.
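The relationship between window size, latency, and throughput can be worked through numerically. The sketch below uses hypothetical function names and assumes the Window Scale limit of a 14-bit shift from RFC 7323; it ignores slow start, loss, and header overhead, so the result is an upper bound rather than a prediction.

```python
MAX_WINDOW_FIELD = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14            # largest shift count allowed for Window Scale


def scaled_window(window_field, shift):
    """Effective window in bytes after applying the Window Scale shift."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return window_field << shift


def max_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds


unscaled = scaled_window(MAX_WINDOW_FIELD, 0)
print(max_throughput_bps(unscaled, 0.100))  # about 5.2 Mbit/s at 100 ms RTT
print(scaled_window(MAX_WINDOW_FIELD, MAX_SHIFT))  # about 1 GB
```

With an unscaled 64KB window and a 100 ms RTT, a single TCP session cannot exceed roughly 5.2 Mbit/s regardless of link speed, which is the sender-idle effect described earlier.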
Bandwidth Delay Product:
The Bandwidth Delay Product refers to the amount of data that can be in transit at any time between hosts, and is calculated by multiplying the capacity of the link in bits per second by its round-trip delay time in seconds.
Networks with large bandwidth delay products are known as LFNs, or Long Fat Networks. An example is a high-speed satellite link. The link may have high bandwidth capacity, but it also has a larger delay, which may cause issues with TCP windowing. TCP window scaling is used to alleviate the issue. This can also occur in ultra-high-speed networks.
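A worked example shows how quickly the bandwidth delay product outgrows the unscaled TCP window. The function names and the 100 Mbit/s, 500 ms link are illustrative assumptions, not measurements of any particular satellite service.

```python
def bandwidth_delay_product_bytes(bandwidth_bps, rtt_seconds):
    """Data that can be in flight on the path, in bytes."""
    return bandwidth_bps * rtt_seconds / 8


def min_window_shift(bdp_bytes, max_field=65535, max_shift=14):
    """Smallest Window Scale shift whose maximum window covers the BDP."""
    shift = 0
    while (max_field << shift) < bdp_bytes and shift < max_shift:
        shift += 1
    return shift


bdp = bandwidth_delay_product_bytes(100_000_000, 0.5)
print(bdp)                    # bytes in flight needed to fill the link
print(min_window_shift(bdp))  # shift needed to advertise that much
```

The link holds 6,250,000 bytes in flight, almost a hundred times the 65,535-byte unscaled maximum, and a shift of 7 is the smallest that lets a single session fill it.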
Global synchronization refers to multiple TCP streams on a link gradually expanding their window sizes until the link becomes congested and starts dropping traffic, which causes all TCP senders to reduce their window sizes and repeat the process. This results in a sawtooth-shaped graph of bandwidth utilization on the link.
Global synchronization results from a combination of how TCP uses slow-start and windowing, combined with tail-drop queuing on the router. One way to alleviate these symptoms is to use Random Early Detection queuing, where packets in a queue approaching congestion are randomly discarded, which causes the individual TCP stream to reduce its window size temporarily. By performing this action randomly on individual TCP streams, instead of all at once on all TCP streams (tail drop), the bandwidth of the link is used more efficiently.
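The random-discard idea can be sketched with the classic linear drop curve. This is a simplified illustration: the function names and threshold values are assumptions, and real implementations also track a weighted average queue depth and adjust the probability based on the number of packets since the last drop.

```python
import random


def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability for a given average queue depth (in packets)."""
    if avg_queue < min_th:
        return 0.0  # queue is short: never drop
    if avg_queue >= max_th:
        return 1.0  # queue is at or past the limit: behaves like tail drop
    # Between the thresholds, probability rises linearly toward max_p.
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def should_drop(avg_queue, min_th=20, max_th=40, max_p=0.1, rng=random):
    """Randomly decide whether to drop the arriving packet."""
    probability = red_drop_probability(avg_queue, min_th, max_th, max_p)
    return rng.random() < probability


for depth in (10, 25, 30, 39, 45):
    print(depth, red_drop_probability(depth, 20, 40, 0.1))
```

Because only a small, random fraction of packets is discarded while the queue is between the thresholds, different TCP streams back off at different times rather than all at once.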
Within the TCP header is an Options field that can be of variable length from 0 to 320 bits, aligned to 32-bit boundaries. The first byte is the Option-Kind which identifies the type of option, and by association whether the option consists of more than a single byte. Multi-byte options also include a 1-byte Option-Length field, and a variable-length Option-Data field.
Many options are present only in the initial SYN request packet. Common options are:
- 0: End of Option List
- 2: MSS Value
- 3: Window Scale
- 4: SACK Permitted
- 5: SACK (blocks of selectively-acknowledged data)
- 8: Timestamp
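The Option-Kind, Option-Length, and Option-Data layout can be sketched as a small parser. The function name and the sample bytes are hypothetical; the parser also handles kind 1 (No-Operation), a single-byte padding option that is not in the list above.

```python
def parse_tcp_options(data):
    """Parse a TCP Options field into a list of (kind, option_data)."""
    options = []
    i = 0
    while i < len(data):
        kind = data[i]
        if kind == 0:  # End of Option List: single byte, stop parsing
            options.append((0, b""))
            break
        if kind == 1:  # No-Operation: single byte used for alignment
            options.append((1, b""))
            i += 1
            continue
        if i + 1 >= len(data):
            raise ValueError("truncated option")
        length = data[i + 1]  # covers the kind and length bytes as well
        if length < 2 or i + length > len(data):
            raise ValueError("malformed option length")
        options.append((kind, bytes(data[i + 2:i + length])))
        i += length
    return options


# MSS 1460, NOP, Window Scale 7, SACK Permitted, then End of Option List
sample = bytes.fromhex("020405b4010303070402" + "0000")
for kind, value in parse_tcp_options(sample):
    print(kind, value.hex())
print(int.from_bytes(parse_tcp_options(sample)[0][1], "big"))  # MSS value
```

The sample is 12 bytes long so that it ends on a 32-bit boundary, and the first option decodes to an MSS of 1460.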
TCP Starvation / UDP Dominance occurs when TCP and UDP streams occupy the same queue. When congestion begins to occur and packets are dropped, TCP reacts by reducing its window size and thereby decreasing its transmission speed temporarily. UDP does not have a similar method of traffic control inherent to the protocol, and as the bandwidth utilization by TCP decreases, the UDP traffic increases. When congestion occurs again, TCP is further starved, and UDP further dominates.
The solution to this issue is to create separate queues for TCP and UDP traffic.
Unlike TCP, UDP does not expect to receive acknowledgements and does not suffer from the same throughput-related issues that TCP experiences in the presence of latency and packet loss.
UDP latency can affect application performance, such as voice quality issues in VoIP systems. However, UDP does not implement windowing, as TCP does, so overall throughput is not lost in the presence of latency in the same way that it is with TCP.
UDP is commonly used for real-time applications. These applications can be sensitive to both latency and jitter, which is the variation in latency. These applications often attempt to alleviate the symptoms of latency and jitter with buffers, which collect the data and then present it to the application at a steady rate.
Most of these networking basics can be further examined by using a packet capture analysis tool, such as Wireshark.
Note: This is an edited/abridged version of a post I made on my blog site. The original post is more Cisco-centric, and contains many different links on all the different topics.