Reading through a Cisco paper on the Nexus 5548P switch architecture, I came across the term iSLIP.
“The [Unified Crossbar Fabric] is a single-stage high-performance 100-by-100 crossbar with an integrated scheduler. The scheduler coordinates the use of the crossbar between inputs and outputs, allowing a contention-free match between I/O pairs. The scheduling algorithm is based on an enhanced iSLIP algorithm. The algorithm helps ensure high throughput, low latency, and weighted fairness across inputs, and starvation- and deadlock-free best-match policies across variable sized packets.”
Later, while describing the unified crossbar fabric in more detail, iSLIP showed up again.
“The scheduler coordinates the use of the crossbar between input and output ports. The original iSLIP algorithm, which is based on iterative round-robin scheduling, has been enhanced to accommodate cut-through switching of different packet sizes. There is a separate scheduler for unicast and a separate scheduler for multicast. The scheduler uses a credit system when allocating bandwidth to each egress port. The credit system monitors fabric buffer and egress buffer use per egress port before a grant is sent to an ingress port to give access to the fabric. This approach helps ensure that the crossbar fabric is lossless, and it enables flow control to ingress ports when congestion occurs.”
This put me on a quest to find out what iSLIP is and what problems it solves.
A Little Background
“What’s a crossbar fabric?” you might wonder. Think for a moment about what a network switch really does, way down deep at a fundamental level. A network switch takes traffic in on one port (ingress), and sends it out another port (egress). That’s it. Yes, there’s all sorts of magic that happens between ingress and egress (TCAM lookups, queueing, policy application, and so on), but the guts of a network switch are designed to take incoming traffic, determine where that traffic should go, and send it there.
How does traffic make it from the ingress port to the egress port? What creates the connection between the ingress and egress port so that the traffic has a path to follow? A crossbar fabric. In the Nexus 5548P, there happens to be a single 100×100 crossbar fabric. If there were more ports in the switch than feasible to connect via a single crossbar fabric, then the switch would use multiple crossbar fabrics, commonly in a Clos topology that connects all of the ingress and egress crossbar fabrics together.
Now think about traffic zipping around inside a network switch. Traffic in, crossbar switch, traffic out – millions of times, over and over. This will keep working until incoming traffic on two different ports needs to leave the switch via the same egress port. Now there's contention. Which of the two ingress ports should be allowed to send its traffic to the egress port? And where does the ingress traffic that has to wait park in the meantime? In a switch like the Nexus 5548P, the traffic will wait in an input queue, since it is an input-queued switch. (Note that the Nexus 5548P also does egress queuing, but that's not the focus of our discussion here.) The component that determines which ingress queue gets to deliver traffic to the egress port first is the arbiter. More on that in a bit.
The scenario as I’ve described it thus far is linear. If you’re picturing packets flowing through the switch, you might imagine incoming packets stacking up behind each other in a queue, waiting for their turn to traverse the crossbar fabric to their egress port. If a switch were actually constructed in this way, it would be subject to a well-known problem called head-of-line (HOL) blocking. HOL blocking is what happens when the packet at the front of the line is waiting to be forwarded across the fabric to its egress port – that packet is holding up all of the other packets behind it. In a first-in, first-out (FIFO) queuing scheme, this is exactly what happens. All of the stacked-up packets have to wait for the packets in front of them to be sent. As a result, under uniform traffic, throughput is mathematically limited to roughly 58.6% (precisely, 2 − √2) of the theoretical maximum capability of the switch.
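To see where that 58.6% figure comes from, here’s a quick simulation (a sketch; the function and parameter names are mine) of a saturated input-queued switch with a single FIFO per input. Every input always has a head-of-line cell destined for a uniformly random output, and each output can serve only one contending input per time slot:

```python
import random
from collections import defaultdict

def hol_throughput(n_ports=16, n_slots=20000, seed=42):
    """Simulate HOL blocking: one saturated FIFO per input, each
    head-of-line cell destined to a uniformly random output, and each
    output able to serve only one contender per time slot."""
    rng = random.Random(seed)
    # Current head-of-line destination for each input's FIFO.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(n_slots):
        # Group inputs by the output their head-of-line cell wants.
        contenders = defaultdict(list)
        for inp, dest in enumerate(hol):
            contenders[dest].append(inp)
        # Each requested output serves one contender, chosen at random;
        # the winner's next queued cell moves to the head of its FIFO.
        for dest, inputs in contenders.items():
            winner = rng.choice(inputs)
            hol[winner] = rng.randrange(n_ports)
            served += 1
    return served / (n_slots * n_ports)
```

For a 16-port switch this lands a little above the 2 − √2 ≈ 0.586 limit, which is approached as the port count grows.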
You’ve probably already deduced that a modern crossbar fabric in a data-center-class switch couldn’t possibly use FIFO. Today’s networks require switching fabrics that can deliver hundreds of gigabits or terabits per second of throughput; a single input buffer dequeued by FIFO can’t scale to that level of performance. Instead, modern switches replace the idea of a single FIFO queue per input port with multiple virtual output queues (VOQs) per input port. One VOQ exists per output port for which input traffic is destined. Therefore, an input port can have several packets queued in several different VOQs, assuming the packets are destined for several different output ports. What’s more, each of these VOQs is eligible to be serviced during a clock cycle; the practical benefit of VOQs is that HOL blocking is eliminated.
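The VOQ idea can be sketched in a few lines of Python (class and method names are illustrative, not anything from Cisco’s implementation):

```python
from collections import deque

class IngressPort:
    """One virtual output queue (VOQ) per egress port, instead of a
    single FIFO. A cell waiting on a busy output no longer blocks
    cells bound for idle outputs."""
    def __init__(self, n_outputs):
        self.voq = [deque() for _ in range(n_outputs)]

    def enqueue(self, cell, output):
        """Queue a cell in the VOQ for its destination output."""
        self.voq[output].append(cell)

    def nonempty_outputs(self):
        """The outputs this input will request in the next scheduling cycle."""
        return {o for o, q in enumerate(self.voq) if q}

    def dequeue(self, output):
        """Send the oldest cell queued for the given output."""
        return self.voq[output].popleft()
```

Each VOQ is still FIFO internally, which is fine: cells in the same VOQ all want the same output anyway, so nothing behind the head is being unfairly blocked.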
Back to the arbiter mentioned earlier. Even with VOQs, there is still the potential for contention across the crossbar fabric; this is true in two ways.
- Multiple inputs could be requesting access to the same output. (An output can only send one packet per clock cycle.)
- Multiple VOQs on the same input could be granted access to an output. (An input can only send one packet per clock cycle.)
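Taken together, those two constraints say that a valid crossbar schedule is a matching between inputs and outputs. A small helper (hypothetical names; a schedule modeled as an input-to-output dict) makes the rule concrete:

```python
def is_valid_schedule(match, requests):
    """A crossbar schedule is valid iff it is a matching: each input
    sends at most one cell (guaranteed by dict keys), each output
    receives at most one cell, and every matched pair was actually
    requested. requests[i] is the set of outputs input i has cells for."""
    outputs = list(match.values())
    no_output_conflict = len(outputs) == len(set(outputs))
    all_requested = all(o in requests[i] for i, o in match.items())
    return no_output_conflict and all_requested
```

The scheduler’s job is to find, every clock cycle, a matching like this that is as large as possible.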
The iSLIP algorithm is the arbiter that manages this contention, scheduling traffic in such a way that the crossbar fabric is able to achieve 100% throughput under uniform traffic.
What iSLIP Does
iSLIP is an algorithm that determines which ingress ports will send to which egress ports across the crossbar fabric in an input-queued switch. This paper by Nick McKeown explains how iSLIP works. Nick’s paper goes into great detail, complete with diagrams. The paper compares similar algorithms and explains how iSLIP improves on them. iSLIP’s performance under various traffic conditions is also documented. The paper also explains the challenge of implementing iSLIP in silicon, including a table showing the number of gates required to execute the iSLIP algorithm depending on the size of the crossbar fabric.
For our purposes, the most interesting element of iSLIP to understand is the algorithm itself in its simplest form: a single iteration (running through the algorithm once in a clock cycle). There are three key steps that happen.
- Request. All input points (ingress) on the crossbar fabric with queued traffic ask the output points (egress) for which that traffic is destined whether they can send.
- Grant. Each output point that received a request must determine which input point can send. If there was only a single request, this is straightforward. If there are multiple requests, the output point determines which input point can send via round-robin. When this has been determined, each output point that received a request sends a “grant” message (meaning “yes, you can send”) to the appropriate input point.
- Accept. An input point considers all of the grant messages it has received from output points and chooses one in round-robin fashion. Upon selection, the input notifies the output that the grant has been accepted. If and only if the output point is notified that the grant was accepted will it move on to the next request. If there is no “accept”, then the output point will attempt to service the same request it received previously during the next run of the algorithm.
When iSLIP is run through multiple iterations, the result is that the number of input-to-output matches is improved (i.e., more packets can be sent across the crossbar fabric at a time). So, how many times does iSLIP need to run to maximize the number of packets that can be switched through the crossbar fabric in a clock cycle? Nick’s paper suggests that for most traffic patterns, running iSLIP beyond four iterations does not yield significantly more matches.
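Putting the three steps and the iteration loop together, here is a sketch of iSLIP in Python, following McKeown’s description (all names are mine; a real scheduler lives in silicon, not software). `requests[i]` is the set of outputs for which input `i` has a non-empty VOQ, and the grant/accept pointers advance only when a grant is accepted on the first iteration, as the algorithm specifies:

```python
def islip(requests, grant_ptr, accept_ptr, iterations=4):
    """One scheduling cycle of iSLIP (a sketch after McKeown's paper).
    requests[i]: set of outputs input i has queued cells for.
    grant_ptr[o] / accept_ptr[i]: round-robin pointers, updated in place.
    Returns the matching as an input -> output dict."""
    n = len(requests)
    matched_in, matched_out = {}, {}
    for it in range(iterations):
        # Request + Grant: each unmatched output looks at the unmatched
        # inputs requesting it and grants the one nearest its pointer.
        grants = {}
        for o in range(n):
            if o in matched_out:
                continue
            reqs = [i for i in range(n)
                    if i not in matched_in and o in requests[i]]
            if reqs:
                grants[o] = min(reqs, key=lambda i: (i - grant_ptr[o]) % n)
        # Accept: each input with grants accepts the output nearest
        # its own pointer.
        offers = {}
        for o, i in grants.items():
            offers.setdefault(i, []).append(o)
        for i, outs in offers.items():
            o = min(outs, key=lambda out: (out - accept_ptr[i]) % n)
            matched_in[i], matched_out[o] = o, i
            if it == 0:  # pointers move only on first-iteration accepts
                grant_ptr[o] = (i + 1) % n
                accept_ptr[i] = (o + 1) % n
    return matched_in
```

Matched pairs are removed from consideration, so each later iteration can only add matches between inputs and outputs left unmatched by the earlier ones – which is exactly why extra iterations improve the match count.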
One final point worth mentioning is that this overview has considered traffic of equal importance. However, in modern data centers, certain traffic classes are prioritized. For instance, FCoE frames need to traverse the fabric in a lossless manner, while a TCP session in a scavenger QoS class does not. Does iSLIP handle traffic with different priorities? Yes, but in a modified form of the algorithm we’ve looked at. Nick’s paper briefly discusses iSLIP variants called Prioritized, Threshold, and Weighted iSLIP. Also note that in the second quote above, Cisco’s paper describes a bit about the enhancements that have been made in their iSLIP implementation.
- Cisco Nexus 5548P Switch Architecture (PDF)
- The iSLIP Scheduling Algorithm for Input-Queued Switches by Nick McKeown (PDF)
- I’ve used the term “packet” in this post to refer generically to a chunk of data. I might have also said “frame” or “cell”. Nick used the term “cell” in his paper.
- The Nexus 5548P has a lot of additional complexity in the crossbar fabric beyond the simple look at iSLIP I took here. iSLIP plays its part in Nexus architecture, but there’s more to the story.
- If any of you with a deep understanding of switch architecture see an error in my analysis, please let me know so that I can correct appropriately.