Continued from Part 1
Fat Tree
Fat-Trees were originally introduced by Charles Leiserson in 1985. The basic idea behind a fat-tree is to alleviate the bandwidth bottleneck closer to the root by adding extra links. Below is a logical representation of a four-level Fat-Tree; you can see how the link capacity grows towards the root. A Fat-Tree is generally represented as FT(k,n), where k is the radix of the switch and n is the number of levels in the Fat-Tree.
[Ref: https://en.wikipedia.org/wiki/Fat_tree]
One interesting property of a Fat-Tree is that there are multiple paths from the source up towards the root (upstream), but only one path from the root down to the destination (downstream).
Fat-Trees as a special case of Clos networks
Earlier, when we looked at Clos networks, we represented them with separate sets of input and output switches and unidirectional links. In data networks, every switch is both an input and an output switch, and the links are bidirectional. With that difference in mind, if we go back to our rearrangeably nonblocking topology and fold it in the middle, we get the topology on the right. This topology goes by various names: folded Clos, spine and leaf, or 3-stage Clos network.
Below is a visualization of how a Fat-Tree is derived from a 3-stage folded Clos topology.
One thing to keep in mind is that in the above spine/leaf topology, the spines have a radix of 8 ports while the leaves have a radix of 4 ports. If we want to construct a pure FT(4,2) Fat-Tree, where radix = 4 and levels = 2, then all the switches need to have the same radix. This means only 4 leaves can be attached to each spine, like below.
Now let's make our 5-stage network look like a Fat-Tree. The first step is to rearrange the blue and green switches as shown on the right, and then fold in the middle.
This results in the topology below: a Fat-Tree derived from a 5-stage Clos network. Formally, we can call it FT(4,3), as the radix of the switches is 4 and it has 3 levels.
The above topology is exactly the same as the one below, which was presented in the paper "A Scalable, Commodity Data Center Network Architecture". From now on, we will use the Spine, Aggregation and Edge terminology to refer to each layer.
Fat-tree topology summary
Assuming a switch has "k" ports and the number of levels is "L", the first column in the table below gives the generalized way to calculate the various parameters. The next three columns show the resulting numbers for 2, 3 and 4 levels. Based on the number of hosts we need to support, we can calculate how many levels of tree are needed.
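The same formulas can also be captured in a small Python sketch (the function name and output format are mine; it assumes the usual convention of k/2 ports down and k/2 ports up at every level except the top, where all k ports face down):

```python
# Standard fat-tree counts for FT(k, L): k-port switches, L levels,
# k/2 ports down and k/2 ports up everywhere except the top level,
# where all k ports face down.

def ft_parameters(k: int, levels: int) -> dict:
    half = k // 2
    return {
        "host_facing_ports": 2 * half ** levels,
        "edge_switches": 2 * half ** (levels - 1),
        # every intermediate level has the same switch count as the edge level
        "top_level_switches": half ** (levels - 1),
        "total_switches": (2 * levels - 1) * half ** (levels - 1),
    }

for levels in (2, 3, 4):
    print(f"FT(32,{levels}):", ft_parameters(32, levels))
```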
If you want to fiddle with various levels and radices and visualize the result, you can play with the tool at http://liulonnie.net/ftree-vis/
Design examples for Fat-tree
Now it's time to look at a few examples. Let's assume we have a switch with a radix of 32 ports, each capable of running at 25/50/100G. We will assume that switch-to-server connectivity is 25G and switch-to-switch connectivity is 100G.
A two-level Fat-Tree FT(32,2) with k = 32, L = 2
Typically, while designing a fat-tree, we split the k ports in half, i.e. k/2 ports down and k/2 ports up. This means we will have 16 ports towards the servers and 16 ports for uplinks.
- 64 servers with 25G connectivity per switch, since we have 16 host-facing ports and each 100G port can be broken out into 4 x 25G.
- Maximum of 16 spines.
- A total of 32 leaves.
- Maximum number of servers supported = total server-facing ports x 4 (as each port can support up to 4 x 25G servers). This gives 512 ports x 4 = 2048 servers.
Please note that even though we can connect 64 servers per switch, a standard DC rack will host fewer servers than that, so some ports will go unused. Below is a visualization of the two-level Fat-Tree.
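As a quick sanity check of the FT(32,2) numbers above (a minimal sketch; the 4 x 25G breakout per host-facing port is the assumption stated earlier):

```python
k = 32                       # switch radix
half = k // 2                # 16 ports down, 16 ports up
spines = half                # 16 spines
leaves = 2 * half            # 32 leaves
host_ports = leaves * half   # 512 host-facing 100G ports
servers = host_ports * 4     # each port breaks out into 4 x 25G
print(spines, leaves, host_ports, servers)   # 16 32 512 2048
```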
A three-level Fat-Tree FT(32,3) with k = 32, L = 3
As before, we split the k ports in half, half upstream and half downstream, which gives us:
- 64 servers per switch
- Maximum of 256 spine switches.
- Maximum of 512 leaf switches.
- A total of 8192 host-facing ports, which gives us a total of 32768 (8192 x 4) servers.
Below is a high-level representation of FT(32,3).
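The same sanity check for FT(32,3), again assuming a 4 x 25G breakout per host-facing port:

```python
k = 32
half = k // 2                 # 16
spines = half ** 2            # 256 spine switches
edge = 2 * half ** 2          # 512 edge switches (and 512 aggregation switches)
host_ports = 2 * half ** 3    # 8192 host-facing ports
servers = host_ports * 4      # 4 x 25G per port
print(spines, edge, host_ports, servers)   # 256 512 8192 32768
```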
Both examples have no oversubscription at any level. But you may often want to grow the fabric incrementally as traffic ramps up, rather than deploying the full-scale fabric at once. For instance, take our three-level Fat-Tree example. We could divide the 256 spines into sets of 16 (k/2) spines, which gives us a total of 16 spine sets. We can then start by deploying a few spine sets, based on the oversubscription we are comfortable with, and grow gradually by adding more spine sets.
For instance, in the figure below we have deployed only 4 spine sets, which gives an oversubscription of 4:1 at the aggregation layer and 1:1 at the edge.
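A rough sketch of this trade-off, assuming the spine-set convention described above (each aggregation switch gets exactly one uplink into every deployed spine set):

```python
k = 32
spine_sets_total = k // 2       # 16 spine sets of 16 spines each in the full FT(32,3)
deployed_sets = 4
# each aggregation switch has k/2 downlinks, but only one uplink into
# each deployed spine set
agg_oversub = (k // 2) / deployed_sets
print(f"{deployed_sets} of {spine_sets_total} spine sets deployed, "
      f"aggregation oversubscription {agg_oversub:.0f}:1")
# the edge layer still connects to all k/2 aggregation switches in its pod -> 1:1
```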
Decomposing some publicly known topologies
Let’s try to decompose some publicly known topologies with a disclaimer that this is just my interpretation and I could be totally wrong about this.
As an exercise, let's try to derive LinkedIn's DC fabric, presented at https://engineering.linkedin.com/blog/2016/03/the-linkedin-data-center-100g-transformation, from the bottom up.
Assumptions
- We have a switch with a radix of 32 x 100G, where every port is capable of running at 25/50/100G.
- We have 25G towards hosts and 50G for the uplinks between switches.
- Since we are using 50G for connectivity between switches, we can treat the 32 x 100G switch as having an effective radix of k = 64 (each 100G port acts as two 50G ports).
- We can have 48 servers in a rack. This means some server-facing ports will be empty.
- We need to support hosts on the order of 50K or more. This means we need to build a three-level Fat-Tree (FT(64,3)), as a two-level Fat-Tree (FT(64,2)) with 64 ports is not going to support the number of hosts we need.
This gives us:
- With k = 64, we get 32 ports towards the servers. This gives us the capability to host a total of 64 (32 x 2) servers with 25G connectivity on each switch, but we only need to support 48 servers per rack.
- We will have an FT(64,3) Fat-Tree, or a five-stage Clos.
- A three-level Fat-Tree gives us a total of k pods, i.e. 64.
- Each pod has k/2 leaves, i.e. 32.
- Each pod has 32 aggregation switches.
- There are 32 sets of spines, with 32 spines within each set.
- Our topology will look like the one below, which is a pattern we have already seen in our previous examples; the only difference is the number of switches at the Spine, Aggregation and Edge layers, since the switch radix is different.
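A sketch of the arithmetic behind these bullets (treating each 100G port as two 50G logical ports is the assumption stated earlier, and the variable names are mine):

```python
physical_ports = 32            # 32 x 100G switch
k = physical_ports * 2         # 64 logical 50G ports when inter-switch links run at 50G
half = k // 2                  # 32

pods = k                       # 64 pods
leaves_per_pod = half          # 32 edge switches per pod
aggs_per_pod = half            # 32 aggregation switches per pod
spine_sets = half              # 32 spine sets...
spines_per_set = half          # ...of 32 spines each
host_ports = 2 * half ** 3     # 65536 x 50G host-facing ports
max_servers = host_ports * 2   # 2 x 25G servers per 50G port -> 131072

# a two-level FT(64,2) would top out at 2 * half**2 * 2 = 4096 servers,
# far below the ~50K target, hence the three-level tree
print(pods, leaves_per_pod, aggs_per_pod, spine_sets, spines_per_set, max_servers)
```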
As you can see, this fabric is huge, and we may not want to deploy the whole fabric capacity at once. This time we can introduce some oversubscription at the Edge layer and slowly increase the fabric capacity as demand grows.
So let's say we want an oversubscription of 6:1 at the Edge layer. This results in 4 upstream switches at the aggregation layer per pod.
Since we only need 4 switches at the aggregation layer per pod, we only need 4 spines per spine set. If you haven't paid attention to the connectivity pattern until now, this is the time to go back and look at our earlier topology examples. Having 4 spines per spine set maintains the 1:1 oversubscription at the aggregation layer.
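The arithmetic behind that 6:1 figure, assuming 48 x 25G servers per edge switch and 50G uplinks as stated in the assumptions:

```python
servers_per_edge = 48          # servers per rack, 25G each
server_gbps = 25
uplink_gbps = 50
target_oversub = 6             # desired 6:1 at the edge

downstream_gbps = servers_per_edge * server_gbps            # 1200G of server traffic
uplinks_needed = downstream_gbps / (target_oversub * uplink_gbps)
print(uplinks_needed)          # 4.0 -> 4 aggregation switches per pod
# each of those 4 aggregation switches still uses all 32 of its spine-facing
# uplinks, which is what keeps the aggregation layer at 1:1
```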
If we rearrange the above topology and group together all the spines connected to the respective aggregation switches, it will start looking like the figure below. Notice that all 32 spines connected to Aggregation switch 1 in Pod 1 are now grouped together into one spine set.
It's time to add some pixie dust and rename spine sets 1, 2, 3 and 4 to Planes 1, 2, 3 and 4 respectively. Let's also color each plane.
And, as you can see, we have derived LinkedIn's DC fabric.
What if we want to improve our oversubscription ratio? Let's say we want to go from 6:1 to 4:1 at the Edge. To achieve that, we add 2 more switches at the aggregation layer (from 4 to 6) in every pod and two more planes (from 4 to 6), like below.
If you try a similar exercise for Facebook's DC fabric, presented at https://code.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/, by building a Fat-Tree FT(96,3), i.e. switch radix = 96 and three levels, with the right oversubscription ratios, you should be able to come pretty close to what they have.
Conclusion
We started with some very fundamental concepts, covered Clos networks and Fat-Trees, and then looked at a few examples. We also looked at some publicly known fabrics and decomposed them with our understanding of Fat-Tree design. Hopefully you will find this post helpful.
References
- https://engineering.linkedin.com/blog/2016/03/the-linkedin-data-center-100g-transformation
- http://networks-sy.ece.gatech.edu/wp-content/uploads/sites/536/2016/11/topologies-II.pdf
- http://liulonnie.net/ftree-vis/
- "A Scalable, Commodity Data Center Network Architecture"
- "A Study of Non-Blocking Switching Networks"
- "Principles and Practices of Interconnection Networks"
- https://www.firstclassfunc.com/2014/11/facebook-fabric-networking-deconstructed/
Went through both parts, great intro to fat-tree and Clos topologies, many thanks for that!
Thanks Marius.
Nice summary and discussion. Expanding your tool to have a few other options would be useful. For example, we have found value in keeping the number of core-to-leaf links per leaf switch to a minimum of two for large clusters, for reliability and redundancy purposes. This also works better when running a large number of differently configured HPC workloads. So in our latest cluster, we run individual links to each worker node but keep the fat tree at half its ultimate possible size. This changes the number of switches in your calculation and would be valuable to put in as a parameter.
Alan, thanks for the kind words. I am assuming the core-to-leaf links are not part of a LAG and are presented as individual links from a routing perspective?
Yes, the number of switches will change, and I will try to update this article with that as a parameter.