Cisco has released OS version 9.0.1 for the popular and ubiquitous ASA firewall. One of the new features Cisco is touting is firewall clustering. We got talking about ASA 9.0 clustering on a podcast recording we did over the weekend, and we hit a few points based on the official Cisco configuration guide. That show is in the queue with several others, so I can’t promise when exactly it’s going to be published, but keep your eye out for the show with Brent Salisbury and Bob McCouch. We talked for 2+ hours I think, so it will probably be released in parts. But that’s an aside.
This morning, I got some Cisco inbox spam that linked to a presentation entitled “Enterprise-Class Security at Data Center Speeds – Clustering With Cisco ASA.” It’s an archived presentation from October 2012, and registration (using your CCO account if you like) is required. I gave it a view, and took a lot of notes.
Why cluster ASA firewalls?
- So you can scale throughput way up there: up to 100 Gbps.
- For high availability. Like traditional active/standby ASA firewall HA pairs, ASA clusters offer redundancy.
What are the basic things to know?
- For full path redundancy, you can cluster up to 8 firewalls to a vPC Nexus pair or a Cat6K VSS pair. In other words, MLAG topologies are supported. Obviously, you don’t have to plumb the firewalls to Cisco switches. Plumb them to whatever you want; LACP is your friend if you’re doing a layer 2 load-balancing method across the cluster. More on that in the “how” section.
- There is a cluster license, of course. I haven’t priced it, but my expectation is that the clustering license will be reassuringly expensive (as Greg Ferro puts it).
- There’s a clustering dashboard, which gives you a unified console, but can still drill into any single node if you want. The general idea is that you manage the cluster as an entity (including policy), and not individual firewalls.
- Hitless upgrading is supported, meaning you can upgrade one cluster member, put him back into the cluster, then upgrade another cluster member, until all the nodes are upgraded. Your traffic isn’t supposed to know the difference. I have upgraded many ASA firewall HA pairs and never had an issue with traffic interruptions, so I presume this will actually work as advertised.
- The cluster is fully manageable via ASDM 7.0, or the CLI if you prefer.
- When running the Packet Capture Wizard in ASDM, you can capture for a single node or the whole cluster. I presume the same single-node vs. whole-cluster capture functionality is available at the CLI as well, but Cisco didn’t demo it.
- When adding new members to an ASA cluster, throughput scales linearly, netting you about a 70% throughput improvement per node added to the cluster. Whether you get more or less than 70% will depend on what sort of traffic you’re pumping through the cluster. A lot of the traffic is hardware accelerated, so I’m not overly skeptical here. 70% is probably a good baseline estimate, but if sizing the cluster is really important to get right in your environment, you might want to obtain some BreakingPoint boxes and a demo ASA cluster, and test.
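Cisco’s ~70% figure makes back-of-the-napkin sizing easy. Here’s a minimal sketch, assuming a hypothetical per-node baseline throughput and treating the 70% as a flat scaling factor (real results will vary with traffic mix, as noted above):

```python
def cluster_throughput(per_node_gbps, nodes, scaling_factor=0.7):
    """Estimate aggregate cluster throughput.

    Assumes each node added beyond the first contributes only about
    70% of its standalone capacity (Cisco's rough guidance); the
    per-node number here is a placeholder, not a datasheet figure.
    """
    if nodes < 1:
        return 0.0
    return per_node_gbps * (1 + scaling_factor * (nodes - 1))

# Eight hypothetical 20 Gbps nodes: roughly 118 Gbps aggregate,
# not the naive 160 Gbps.
estimate = cluster_throughput(20, 8)
```

If nothing else, this shows why an 8-node cluster is not 8x a single node.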
How does an ASA cluster work?
- Pick one of the supported stateless load-balancing methods: ECLB (equal-cost load balancing via etherchannel), PBR (policy-based routing), or ECMP (good old layer 3 equal-cost multipath). ECLB/etherchannel seems to be the preferred method, with Cisco noting that it was the easiest to deploy. They also refer to ECLB as “span mode”.
- All ASA modes are supported when clustering. So, you can still choose to do single vs. multi context, transparent vs. routed, or mixed mode.
- Backups of each session are spread around the cluster. “Session to session HA” is highly distributed. Not every cluster member knows everything there is to know about all flows that might be transiting the cluster; these responsibilities are spread around.
- The data path is such that traffic always flows through the cluster in the same way to allow for inspections. There is a “consistent flow representation inside the cluster”. Therefore, asymmetric traffic flows do not present an issue, as the cluster redirects asymmetric egress traffic to the cluster member that owns the ingress flow. More on this in the next section.
- State is shared between cluster members, but not all cluster members. The member that owns a flow mirrors its state to one other cluster member that acts as its backup.
- Load balancing within the cluster is accomplished using Cisco-proprietary Cluster Control Protocol; think of this as the “cluster back plane”. This is used to redirect asymmetric traffic, mirror state information between cluster members, share redundancy information, and perform cluster maintenance.
- The cluster control link is used to monitor load and assign flows to other owners to keep the load evened out across members. This ensures that no one member is overloaded.
- The term “cluster data plane” is used to describe the transit traffic flowing through the cluster…not control traffic.
- Console outputs are replicated to the master, so that you can see the output of all cluster members in one place.
- If the cluster is running in layer 3 cluster mode (instead of etherchannel/span mode), then each cluster member gets an IP address on each L3 interface so that the cluster members can communicate with each other. Virtual MAC is recommended in this case.
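One thing worth internalizing about ECLB/span mode: it’s the upstream switch’s port-channel hash, not the ASA, that decides which cluster member first receives a given flow. The real Nexus/Cat6K hash algorithms are configurable and more involved, but a toy XOR-of-source-and-destination-IP hash in Python illustrates the idea:

```python
def ip_to_int(ip):
    """Convert a dotted-quad IPv4 address to a 32-bit integer."""
    a, b, c, d = (int(octet) for octet in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def etherchannel_member(src_ip, dst_ip, n_links):
    """Pick a port-channel member link for a flow.

    Toy stand-in for a switch's src-dst-ip load-balancing hash:
    XOR the two addresses, then take modulo the number of member
    links. Real switch hashes differ, but the key property holds:
    the same src/dst pair always lands on the same link (and thus
    the same cluster member).
    """
    return (ip_to_int(src_ip) ^ ip_to_int(dst_ip)) % n_links
```

Because XOR is symmetric, this particular toy hash even maps both directions of a flow to the same link, though with separate inside and outside switches the cluster can’t count on that, hence the redirection machinery below.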
How is asymmetric traffic handled (ingress via one cluster member, egress via another)?
- Initial ingress SYN is sent to some member of the cluster based on the load-balancing method you’re using. The receiving cluster member is the owner of that particular flow.
- The owner will perform what Cisco termed a “consistent hash” of that flow to select another member in the cluster. (This hash being consistent is pretty important, as that’s how all the members of the cluster determine who the director is for a given flow. I’d love to see that algorithm.) That cluster member is the backup (of the session, where state information is mirrored) as well as the director. The function of the director is to be a lookup service for the other cluster members. The lookup service is used by cluster members receiving asymmetric flows, so that they can identify the flow owner.
- Return traffic could arrive on a forwarder: a cluster member with no knowledge of the flow. The forwarder will then ask the director who owns the flow. The forwarder then forwards the entire packet up to the owner, who processes the packet appropriately.
- This forwarding methodology means that not all cluster members have to know the state of all flows. This allows for linear scaling. If you mirrored all flow states to all cluster members, the traffic flows related to state mirroring would negatively impact overall cluster throughput. You’d end up with a logarithmic throughput gain when adding new cluster members instead of a linear one.
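To make the owner/director/forwarder interaction above concrete, here’s a toy Python model. Cisco hasn’t published the consistent hash or the director’s table format, so the hash choice, class, and member names below are my own invention, not the real implementation:

```python
import hashlib

class AsaCluster:
    """Toy model of the owner/director/forwarder roles."""

    def __init__(self, members):
        self.members = list(members)
        self.flow_owner = {}  # the director's lookup table: flow key -> owner

    @staticmethod
    def flow_key(src, sport, dst, dport, proto="tcp"):
        # Sort the endpoints so both directions of a flow produce the
        # same key; the director must be findable from either side.
        lo, hi = sorted([(src, sport), (dst, dport)])
        return f"{proto}|{lo[0]}:{lo[1]}|{hi[0]}:{hi[1]}"

    def director(self, key):
        # Every member computes the same deterministic hash over the
        # flow key, so all members independently agree on the director.
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.members[h % len(self.members)]

    def ingress_syn(self, receiver, src, sport, dst, dport):
        # The member that receives the initial SYN owns the flow; the
        # director records it (and, in real life, holds the state backup).
        key = self.flow_key(src, sport, dst, dport)
        self.flow_owner[key] = receiver
        return key

    def asymmetric_return(self, forwarder, src, sport, dst, dport):
        # A forwarder with no knowledge of the flow asks the director
        # who the owner is, then relays the packet to that owner.
        key = self.flow_key(src, sport, dst, dport)
        return self.flow_owner[key]
```

The essential trick is in `flow_key` and `director`: any member, seeing either direction of a flow, can compute its way to the same director without asking anyone.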
What happens to an established TCP session during a failure?
- Let’s say the owner fails.
- Correspondingly, the LACP link that was connected to that now dead cluster member goes down.
- Switches will redistribute the load, and ingress/egress sessions are very probably asymmetric.
- The cluster node receiving the flow that had been established through the now failed owner queries the director for that flow.
- The director determines a new owner for the flow: whichever member receives the first packet after the original owner’s failure becomes the new owner, as recorded by the director.
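That re-ownership rule fits in a few lines of Python; the flow table and member names here are hypothetical stand-ins for what the director actually maintains:

```python
def handle_packet(flow_table, live_members, flow, receiving_member):
    """Toy version of the failover re-ownership rule.

    If the recorded owner is still alive, the receiving member acts as
    a forwarder and the packet is relayed to the owner. If the owner is
    gone, whichever live member sees the next packet becomes the new
    owner (the real director would also restore state from the flow's
    backup member at this point).
    """
    owner = flow_table.get(flow)
    if owner not in live_members:
        flow_table[flow] = receiving_member  # re-own the orphaned flow
        return receiving_member
    return owner
```

Note the implication: after a member dies, ownership of its flows isn’t rebalanced deliberately; it falls wherever the switches’ hashing happens to send the next packet.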
How do you handle NAT/PAT with an ASA cluster?
- This part of the presentation went quickly, and they didn’t offer much detail in the allotted time. But I did glean a few points.
- The most likely deployment scenario for a huge ASA cluster is in a data center. In that scenario, you’re not likely to be using NAT/PAT. But…let’s say you are anyway.
- Assign a PAT address to each cluster member. You want all of the cluster members to take part in the translation process.
- Use address pools; each member in the cluster will get one owner address from the pool to do translation with. Another cluster member will be assigned as the backup for that address.
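As a sketch of that pool-distribution idea (the presentation gave no specifics, so the assignment scheme, function name, and backup-selection rule here are all assumptions on my part):

```python
def assign_pat_pool(members, pool):
    """Distribute a PAT address pool across cluster members.

    Hypothetical scheme: each member owns one address from the pool,
    and the next member in the ring is designated as the backup for
    that address. Requires at least one address per member.
    """
    if len(pool) < len(members):
        raise ValueError("pool needs at least one PAT address per member")
    plan = {}
    for i, member in enumerate(members):
        plan[member] = {
            "pat_address": pool[i],
            "backup_member": members[(i + 1) % len(members)],
        }
    return plan
```

Whatever the real mechanics turn out to be, the point from the presentation stands: size the pool so every member can own an address and participate in translation.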
Cisco’s focus in the presentation was on building a beastly ASA firewall cluster that could handle tens of gigabits of throughput in a data center environment. I anticipate the pricing structure will reflect this. I am unclear as to whether clustering will be available across the ASA product line, or whether this will be limited to specific hardware platforms. I can see an application for 2x and 4x firewall clusters in smaller settings at the WAN edge, especially where high connection counts and/or high-level inspections are required.
Also, this is ASA 9.0.1 code right now. Meaning…you probably don’t want to run this just yet. Stability in the ASA code line has been embarrassingly poor in the 8.4 code train, and I doubt 9.0 is bringing stability improvements there. I, personally, would wait at the very least for 9.0.2 code, and then only after it’s been out for 3+ months with no fatal/critical bugs logged. Clustering is complicated technology, and while the ingress/egress load-balancing component is simple enough (relying on external switches or routers to do it), the traffic flows within the cluster are complex (owners, directors, state mirroring, consistent hashing, load distribution across members, etc.). When there are a lot of moving parts, there are a lot of opportunities for something to go wrong. Let someone else find the bugs for you, because I have a feeling there’s going to be a pile of them as 9.0 sees wider deployment.