This post covers the most common QoS configuration techniques for Multipoint VPNs. The focus will be on FlexVPN; however, most of the conclusions apply to traditional DMVPN as well.
Most FlexVPN and DMVPN deployments use the Internet as a WAN transport. This creates a unique set of requirements and restrictions for QoS configuration. The Internet does not guarantee end-to-end QoS or MPLS-grade SLAs, therefore all QoS enforcement points must be located within the network.
With the proliferation of SaaS applications, some public Internet destinations become business-critical (e.g. Outlook365, ServiceNow) and need to be protected from the rest of the Internet traffic. Therefore both inter-site and Internet traffic must traverse the same device (a FlexVPN router) in order to create a single QoS enforcement point.
Throughout this post I will refer to a nested CBWFQ QoS policy simply as the “standard” QoS policy. This policy has a parent shaper, limiting the overall traffic bandwidth, and a child CBWFQ with application-specific bandwidth reservations. This is what a typical standard QoS policy may look like (class and policy names are illustrative):
policy-map STANDARD-CHILD
 class VOICE
  priority percent 10
 class SIGNALING
  bandwidth percent 5
 class DATA
  bandwidth percent 60
policy-map STANDARD-PARENT
 class class-default
  shape average percent 100
  service-policy STANDARD-CHILD
Inbound QoS is extremely important in networks with a non-QoS-aware WAN like the Internet. Since it’s impossible to have inbound QoS configured on the ISP’s last-hop router, the only way to ensure QoS requirements are met is to move the congestion point further inside the network. A technique called “Remote Ingress Shaping” assumes that a standard QoS policy is configured in the outbound direction of the WAN router’s LAN interface. The parent shaper of this policy needs to be set to 90-95% of the Internet downlink speed, which means the FlexVPN router will start pushing back on incoming TCP throughput by queueing and dropping packets before it happens on the ISP’s last-hop router. TCP will react by reducing its throughput and freeing up bandwidth for real-time UDP traffic. This approach works well with TCP-based traffic; however, it won’t have any effect on excessive UDP traffic, which can only be controlled by the sending side.
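As a minimal sketch, assuming a 100 Mbps Internet downlink and the standard child policy from above (interface names and values are hypothetical):

policy-map REMOTE-INGRESS-SHAPER
 class class-default
  ! Shape to ~95% of the 100 Mbps downlink so congestion
  ! happens here rather than on the ISP's last-hop router
  shape average 95000000
  service-policy STANDARD-CHILD
!
interface GigabitEthernet0/1
 description LAN-facing interface
 service-policy output REMOTE-INGRESS-SHAPER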
Physical interface QoS
This approach assumes that a standard QoS policy is applied to a physical WAN interface. All packets egressing the GRE tunnels established through this physical interface will have their DSCP values copied down to the GRE/IPSec headers and will be treated correctly by the interface’s policy. In case the CBWFQ class-maps match packets based on IP addresses or port numbers, the qos pre-classify command needs to be applied to all GRE tunnels. However, it’s generally considered a best practice to mark all packets before they egress the WAN interface (e.g. inbound on the WAN router’s LAN interface), since this allows the WAN QoS policy to match only on DSCP values.
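A sketch of this approach, reusing the policy names from above (interface names are assumptions):

interface GigabitEthernet0/0
 description WAN-facing interface
 service-policy output STANDARD-PARENT
!
interface Tunnel1
 ! Only needed if class-maps match on inner IP addresses/ports
 qos pre-classify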
This approach works best for branch sites with bandwidth relatively similar to that of their peers. However, when a site’s WAN link bandwidth is a lot higher, which is normally the case with DC/Hub sites, it’s best to control the traffic on a per-spoke basis in order not to overrun the receiving side.
Per-tunnel QoS feature
The Cisco MQC QoS framework was “extended” to meet the new requirements imposed by Multipoint VPN topologies. The official guide describes the QoS configuration for traditional DMVPN networks. It talks mainly about a new NHRP construct called a “group”, which can be defined on the Spoke and is used to select a particular QoS policy on the Hub. It also vaguely mentions that another policy can be configured on the physical interface to control the aggregate traffic egressing the WAN interface. As far as I’m aware, this is the only official per-tunnel QoS configuration guide available from Cisco, so I’ll assume FlexVPN inherits the same properties. Based on this document, the QoS configuration will consist of two main parts:
1. Physical interface shaper
2. Tunnel-specific shaper
The “standard” QoS policy can only be configured on tunnel interfaces. In DMVPN networks the Hub defines the NHRP group to QoS policy mapping statically on its mGRE interface. Every time a Spoke connects, a new point-to-point GRE tunnel is spawned and assigned the policy matching the Spoke’s NHRP group. In FlexVPN networks QoS policy selection is done by an authorizing Radius server, and the choice is based on a user-defined attribute (which could be any portion of the IKEv2 ID or X.509 certificate). However, in both DMVPN and FlexVPN the end result is that a policy pre-configured on the Hub is assigned to every P2P GRE interface.
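A sketch of the DMVPN variant of this mapping (group and policy names are hypothetical):

! Hub: map NHRP groups to pre-configured QoS policies on the mGRE interface
interface Tunnel0
 ip nhrp map group SPOKE-GOLD service-policy output GOLD-PARENT
 ip nhrp map group SPOKE-SILVER service-policy output SILVER-PARENT
!
! Spoke: advertise its group during NHRP registration
interface Tunnel0
 ip nhrp group SPOKE-GOLD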
The physical interface shaper serves as an upper bandwidth boundary for all outgoing flows and is logically “linked” to all the tunnel policies. This link is visible, for example, when a tunnel interface’s parent shaper is configured with a relative value (e.g. shape average percent 100), in which case the absolute value in bits per second is calculated from the physical interface’s shaper policy. Because of this “link” it’s also impossible to change any QoS settings on the physical interface without first shutting down all the GRE tunnels established through it. Thankfully, the IOS parser is kind enough to notify us about that with messages like:
Remove tunnel/session policy first before attaching policy to main int. And later install tunnel/session policy.
Service_policy with queueing features on this interface is not allowed if tunnel based queuing policy is already installed.
Another limitation of the per-tunnel QoS feature is that the physical interface can only have a class-default, a.k.a. “parent”, shaper. It is possible to apply a full-blown nested CBWFQ, but use it at your own peril as it may not be supported by TAC.
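In other words, the supported physical interface policy is limited to something like the following sketch (names and values are assumptions):

policy-map WAN-AGGREGATE
 class class-default
  ! Only a class-default shaper is supported here
  shape average 50000000
!
interface GigabitEthernet0/0
 service-policy output WAN-AGGREGATE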
The biggest limitation of the per-tunnel QoS feature is that its behaviour changes in the event of congestion. When multiple tunnels are established over a congested interface, their bandwidth allocations do not remain proportional to each tunnel’s configured bandwidth but instead equalise, which defeats the purpose of having different QoS policies for Spokes. To demonstrate that, I’ve done the little experiment described below.
Per-tunnel FlexVPN QoS testing
To see how traffic flows behave under congestion with and without QoS, I have created a simple FlexVPN topology in UnetLab. To generate flows I used the IOS built-in traffic generator TTCP. It’s available on most IOS platforms and can be started with ttcp receive and ttcp transmit $destination_ip from privileged EXEC mode. It measures TCP throughput in kilobytes per second.
The test topology consists of a single Hub with 3 similar traffic sources sitting behind it and 3 destinations: 2 FlexVPN spokes and a router representing a host on the Internet. Each source will generate traffic to the same destination in every scenario, thereby creating three distinct flows:
Flow #1 – Source 1 to Internet
Flow #2 – Source 2 to Spoke 1
Flow #3 – Source 3 to Spoke 2
Each simulation will consist of three stages:
1. Measuring throughput of each individual flow.
2. Measuring throughput of two concurrent flows (x2)
3. Measuring throughput of all three flows competing for bandwidth (x3)
Each of these stages is represented by a column in the tables below. To increase confidence and account for random variation, each value is an average of 3 independent measurements.
At this stage no QoS is configured on any of the interfaces. The first column contains values measured for each flow individually, with no other flows contending for the WAN link’s bandwidth. We can see that the maximum throughput was achieved by Flow #1 (unencrypted “Internet” traffic), with the two other flows each getting around 1 MBps of throughput; hence the bandwidth utilization ratio of the three flows is 4:1:1.
[Table: measured throughput – columns: No QoS (FIFO), No QoS (x2), No QoS (x3)]
When multiple flows egress the same interface, the effective bandwidth drops to around 3.3 MBps and the distribution of bandwidth changes to 2.5:1:1, which means the initial ratio of 4:1:1 doesn’t hold. There’s also no clear explanation for why the bandwidth of Flow #1 decreased by almost 50% while the other two flows only lost 20%.
For the next round of testing I’ve configured a simple physical interface shaper limiting the outgoing bandwidth to 2 Mbps. Flow #1 will only be restricted by the physical shaper’s bandwidth. Flows #2 and #3 will get their bandwidths assigned with a bandwidth qos-reference $BW command pushed from an IOS-local Radius server. Flow #2 will get a bandwidth of 1.5 Mbps while Flow #3 will only get 1 Mbps. This unfair distribution is deliberate, to show how the QoS scheduler allocates bandwidth to flows with different requirements. Without contention each flow gets a TCP throughput equivalent to the configured values, and the bandwidth utilization ratio for the 3 flows is 4:3:2.
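A sketch of this setup (policy names are hypothetical, and the exact Radius attribute syntax is an assumption based on the standard Cisco AV-pair format):

! Physical interface shaper capping aggregate bandwidth at 2 Mbps
policy-map PHYSICAL-SHAPER
 class class-default
  shape average 2000000
interface GigabitEthernet0/0
 service-policy output PHYSICAL-SHAPER
!
! Per-tunnel policy; the percent value resolves against the reference
! bandwidth pushed during IKEv2 authorization, e.g. via AV-pairs:
!   ip:interface-config=bandwidth qos-reference 1500
!   ip:interface-config=service-policy output PER-TUNNEL
policy-map PER-TUNNEL
 class class-default
  shape average percent 100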
[Table: measured throughput – columns: Full QoS, Full QoS (x2), Full QoS (x3)]
When more than one flow traverses the interface, the bandwidth provided by the QoS scheduler equalises. In the last experiment, with all 3 flows competing for bandwidth, the distribution is close to 1:1:1, which is far from the initial 4:3:2. That means the parent physical shaper performs simple fair queueing on all flows egressing the interface. This result is expected, since all flows end up in the same class of the physical interface shaper, which by default implements a flow-based fair queueing policy. The consequence is that the Hub’s link should always remain underutilized in order to guarantee the promised bandwidth allocations.
Update: Apparently ASR routers have a special command, bandwidth remaining ratio, which, when configured in the parent shaper, can avoid this fair queueing behaviour. However, this command doesn’t exist on normal ISR routers.
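On an ASR, a per-tunnel parent shaper could then be weighted roughly like this (a sketch; the policy name and ratio value are illustrative):

policy-map PER-TUNNEL-GOLD
 class class-default
  shape average percent 100
  ! Keeps allocations proportional under congestion (ASR only)
  bandwidth remaining ratio 3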
Although certain QoS mechanisms can be implemented for FlexVPN or DMVPN solutions, they all have their flaws. Increasing bandwidth still remains the best solution to QoS-related issues. The biggest problem is that the IOS MQC QoS framework was not designed to manage policies over multiple logical interfaces at the same time. That is where SD-WAN products, with a centralised controller that has a holistic view of all network interfaces and policies, may help. However, while we wait for those products to gain traction in the wider enterprise market, we still need to configure QoS the “old” way.
 – CiscoLive, Designing Multipoint WAN QoS