Recently a colleague and I encountered a very perplexing issue with one of our ASAs. Our team had reports of users experiencing random timeouts to a web server, and the problem was “hot potato’d” around various members of the group until it finally piqued my curiosity. While looking at a packet capture, we noticed some oddities. First, we saw that the firewall was stripping out the SACK option and replacing it with a NOP during the 3-way handshake (thank you Laura Chappell!). This wasn’t ideal and made me curse profusely, but it wasn’t the cause of the timeouts. The firewall also seemed to be setting the TTL on some HTTP packets to 1. What the ….?! My co-worker drew the short straw and opened a TAC case, while I went to Twitter and sent a message to @ciscosecurity. We both received similar answers:
Thanks for being so patient. It is greatly appreciated. As discussed
on our call, this is expected behavior. The “feature” is called
ttl-evasion-protection.
The way the TTL evasion mechanism works is that every packet in a
direction will always get the lowest TTL that has ever been seen in that
direction for that flow. Each flow is tracked separately.
http://www.cisco.com/en/US/docs/security/asa/asa72/configuration/guide/protect.html
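This should get around the feature:
tcp-map ttl_workaround
 no ttl-evasion-protection
policy-map POLICY_MAP_NAME
 ! CLASS_MAP_NAME is a placeholder for a class-map that matches the affected traffic
 class CLASS_MAP_NAME
  set connection advanced-options ttl_workaround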
Hope this solves the mystery!
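To convince myself of what that reply means in practice, here is a minimal Python sketch of the "lowest TTL ever seen per direction, per flow" rule. It is a toy model built only from the description above, not Cisco's implementation: the `TtlNormalizer` class, the flow-key tuple, and the example addresses are all hypothetical, and the model ignores any TTL decrement the firewall itself might apply.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical flow key: (src_ip, src_port, dst_ip, dst_port). Direction is
# implied by the ordering, so the reverse direction gets its own entry.
FlowKey = Tuple[str, int, str, int]


@dataclass
class TtlNormalizer:
    """Toy model of the behaviour described in the TAC reply."""

    enabled: bool = True
    lowest_seen: Dict[FlowKey, int] = field(default_factory=dict)

    def forward(self, flow: FlowKey, ttl: int) -> int:
        """Return the TTL a packet would leave with."""
        if not self.enabled:
            # Models a tcp-map with `no ttl-evasion-protection`.
            return ttl
        # Remember the lowest TTL seen so far for this direction of the flow
        # and stamp it on every packet from now on.
        lowest = min(ttl, self.lowest_seen.get(flow, ttl))
        self.lowest_seen[flow] = lowest
        return lowest


# Documentation-range addresses, purely illustrative.
server_to_client = ("192.0.2.10", 80, "198.51.100.7", 51000)
client_to_server = ("198.51.100.7", 51000, "192.0.2.10", 80)

fw = TtlNormalizer()
out = [fw.forward(server_to_client, ttl) for ttl in (64, 64, 1, 64, 64)]
print(out)                                # [64, 64, 1, 1, 1]
print(fw.forward(client_to_server, 128))  # 128, the other direction is unaffected
```

One stray packet with a TTL of 1 is enough to pin every later packet in that direction to a TTL of 1, which matches the timeouts we were chasing.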
It turned out that the server (for some unknown reason) was sending out an HTTP packet with a TTL of 1 and, due to the TTL-evasion-protection mechanism, the ASA continued to use this TTL for subsequent packets in that direction of the flow. And we all know packets with a TTL of 1 don’t go very far. But this didn’t solve the mystery of WHY the server would demonstrate such strange behavior. Another co-worker, who had encountered a similar issue with mysteriously dropped packets and poor network throughput, suggested that the root cause was the TSO (TCP Segmentation Offload) driver on the server in conjunction with the TOE (TCP Offload Engine) on the network card.
“large segment offload (LSO) is a technique for increasing outbound
throughput of high-bandwidth network connections by reducing CPU
overhead. It works by queuing up large buffers and letting the network
interface card (NIC) split them into separate packets. The technique is
also called TCP segmentation offload (TSO) when applied to TCP, or
generic segmentation offload (GSO).”
….
“TCP offload engine or TOE is a technology used in network interface
cards (NIC) to offload processing of the entire TCP/IP stack to the
network controller. It is primarily used with high-speed network
interfaces, such as gigabit Ethernet and 10 Gigabit Ethernet, where
processing overhead of the network stack becomes significant.
The term, TOE, is often used to refer to the NIC itself, although
circuit board engineers may use it to refer only to the integrated
circuit included on the card which processes the TCP headers. TOEs are
often suggested as a way to reduce the overhead associated with IP
storage protocols such as iSCSI and NFS.”
This co-worker had probably spent hundreds of hours testing and researching this problem. He finally admitted to a rather drastic solution: removing the TOE chip from the NICs in multiple servers. From my own research, the firmware on the card can be “problematic”, and when the kernel driver is enabled (in Linux or VMware), odd behavior can sometimes be observed, including dropped packets, resets or suboptimal performance. But there’s lots of controversy surrounding this issue, as noted in some of the information I found:
https://b.kentbackman.com/tag/tso/
http://permalink.gmane.org/gmane.linux.drivers.e1000.devel/627
http://lwn.net/Articles/149941/
http://www.mail-archive.com/[email protected]/msg01690.html
http://www.thesubodh.com/2012/05/toe-tcp-offload-engine-on-nic-packet.html
http://v-front.blogspot.com/2011/05/network-troubleshooting-part-iii-real.html
https://bugzilla.redhat.com/show_bug.cgi?id=485292
I presented a few options to the sysadmin.
- Add a policy map for this server, but this may be a security issue and it probably isn’t easily documented for others encountering this odd entry in the policy.
- Check to see if there’s a firmware update for the network card.
- Check to see if there’s a patch for the OS.
- Disable/turn off TSO, but there’s a possibility of a performance hit to the CPU.
- Replace the NIC or remove the TOE component from the card.
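For the "disable TSO" option, a rough Python sketch of how it could be checked and switched off on a Linux host with `ethtool` follows. This is an assumption about tooling, not a record of what our sysadmin actually ran: the interface name `eth0`, the helper function names, and the sample output are hypothetical, and only the `ethtool -k` (show features) and `ethtool -K <iface> tso off` (change feature) invocations come from ethtool itself.

```python
import subprocess
from typing import List, Optional


def parse_tso_state(ethtool_output: str) -> Optional[bool]:
    """Return True/False for the TSO state in `ethtool -k` output, None if absent."""
    for line in ethtool_output.splitlines():
        name, _, value = line.strip().partition(":")
        if name == "tcp-segmentation-offload":
            words = value.split()  # e.g. ["on"] or ["off", "[fixed]"]
            return words[0] == "on" if words else None
    return None


def disable_tso_command(interface: str) -> List[str]:
    """Build the ethtool command that turns TSO off for one interface."""
    return ["ethtool", "-K", interface, "tso", "off"]


def disable_tso(interface: str) -> None:
    """Run the command. Needs root and the ethtool binary on the host."""
    subprocess.run(disable_tso_command(interface), check=True)


# Hypothetical excerpt of `ethtool -k eth0` output, used here so the parser
# can be exercised without touching a real NIC.
sample = """Features for eth0:
rx-checksumming: on
tcp-segmentation-offload: on
generic-segmentation-offload: on
"""

print(parse_tso_state(sample))      # True
print(disable_tso_command("eth0"))  # ['ethtool', '-K', 'eth0', 'tso', 'off']
```

A change made this way does not survive a reboot unless it is also added to the host's network configuration, and the CPU cost mentioned in the list is worth measuring before and after.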
Ultimately, we resolved the issue by adding a policy map in the firewall for the web server and having the sysadmin disable TSO in the Linux kernel, but I’m not even sure these were good choices, especially because TOE/TSO is supposed to be a performance-enhancing feature. It just seemed to be the most expeditious choice when a production service was intermittently unavailable with lots of unhappy users. Guess we were just collateral damage in another bufferbloat drive-by. Maybe Jim Gettys is right and the internet really is broken.