When designing a network, how much does latency really matter? For our purposes in this discussion, I’m defining latency as the amount of time it takes for a packet to leave a source and arrive at a destination. In other words, network latency specifically.
Local area network latency.
A local area network is one where all nodes are geographically quite close. Measured along the cable runs, endpoints are typically within hundreds of meters of each other or less. Campus LANs can grow to be larger than this, where endpoints could conceivably be a kilometer or more apart. WAN latency is a different discussion; we're not going to consider it here.
In optimally designed LANs, the hop count of physical devices also tends to be low. A computer in an office building might hop through a closet switch, aggregation switch, core switch, and then rack switch to arrive at a server hosting an application.
How much latency are we talking about here? Sub-millisecond on wired networks — network latency that can be measured in microseconds. Probably a few milliseconds when wifi provides the access layer, assuming the wireless network is performing reasonably well.
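If you want to sanity-check those numbers on your own LAN, you don't need special gear. Here's a rough sketch in Python that times TCP handshakes to a host on the local segment. The host and port in the commented example are made up; point it at anything on your network with an open port. Note that this measures connection setup plus OS overhead, so it overstates raw wire latency, but it's good enough to confirm whether you're in the sub-millisecond neighborhood.

```python
import socket
import time

def tcp_rtt_ms(host: str, port: int, samples: int = 5) -> float:
    """Roughly estimate round-trip latency by timing TCP handshakes.

    Returns the best of N samples in milliseconds. Best-of-N filters
    out scheduling noise on the measuring host.
    """
    best = float("inf")
    for _ in range(samples):
        start = time.perf_counter()
        # create_connection completes the three-way handshake, so the
        # elapsed time includes at least one full network round trip.
        with socket.create_connection((host, port), timeout=2):
            pass
        best = min(best, (time.perf_counter() - start) * 1000)
    return best

# Example (hypothetical server on the local segment):
# print(f"{tcp_rtt_ms('192.168.1.10', 22):.3f} ms")
```

On a healthy wired LAN, expect fractions of a millisecond; over wifi, expect a few milliseconds with more jitter between samples.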
In other words, with LANs, there is not a whole lot of delay involved in the network transport to deliver a packet when everything is working well.
When does latency become a concern for LANs, then? I ask this question because switch manufacturers often list port-to-port latency as an important statistic. Should the average network engineer shopping for a switch take port-to-port latency into consideration?
In most cases, a lower port-to-port latency will not improve application performance meaningfully on the average LAN. Buying a switch with a 500 nanosecond port-to-port latency instead of one with a 1 µs (1,000 ns) port-to-port latency will not result in a major benefit to application performance.
Network latency is only one part of transaction latency, the amount of time it takes an application transaction to complete. Transaction latency involves not only network latency, but also latency introduced by authentication, database I/O, storage I/O, plus any other compute tasks that must be completed for the transaction itself to complete.
Assuming a network that is not dropping packets due to congestion or failing hardware, network latency is not likely to be the largest part of the overall transaction duration. Nanoseconds are tiny slices of time that in the grand scheme of a larger transaction aren’t likely to be impactful, at least not in a typical enterprise.
You can rightly argue that shaving nanoseconds here and there all adds up. Shave enough nanoseconds by purchasing switches with lower port-to-port latency, and you’ve got a microsecond or two across the entirety of the network path. Shave enough microseconds, and you’ve got a millisecond. If you consider the number of milliseconds possible when spread over thousands or millions of transactions a day, it seems like shaving latency anywhere possible matters.
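The back-of-the-envelope math looks like this. The figures below are illustrative, not measurements:

```python
# How much do nanosecond savings add up to? Illustrative numbers only.
ns_saved_per_hop = 500          # e.g. a 500 ns switch replacing a 1 µs switch
hops = 4                        # closet -> aggregation -> core -> rack
transactions_per_day = 1_000_000

saved_per_transaction_ns = ns_saved_per_hop * hops      # 2,000 ns = 2 µs
saved_per_day_s = saved_per_transaction_ns * transactions_per_day / 1e9

print(f"{saved_per_transaction_ns / 1000:.1f} µs saved per transaction")
print(f"{saved_per_day_s:.1f} s of aggregate savings per day")
```

Two microseconds per transaction, two seconds a day in aggregate. But note the catch: the aggregate is spread across a million independent transactions. No single user ever experiences more than the 2 µs.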
Well, maybe. Numbers are numbers and math is math. My contention isn’t that you can’t reduce transactional latency by purchasing a switch with a lower port-to-port latency. Rather, my contention is that if perceptibly improving application performance (users can tell the difference) is critical, then you’ve most likely got a slow application that’s slow for reasons other than network latency. A faster switch isn’t going to fix that problem. Network transport isn’t usually a big slice of the transactional latency pie.
What does fix slow applications? A careful understanding of the application, including precise knowledge of every aspect of the application’s transaction. To fix an application that is underperforming, you must know each task that the application performs during a transaction, and how long each task takes. Throwing more hardware at the problem might be appropriate, but that’s an expensive guess unless the application transaction processes are well-understood, all latency components quantified, and the bottlenecks, aka culprits, identified.
When does port-to-port latency matter in LAN switching?
I believe that port-to-port latency matters in LAN switching in data center scenarios where hosts are densely populated. The issue is contention for the switch. Lots of hosts that are running at maximum effort (i.e. high NIC utilization) benefit from switches that minimize the amount of time a frame is in process.
However, this is (arguably) a scenario where the problem being solved isn't transactional latency. Rather, the issue is maximizing network fabric throughput under extremely high utilization. While a better-performing network fabric will certainly benefit application performance, the implication is that applications in this sort of environment are already well understood. The environment owners already know whether shaving nanoseconds of network latency is actually relevant in the overall scheme of application delivery.
Others might have additional thoughts here.
The view from the hot aisle.
In most application performance troubleshooting I’ve been part of over the years, the network is the first to be blamed, but the last to be an actual culprit. Yes, it happens at times that the network is underperforming. When this is the case, the problem is usually manifested as dropped packets. Optics go bad and forward inconsistently. Chassis switch line cards get old and on-board ASICs can fail. Congestion can happen, where not all traffic will make it through a network bottleneck.
Those scenarios are, in my experience, rare. Instead, there’s usually some other part of the application transaction that’s at the root of a slow transaction time. This is particularly true if the application was performing well on Monday, but slowed to a crawl on Tuesday.
The most common culprits I’ve run into?
- Timeouts, where an app waits for an answer from an unresponsive service before proceeding.
- Slow DNS, where an app makes a call to a DNS server that’s taking a second or more to respond.
- Slow database, where an app needs to interact with SQL, but the SQL server is overloaded.
- Slow storage, where an app must touch disk to complete a transaction, but the storage system is overloaded or has experienced a hardware failure it’s trying to make up for.
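The slow DNS culprit in particular is cheap to rule in or out. A minimal sketch: time a name resolution the same way most applications do, through the OS resolver. The hostname in the commented example is hypothetical; substitute a name your application actually looks up.

```python
import socket
import time

def time_dns_ms(hostname: str) -> float:
    """Time a name resolution the way an application would do it.

    getaddrinfo() goes through the OS resolver (hosts file, configured
    DNS servers, search domains), which is the path most apps take.
    """
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)
    return (time.perf_counter() - start) * 1000

# Example (hypothetical internal hostname):
# print(f"{time_dns_ms('app-db.example.internal'):.1f} ms")
```

If this regularly comes back at a second or more, you've found a likely culprit before ever blaming the switches.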
The key is to be able to identify where the latency is happening. Assuming you don’t have access to an application performance monitoring tool, you can always fall back to Wireshark.
Isolate a conversation between two endpoints, where the server on one side and the client on the other are part of an application transaction that's reported to be slow. Capture at both the client and the server. Look at interpacket gaps. When the client makes a request, how long does it take for the server to respond?
By analyzing both traces and comparing timestamps, you can determine both network latency — how long it’s taking packets to traverse the network between client and server — and the rest of the latency. The “rest of the latency” is how long the server chews on things before sending the response back to the client. Is that milliseconds? A half second? Two seconds? Ten seconds?
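As a sketch of that timestamp math, here are four timestamps pulled from a pair of hypothetical captures, one taken at the client and one at the server. The values are invented for illustration, and the arithmetic assumes both capture hosts have reasonably synchronized clocks (NTP or better), or you'll be measuring clock skew instead of latency:

```python
# Timestamps in seconds for one request/response pair.
# Two come from the client-side capture, two from the server-side capture.
client_sent_request  = 10.000000   # request leaves the client (client capture)
server_got_request   = 10.000450   # request arrives at the server (server capture)
server_sent_response = 12.100450   # response leaves the server (server capture)
client_got_response  = 12.100900   # response arrives at the client (client capture)

# Network latency is the sum of both one-way trips across the wire.
network_latency = (server_got_request - client_sent_request) + \
                  (client_got_response - server_sent_response)

# "The rest of the latency": how long the server chewed on the request.
server_think_time = server_sent_response - server_got_request

print(f"network: {network_latency * 1000:.2f} ms")   # ~0.9 ms round trip
print(f"server:  {server_think_time:.2f} s")         # ~2.1 s of server think time
```

With numbers like these, the network contributes under a millisecond while the server sits on the request for over two seconds, which answers the question decisively.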
If the server is churning excessively, then network latency has been eliminated as the issue, at least between client and server. The problem solving process can move ahead to the processes running on the server. The goal is to find out exactly what part of the transaction is causing the issue.
Again, in my experience, this isn’t likely to be the network. That said, it’s important to approach these issues on a quest for truth, and not a quest for exoneration. Knowing what the problem isn’t doesn’t fix the problem. The root cause must be found. If the network latency between client and server is fine, don’t forget that there’s going to be network latency to quantify between server and database, server and directory server, server and network attached storage, or server and ($deity help us) public cloud resource. You get the idea.
Even so, the solution to a slow application isn’t likely to be buying switches with lower port-to-port latency. But to move the ball ahead when troubleshooting or designing infrastructure, you need hard data.