This is a sponsored blog post by Francois Caron, SolarWinds® Director, Product Management.
With the end of the year fast approaching, many IT organizations see a slight slowdown and often use this time for catch-up projects and budget planning. Maybe you have some leftover budget you need to spend on network hardware this year, and certainly your management wants a strategic plan for next year. Your boss may even have asked for a specific list of network assets to upgrade or replace because they are overloaded or will be soon.
Often you’re asked to estimate bandwidth needs based on a SWAG (a rough guess), but with just a few minutes and Excel® you can accurately quantify your true hardware needs. So what is the best way to quickly gather the capacity information that answers the planning question, “What do I need to budget for?”
Few of us are lucky enough to have a sophisticated capacity planning system, but we do have all the data required for a sound analysis to support hardware capacity increases.
Defining the expected load on your critical resources
With a little work, you can define your expected capacity requirement using the historical load measurements you already have.
Go to your favorite Network Management Software (NMS), use a free tool or query your device CLI, and get a sense of the load on an element, let’s say traffic on a WAN interface. The industry standard is to sample WAN interface load every 5 minutes, but you get even better results with shorter sampling intervals, down to 1 minute if you have storage for the data.
Collect data for a reasonable period that includes typical operation over different days of the week and, ideally, occasional high-demand periods. Aim for at least two months of data (288 samples per day at 5-minute intervals). You can ignore weekends if the load is so low that it would skew your typical-day results. Use the reporting feature of your collector and export the data to Excel®. Alternatively, you can connect Excel® directly to your NMS database (such as the SolarWinds Network Performance Monitor database). In Excel® 2010, open a new file, click the Data tab, then select From Other Sources and From SQL Server. Enter the name of the database server used for NPM and supply credentials that have rights to read data from the database.
You can then select the table to extract the data from and the location to import it to, so that you can do the analysis in Excel®. A more detailed summary of the Excel® export approach is available in Mav Turner’s blog on Getting More from the Data NPM Collects.
While it may be tempting to take easy approaches such as using averages or peak capacity spikes, these will underestimate or overestimate capacity needs, respectively. A better approach is the 95th percentile method, which lets you quickly drill down to the samples that represent your important traffic.
Without going into too much detail, if you want to plan your interface to be sized correctly 95% of the time (i.e. it will be saturated only 5% of the time), you should aggregate your 288 daily samples using the 95th percentile method. Plenty of articles on the internet define how this works, but here is a simple way to explain it: take your 288 samples in Excel®, sort them in increasing order, remove the top 5% (the 14 highest values), and the highest value remaining represents the 95th percentile of your daily sampling.
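For readers who prefer a script to a spreadsheet, the sort-and-discard method described above can be sketched in a few lines of Python. This is a minimal illustration, not a SolarWinds or Excel® feature: the function name `percentile_95` and the sample values are made up for the example.

```python
def percentile_95(samples):
    """95th percentile by the sort-and-discard method:
    sort ascending, drop the top 5%, keep the highest remaining value."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    discard = len(ordered) * 5 // 100  # 288 samples -> 14 discarded
    return ordered[len(ordered) - discard - 1]


# Hypothetical day of 5-minute samples (Mbps): mostly quiet, 14 short spikes
day = [20.0] * 274 + [95.0] * 14
print(percentile_95(day))           # 20.0 (the spikes are discarded)
print(sum(day) / len(day) < 95.0)   # True: the average sits below the peak
```

Note that spreadsheet percentile functions may interpolate between neighboring samples, so their result can differ slightly from this simple method.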
Excel® even has a built-in function, PERCENTILE, to calculate a percentile of a set of samples. (Google “95th percentile Excel” and you’ll find some great how-tos.)
What are critical resources to track?
Ports, interfaces and trunks are of course the most critical, but you might also want to analyze CPU utilization and memory load. GoDaddy’s DNS outage from router table memory exhaustion is a great lesson in the need to assure memory headroom.
Remember that not only physical but also virtual resources can create bottlenecks and reduce the quality of service if improperly sized. Your router’s QoS mechanism is a good example of a virtual capacity that needs proper load estimates. Classes of service are great contributors to quality of service, because they reserve buckets of bandwidth on physical interfaces for identified traffic. For example, a VoIP class of service definition can guarantee the low latency and jitter required for high-quality VoIP calls, but if it is sized incorrectly, you can create bottlenecks and miss your VoIP quality SLA. If you discover that virtual resources such as classes of service are too small (Cisco’s CBQoS [Class Based Quality of Service] MIB helps you measure them), you can always increase the class-of-service capacity. The key, however, is that you must understand the maximum physical interface capacity and the defined limits of the other CoS maps on the same shared interface. You can use the 95th percentile method to produce realistic estimates for all these capacities based on a little bit of historical data sampling.
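To make the class-of-service constraint concrete, here is a small Python sketch that checks whether a class can be enlarged without exceeding the physical interface once the other classes are accounted for. The class names, the bandwidth figures and the helper `can_resize` are all hypothetical; real limits would come from your router configuration.

```python
def can_resize(interface_mbps, class_limits_mbps, name, new_limit_mbps):
    """True if class `name` can be set to `new_limit_mbps` while the sum of
    all class limits still fits within the physical interface capacity."""
    others = sum(v for k, v in class_limits_mbps.items() if k != name)
    return others + new_limit_mbps <= interface_mbps


# Hypothetical 100 Mbps interface shared by three classes of service
limits = {"voip": 20.0, "video": 30.0, "default": 40.0}

# Suppose the measured 95th percentile of VoIP traffic is 28 Mbps
print(can_resize(100.0, limits, "voip", 28.0))  # True: 30 + 40 + 28 = 98
print(can_resize(100.0, limits, "voip", 35.0))  # False: 30 + 40 + 35 = 105
```

When the check fails, the choice is between shrinking another class and upgrading the physical interface.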
Finding the resources that need a capacity increase with the date-to-threshold technique
Here is a common technique that you can use to detect the resources that need a capacity upgrade:
- Take as much data as you can (optimally 2-3 months or more, although that gets difficult towards the end of the year), in the form of daily percentiles (1 value per day)
- Trend and forecast the future (linear regression is fine to start)
- Eliminate all interfaces that don’t have a growing trend
- Write a formula that detects when the trend line hits the physical capacity of the interface (note that many interfaces have different physical capacities)
- Schedule a capacity upgrade for every interface whose trend hits that threshold within the next 3, 6 or 12 months, depending on how conservative you want to be
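The steps above can also be sketched in Python as a cross-check on the spreadsheet. This is a minimal illustration using an ordinary least-squares line; the interface names, the data and the helper names (`fit_line`, `days_to_threshold`) are invented for the example.

```python
def fit_line(values):
    """Least-squares fit of y = slope * day + intercept, day = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_threshold(values, capacity):
    """Days after the last sample until the trend line reaches `capacity`.
    Returns None when the trend is flat or falling, 0.0 if already reached."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None  # no growing trend: eliminate this interface
    crossing_day = (capacity - intercept) / slope
    return max(0.0, crossing_day - (len(values) - 1))


# Hypothetical interfaces: (daily 95th percentiles in Mbps, capacity in Mbps)
interfaces = {
    "wan-growing": ([50 + 0.5 * d for d in range(60)], 100.0),
    "wan-flat": ([30.0] * 60, 100.0),
}
horizon_days = 180  # roughly 6 months; pick 90 or 365 to taste

flagged = {}
for name, (daily_p95, capacity) in interfaces.items():
    eta = days_to_threshold(daily_p95, capacity)
    if eta is not None and eta <= horizon_days:
        flagged[name] = eta
        print(f"{name}: trend reaches capacity in about {eta:.0f} days")
```

A straight line is only a starting point, as the list says; seasonal or bursty growth would call for a different model.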
Hopefully, this helps you approach this interesting problem in a methodical way, with just your NMS to collect data, Excel® and a bit of time. Once the Excel® spreadsheet is built, you can reuse it to make this exercise much easier. One last tip: long-term predictions are usually not very reliable, so it’s better to repeat the exercise at least every six months, and more often if you have time. Ah, if only we could also increase the provisioned bandwidth in our spare time as easily.