Jeff Behl, Chief Network Architect with LogicMonitor, is our guest author for this post. Jeff has been in the IT industry for over 20 years. He has an extensive background in architecting enterprise networks and data centers, and brings real-world knowledge of network operations at companies ranging from start-ups to enterprises, including UC Santa Barbara, Citrix Online, and ValueClick, to name a few.
LogicMonitor is a SaaS-based performance monitoring platform serving clients across the world. Our customers install LogicMonitor “Collectors” within their data centers to gather data from devices and services, and use a web application to analyze the aggregated performance metrics and to configure alerting and reporting. This means our entire operation (and therefore the monitoring our customers depend on) relies on ISPs to ensure that we efficiently and accurately receive billions of data points a day.
Detecting Major Outages
The logic for alerting on potential data-center-wide or major network issues is fairly simple: if we have not heard from a Collector for a few minutes, we consider the Collector to be down. This may indicate a data-center-wide power or network issue, and we have a special class of alert for these cases.
However, what happens if the network path between a Collector and our data center is blocked due to a transit ISP issue? This used to be a problem for our more remote clients (e.g., those in Australia), where paths to the U.S. are less resilient and prone to brief, periodic lapses in connectivity due to the multitude of transit ISPs involved.
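As a rough sketch of that logic in Python (with hypothetical names and a three-minute threshold standing in for "a few minutes"; this is not LogicMonitor's actual implementation), the check might look something like this:

```python
import time

# Hypothetical sketch of the down-detection logic described above.
COLLECTOR_DOWN_THRESHOLD = 180  # seconds ("a few minutes")

last_heartbeat = {}  # collector_id -> epoch seconds when data last arrived

def record_heartbeat(collector_id: str) -> None:
    """Called whenever data arrives from a Collector."""
    last_heartbeat[collector_id] = time.time()

def raise_site_down_alert(collector_id: str) -> None:
    """Stand-in for the special alert class for possible site-wide outages."""
    print(f"ALERT: Collector {collector_id} silent for >{COLLECTOR_DOWN_THRESHOLD}s")

def check_collectors() -> None:
    """Run periodically: flag any Collector we have not heard from recently."""
    now = time.time()
    for collector_id, seen in last_heartbeat.items():
        if now - seen > COLLECTOR_DOWN_THRESHOLD:
            raise_site_down_alert(collector_id)
```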
Starting With LogicMonitor’s ISP
We manage and control our own infrastructure. Our ISP of choice is Internap, with whom we have multiple uplinks and private BGP peerings per data center location. Internap’s model differs from that of traditional tier 1 ISPs in that they do not have their own physical network. Instead, they peer with numerous ISPs at each of their locations and run their own route optimization technology that, unlike plain BGP, takes latency into account. In many ways, they provide us outsourced BGP peering management, with a NOC solely focused on peering with other providers.
Even with Internap’s redundant peering, ISP transit problems can still occur among their direct peers, though they are exceptionally quick to identify issues. So the question remains: how do you minimize erroneous Collector down alerts due to transit ISP problems?
One idea we considered (as some of us had successfully implemented it at other SaaS companies) was to obtain transit from other ISPs, but not use BGP to route over them, instead using policy-based routing. If a packet comes in via ISP A, policy-based routing can be used to send the return packets out via ISP A, regardless of what the shortest BGP path dictates. If the Collector tries all the routes to its destination (BGP-based and policy-based), the chance of a connection increases, as it would not be subject to last-hop BGP convergence and may be able to use different paths in and out via the different ISPs.
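To illustrate just the Collector-side "try every path" piece of that idea, here is a hedged Python sketch. It assumes the host has one local address per ISP uplink and that the operating system policy-routes each address out its corresponding provider; the addresses and destination are placeholders, not anything from our actual deployment.

```python
import socket

# Placeholder values: one local source address per ISP uplink, each of
# which the OS policy-routes out a different provider.
SOURCE_ADDRESSES = ["192.0.2.10", "198.51.100.10"]
DESTINATION = ("datacenter.example.com", 443)

def connect_via_any_path(timeout: float = 5.0) -> socket.socket:
    """Try each source address (and therefore each ISP path) until one works."""
    last_error = None
    for src in SOURCE_ADDRESSES:
        try:
            # Binding to a specific source address selects the policy-routed path.
            return socket.create_connection(DESTINATION, timeout=timeout,
                                            source_address=(src, 0))
        except OSError as err:
            last_error = err  # this path failed; try the next ISP
    raise ConnectionError(f"all paths failed: {last_error}")
```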
Our ‘Simpler’ Solution: Proxies built on Amazon’s Elastic Compute Cloud (EC2)
We could also have started peering directly with additional ISPs, on the theory that more connections means a greater likelihood of transit success. But a simpler solution, one that avoids having to manage our own BGP peerings, was to deploy a number of proxy servers in various EC2 regions. The proxies receive requests from Collectors and forward them (storing nothing on disk) to LogicMonitor’s data centers.
A single domain name is configured with four ‘A’ records, each pointing to the public IP of a proxy. The Collectors are configured to look up the name and randomly select one of the IP addresses if they are unable to communicate directly with our data centers. If a specific proxy is unreachable, or if the proxy reports that it cannot transfer a request to our data center, the next ‘A’ record (proxy) is tried. The path to each proxy is often different because of the various routes from the EC2 locations, and from each location’s next hop before Internap, to our data centers. (For example, the direct route from Australia to Los Angeles is likely to use different ISPs than the routes from Australia to Singapore, and from Singapore to Los Angeles.) The Collector falls back to the direct path once its probes establish that connectivity has been restored.
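As a rough illustration of that proxy role (not our actual implementation), a minimal in-memory TCP relay could look like the following Python sketch; the hostname and ports are placeholders.

```python
import socket
import threading

LISTEN_PORT = 8443                               # placeholder listener port
UPSTREAM = ("datacenter.example.com", 443)       # placeholder data center endpoint

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until either side closes; nothing touches disk."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # the other direction closed the connection
    finally:
        src.close()
        dst.close()

def handle(client: socket.socket) -> None:
    """Relay one Collector connection to the data center in both directions."""
    upstream = socket.create_connection(UPSTREAM)
    threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
    threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

def serve() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", LISTEN_PORT))
    server.listen()
    while True:
        client, _ = server.accept()
        handle(client)

if __name__ == "__main__":
    serve()
```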
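A simplified Python sketch of that Collector-side fallback might look like the following; the proxy hostname and port are placeholders, and the real logic also has to decide when to give up on the direct path and when to return to it.

```python
import random
import socket

PROXY_NAME = "proxy.example.com"   # placeholder: one name, four 'A' records
PROXY_PORT = 8443                  # placeholder port

def resolve_proxy_ips() -> list:
    """Return all A-record addresses published for the proxy name."""
    infos = socket.getaddrinfo(PROXY_NAME, PROXY_PORT,
                               socket.AF_INET, socket.SOCK_STREAM)
    return list({info[4][0] for info in infos})

def connect_through_proxy(timeout: float = 5.0) -> socket.socket:
    """Pick proxies in random order and try each until one is reachable."""
    ips = resolve_proxy_ips()
    random.shuffle(ips)
    last_error = None
    for ip in ips:
        try:
            return socket.create_connection((ip, PROXY_PORT), timeout=timeout)
        except OSError as err:
            last_error = err  # this proxy unreachable; try the next 'A' record
    raise ConnectionError(f"no proxy reachable: {last_error}")
```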
This simple yet effective setup routes around various peering issues. It reduces erroneous Collector down alerts due to transit ISP issues and preserves one of the maxims of monitoring: “Do not send out bogus alerts.”
We extensively monitor these proxies and the amount of traffic proxied through them. Below is a LogicMonitor graph of the daily requests to our proxies:
On rare occasions when an ISP in close proximity to us experiences transit problems, a larger swath of Collectors can be affected. At these times we see the number of requests to the proxies jump, but the number of Collectors declared down does not:
Anytime the proxies are carrying a Collector’s traffic for more than a few minutes, we have routed around an Internet transit issue, and avoided a false alert.
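Expressed as a rule (with hypothetical names and a three-minute threshold standing in for "a few minutes"), the inference is roughly:

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical sketch: if a Collector has been reaching us only through a
# proxy for more than a few minutes, record it as a transit issue we routed
# around rather than declaring the Collector down.
PROXIED_THRESHOLD = timedelta(minutes=3)

def routed_around_transit_issue(proxied_since: Optional[datetime]) -> bool:
    """True when sustained proxy use indicates an Internet transit problem."""
    return (proxied_since is not None
            and datetime.utcnow() - proxied_since > PROXIED_THRESHOLD)
```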
It is our belief that hybrid infrastructure will become the norm, encompassing corporate applications that span on-premises and public cloud servers, as well as applications built partially on cloud-provided services (such as DNS, email, BGP, databases, or, in our case, monitoring).
Utilizing public clouds effectively and securely takes a great amount of effort. In a subsequent post we will go into the setup of the individual proxies, which, in order to fit our security model, involves internal and external networks, VPNs, and Linux-based policy routing.
Interested to try LogicMonitor in your environment and start getting instant visibility into your IT stack? Get your free trial here.
Questions, comments or related IT tales? Chime in below!