When a network outage occurs, everyone is running around like headless chickens trying to figure out why traffic isn’t flowing the way it should, while angry calls come in from all directions. Post-calamity, you are tired and you know what’s up next: the post-mortem or RCA. You’ll need to spend days collecting data from all the relevant individuals and composing a document that details not only what happened but also how you’ll avoid it in the future.
We all know the pain
Of course, the problem in most cases is that the “how you’ll avoid it in the future” part is useless: “Change process A so extra check B is done by individual C”, or “From now on, approval of a maintenance window for a change of type X needs to be done by Y engineers”, and so on.
Nothing has really changed.
The Holy Grail?
indeni is a product that collects known post-mortems, analyses and issues. We have an intelligent system that has a database of these and searches for them in your network before they become a major outage. Misconfigurations, suspicious performance stats, indicative logs – all those could be identified by a system that knows what to look for.
In Cisco-speak, think “sh run”, “sh ip eigrp neighbors detail”, “sh license”, etc. In Check Point-speak, think “cpstat fw”, “fw ctl get int”, “cphaprob stat”. In F5 think “show /cm sync-status”. And the list goes on.
indeni ingests enormous amounts of data, correlates it between devices (including devices made by different vendors) and makes intelligent deductions based on known facts – not unlike Sherlock Holmes.
indeni has spent five years perfecting a platform that consumes knowledge and then activates a method to find issues in the network equipment’s configuration before downtime occurs. We call the outcome of this a “pre-mortem”.
Within 45 minutes you can set up a virtual machine to analyze the configuration, logs and stats of your network devices such as Cisco routers & switches, Check Point firewalls, F5 ADCs, Juniper SSGs and Fortinet Fortigates (and the list keeps expanding). The output from the system is a list of issues found – their causes, what could happen if left as-is and what you should do to fix it (specific CLI commands, a manufacturer knowledge base article, etc.).
Convinced? Take us for a spin. Not yet convinced? Want to go deeper? Keep reading.
How does this work?
(10,000 foot view, more details further below)
We consume knowledge from three main sources:
- Manufacturer knowledge bases and forums – yes, those websites your Google search takes you to. We’ve built an automated system that can understand much of what’s written there and look for the specific issues in network equipment.
- Users requesting things – when we work with a new customer we ask them “what are the top issues you ran into in the past and would like to avoid?”. For those we don’t yet cover, we quickly add this new information to our knowledge base so that new and existing customers won’t have the same issues. This results in a network effect within our customer base – each new customer immediately benefits other customers.
- Data out of users’ networks – most users enable a service we call indeni Insight. It allows the instance of indeni in their network to send data it collects to our data center – such as inventory information, configuration details, logs observed, alerts issued, etc. All of the data is scrubbed of confidential information (such as IP addresses) and then analyzed as a whole. For example – we can automatically detect best practices when configuring core Cisco switches by comparing the configurations of thousands of 6509s.
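To make the idea of removing confidential information concrete, here is a minimal Python sketch that masks anything that looks like an IPv4 address in a line of configuration. This is not how indeni Insight is implemented; the `scrub` function, the regular expression and the `<ip>` placeholder are assumptions made up for illustration.

```python
import re

# Matches dotted-quad strings; octet ranges are deliberately not validated here.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(text: str, placeholder: str = "<ip>") -> str:
    """Replace anything that looks like an IPv4 address with a placeholder."""
    return IPV4.sub(placeholder, text)


# Hypothetical config line using documentation-range addresses.
print(scrub("ip address 192.0.2.10 255.255.255.0"))
```

Note that the netmask is masked too, since it has the same dotted-quad shape. A real scrubber would also need to handle IPv6 addresses, hostnames, usernames and so on.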
indeni sends out email alerts to a shared mailbox (or SNMP traps, if you prefer) that contain the details of what was found and the steps for remediation. We don’t issue the same alert twice for a given issue on a given device at a given time. Instead of the thousands of SNMP traps you get today that are largely ignored, we send very concise alerts, and in most environments you’ll get no more than a couple a day. This allows you to really give them the attention they deserve.
What’s more – if it ain’t actionable, it ain’t alerted. If we find something that you can’t act on, we won’t tell you about it. Otherwise, what’s the point?
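The "never the same alert twice" rule can be pictured as a small piece of bookkeeping keyed on the device and the issue. The Python sketch below is only an illustration of that idea, not indeni's actual logic; the `AlertDeduplicator` class, its method names and the example device and issue names are all hypothetical.

```python
class AlertDeduplicator:
    """Tracks open (device, issue) pairs so each is alerted on only once."""

    def __init__(self):
        self._open = set()

    def should_send(self, device: str, issue: str) -> bool:
        key = (device, issue)
        if key in self._open:
            return False  # already alerted and still unresolved
        self._open.add(key)
        return True

    def resolve(self, device: str, issue: str) -> None:
        # Once an issue is fixed, a later recurrence may alert again.
        self._open.discard((device, issue))


dedup = AlertDeduplicator()
for device, issue in [("fw1", "license-expiring"),
                      ("fw1", "license-expiring"),
                      ("fw2", "license-expiring")]:
    if dedup.should_send(device, issue):
        print(f"ALERT {device}: {issue}")
```

Here the second sighting of the same issue on `fw1` is suppressed, while the same issue on a different device still produces its own alert.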
How does it work 2? Even deeper into the technology…
Let’s give an example of a common issue.
You deploy a Check Point cluster of firewalls and connect them to a Cisco router. The way Check Point clustering works, there is a virtual IP that is tied to the physical MAC of the active firewall (NOTE: this behavior is the default which can be, but rarely is, changed). When there is a failover, the newly active member sends a gratuitous ARP (GARP) reply with the same virtual IP but a new MAC (its physical MAC). A GARP reply is like an ARP reply without an ARP request. The idea is to tell devices something has changed on the layer 2 level without waiting for them to issue an ARP request again.
Why is this important? Well, if you don’t issue the GARP, or it is dropped by the receiving device, then traffic keeps getting sent to the old MAC address (the original active member) and nobody gets it. This will keep happening until the ARP entry times out on the router and an ARP request is sent again. This can be 4 hours (!!). Imagine a failover that instead of going smoothly results in 4 hours of downtime. Eeek.
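The effect is easy to see in a toy model. The Python sketch below simulates a router's ARP cache and computes how long traffic keeps going to the old MAC when the GARP is accepted versus dropped. It is a deliberately simplified illustration: the `RouterArpCache` class, the `seconds_blackholed` function and the addresses are made up, and it assumes the entry is never refreshed by other traffic before it times out.

```python
from dataclasses import dataclass, field

ARP_TIMEOUT_S = 4 * 60 * 60  # the 4-hour ARP timeout from the scenario above


@dataclass
class RouterArpCache:
    """Toy model of a router's ARP cache (hypothetical, for illustration)."""
    accept_garp: bool = True
    entries: dict = field(default_factory=dict)  # ip -> (mac, learned_at_seconds)

    def learn(self, ip: str, mac: str, now: int) -> None:
        self.entries[ip] = (mac, now)

    def on_garp(self, ip: str, mac: str, now: int) -> None:
        # A GARP reply only updates the cache if the router accepts it.
        if self.accept_garp:
            self.learn(ip, mac, now)


def seconds_blackholed(cache, vip, new_mac, failover_at):
    """Seconds of traffic sent to the old MAC after a failover at `failover_at`."""
    cache.on_garp(vip, new_mac, failover_at)
    mac, learned_at = cache.entries[vip]
    if mac == new_mac:
        return 0  # the router already points at the new active member
    # Otherwise the stale entry lives on until it times out.
    return max(0, learned_at + ARP_TIMEOUT_S - failover_at)


for accept in (True, False):
    cache = RouterArpCache(accept_garp=accept)
    cache.learn("192.0.2.1", "00:00:5e:00:53:01", now=0)  # old active member
    outage = seconds_blackholed(cache, "192.0.2.1", "00:00:5e:00:53:02", failover_at=60)
    print(f"accept_garp={accept}: {outage} seconds of lost traffic")
```

With the GARP accepted the outage is zero. With it dropped, an entry learned one minute before the failover stays stale for almost the full four hours.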
The default, by the way, for a Cisco router is to accept GARP replies. We usually see this functionality disabled during security audits.
So how does indeni help here? This known issue is a case in our knowledge base that our system knows how to look for. Since we can see which routers a firewall is connected to, we know what their ARP caches look like. We also know when a failover occurs in a cluster. So, when we see a failover in a Check Point firewall cluster, we look for a few things:
- Did the ARP entry for the virtual IP update on the routers?
- Is the new member receiving the traffic it should?
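As a rough illustration, the two checks above could be expressed like this in Python. This is a sketch under stated assumptions, not indeni's implementation: the `check_failover` function, its parameters, the 10% `min_ratio` threshold and the finding texts are all hypothetical, and the inputs are assumed to have been collected from the router and the firewalls already.

```python
def check_failover(router_arp_mac, new_active_mac, rx_before, rx_after, min_ratio=0.1):
    """Return a list of findings for one router/firewall pair after a failover."""
    findings = []
    # Check 1: did the router's ARP entry for the virtual IP move to the new member?
    if router_arp_mac.lower() != new_active_mac.lower():
        findings.append("ARP entry for the virtual IP still points at the old member")
    # Check 2: is the new member receiving roughly the traffic the old one did?
    # min_ratio is an assumed threshold, not a documented indeni value.
    if rx_before > 0 and rx_after < rx_before * min_ratio:
        findings.append("RX traffic drastically reduced after failover")
    return findings


findings = check_failover(
    router_arp_mac="00:00:5e:00:53:01",  # what the router still has cached
    new_active_mac="00:00:5e:00:53:02",  # physical MAC of the newly active member
    rx_before=1_000_000,                 # packets received by the old member
    rx_after=0,                          # packets received by the new member
)
for finding in findings:
    print(finding)
```

An empty list means the failover looks healthy from the layer 2 point of view; any finding is a candidate for an alert.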
If we see an issue, we send an alert that describes the problem:
RX traffic drastically reduced post failover, possible ARP issue
A failover was identified at Device time: Sep 18 00:31 2013 UTC, Indeni time: Sep 17 20:31 2013 EDT. This device is now the active member of the cluster and in the period immediately following the failover (3 minutes more or less) it received 0 packets compared to 2067098 packets that were received by sfdc-wanfw1 (18.104.22.168) in a similar amount of time immediately BEFORE the failover. This indicates the possibility that the surrounding network equipment may not be aware of the failover on the layer 2 level.
Manual Remediation Steps:
It is possible this is caused by the fact that during a failover the responsibility for the virtual IPs moves from one cluster member to the other and the MAC addresses change. ClusterXL issues gratuitous ARPs to deal with this but it may not work with your equipment. Please review SK50840 for more information.
Potential 4-hour downtime reduced to 1 minute.
Now do you want to take us for a spin?