An untested failover plan is a failover plan that won’t work when you need it. You know it’s true, and it’s been nagging you for a long time now.
The Datanauts dedicate today’s show to examining why, how, and what to test. We discuss reasons for failover testing, offer tips on making and executing a plan, and cover what to do before, during, and after the test to get the most benefit from the exercise.
Think of this show as a gentle nudge to get that test scheduled.
This episode of Datanauts is brought to you by ITProTV. Enhance your technology aptitude. ITProTV is the resource to keep your I.T. skills up to date, with engaging and informative video tutorials. For a free 7-day trial and 30% off the life of your account, go to itpro.tv/datanauts and use the code DATANAUTS30.
The Datanauts are sponsored by Altaro Software, developers of virtual backup software trusted by over 30,000 SMBs. If you need an easy-to-use and affordable Hyper-V and VMware backup solution, try Altaro VM Backup for 30 days. Visit go.altaro.com/datanauts/ and throughout the month of June Datanauts listeners will get a free Altaro t-shirt. Plus, after the 30-day trial you can back up 2 VMs for free, forever!
Part 1 – Why Test Failover?
- Infrastructure changes. (Did something change to break failover? Is new capacity being used?)
- Data centers aren’t built to match. (Data center B was built from old DC A stuff.)
- Capacity planning – traffic loads change. (Can the servers handle the load? What about the services? What about network devices? What about network circuits?)
- Applications change. (Will they still operate as expected when failed over?)
- IT processes are often sloppy, meaning docs aren’t getting updated. (Do we even know how to fail over anymore?)
- Tribal knowledge walked out the door with the previous set of engineers
- Docs are out of date the moment they are written
- Too many moving parts with fragile dependencies. (It worked yesterday, will it work today?)
- New products come along to improve DR scenarios
- Services may be poorly monitored. (Something’s down and you don’t know it because you don’t normally use it.)
- Infrastructure monitoring often covers environmental details, but not applications
- Application performance is often missed – monitoring captures only up/down statistics
- Everyone needs to practice fire drills
- Expensive, stretched infrastructure that is designed to fail over between data centers
- Have you tested that feature / functionality?
- Do you know it is configured properly and will survive an unplanned outage?
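On the monitoring point above: a minimal sketch of a check that records application latency alongside HTTP status, rather than just up/down. The URL, slow threshold, and injectable `fetch` function are hypothetical placeholders, not anything discussed on the show.

```python
# Sketch: an application check that reports latency, not just "port open".
# The URL and threshold below are hypothetical examples.
import time
import urllib.request

def check_app(url, slow_threshold=2.0, fetch=urllib.request.urlopen):
    """Return (status, latency_seconds, verdict) for one request."""
    start = time.monotonic()
    try:
        with fetch(url, timeout=10) as resp:
            status = resp.status
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return (None, time.monotonic() - start, "down")
    latency = time.monotonic() - start
    if status != 200:
        return (status, latency, "error")
    if latency > slow_threshold:
        # Up, but degraded -- exactly the case up/down checks miss.
        return (status, latency, "slow")
    return (status, latency, "healthy")
```

A "slow" verdict is the interesting one: the service is technically up, which is all a simple up/down monitor would report.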
Part 2 – How To Test Failover
- On a regular schedule
- Hold yourself to an SLA for testing
- Good testing = less stress and more understanding of failover scenarios
- Schedule a maintenance window
- Handy in case things don’t go well
- Communicate – set both internal and external customer expectations
- Have a blocker – someone with executive or director-level buy-in who shields the IT team from annoying questions and status updates
- Test real load conditions. Only under load do certain problems show up.
- Available listeners to handle inbound connections
- Network capacity
- Load balancer configuration
- Deep packet inspection bottlenecks
- Licensing (for application or users on the system)
- Storage traffic is critical for workloads that are going to “float” to the other DC
- Understand how your management tier is going to enforce failover domains (vCenter, SCVMM, infrastructure domains, etc.)
- If possible, leave your traffic at the opposite data center for 24 or more hours. Just let it run, while monitoring closely.
- Example: follow-the-sun computing
- Get to the point where DC A vs. DC B is irrelevant to your applications. Mix it up at will.
- Treat the DC like a pool of resources
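On testing under real load conditions: a hedged sketch of a tiny concurrent request generator for exercising the standby site. The URL and counts are hypothetical, and a real exercise would use a proper load-testing tool; this just illustrates the idea of pushing parallel traffic and tallying the results.

```python
# Sketch: fire concurrent requests at the failover site and tally
# response statuses. URL and counts are hypothetical placeholders.
import concurrent.futures
import urllib.request

def fire_requests(url, total=100, workers=10, fetch=urllib.request.urlopen):
    """Issue `total` requests with `workers` threads; return status counts."""
    def one(_):
        try:
            with fetch(url, timeout=10) as resp:
                return resp.status
        except OSError:
            return "error"
    counts = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for status in pool.map(one, range(total)):
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Errors that only appear here – listener exhaustion, load balancer misconfiguration, licensing caps – are exactly the problems the bullet list above warns about.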
Part 3 – What To Verify
- Verify time to switch over with actual clients. DNS + short TTL should mean a fast hostname change, but many clients ignore TTLs and cache the old record anyway
- vMotions take time. Multiple vMotions take more time. Hint: massive vMotion is not a failover strategy. Assume DC A is instantly black — you won’t have time to vMotion anything.
- Verify database synchronization
- Verify storage synchronization and availability
- Make sure that IO is not going back to the “dark” data center
- Try to terminate any access to the original storage array
- Verify that there are NO inter-DC dependencies. Failover is not failover if both DCs must be up to support the application.
- When failing back, how long before the data sets are synced again? Take note – is the time reasonable for normal business operations? Make sure you haven’t outgrown the solution you initially deployed.
- Verify that more than one person can handle a failover. No Brents allowed.
- Phoenix Project
- A backup person doesn’t have to be someone dedicated full-time to being a backup – use role sharing and knowledge sharing
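On verifying switchover time with actual clients: a hedged sketch that polls DNS until a hostname resolves to the failover site’s address, reporting how long the change actually took to become visible. The hostname and addresses are hypothetical, and the resolver, clock, and sleep functions are injectable so the loop can be tested without a live DNS change.

```python
# Sketch: measure client-observed DNS switchover time during a failover
# test. Hostname and address are hypothetical placeholders.
import socket
import time

def wait_for_dns_switch(hostname, new_address, timeout=300, interval=5,
                        resolve=socket.gethostbyname,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll DNS until `hostname` resolves to `new_address`.

    Returns seconds elapsed before the new record was observed,
    or None if the timeout expired first.
    """
    start = clock()
    while clock() - start < timeout:
        try:
            if resolve(hostname) == new_address:
                return clock() - start
        except socket.gaierror:
            pass  # transient resolution failure; keep polling
        sleep(interval)
    return None
```

Run this from several client networks during the test: a short TTL on the record only helps if resolvers and clients along the path actually honor it.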