In a previous post, I went over using Nutanix Protection Domains as a means for DR failover/site migration activities. Protection Domains are powerful, providing schedules, consistency groups, and incremental replication at intervals as short as 1 minute. What Protection Domains do not provide is orchestration: once the migration or site failover has occurred, it is up to the user to understand the proper power-on procedure for each application and to make any required IP changes.
In a small environment this may be okay, but in a very large environment, with teams under pressure to bring systems online as fast as possible, getting the power-on procedure right requires…
1) A complete understanding of the applications involved in the failover.
2) A complete understanding of how to power on each application and the proper order.
3) Multiple team members on a Webex/Zoom meeting to assist in understanding the environment as our business continuity admins bring up the environment.
4) Network architects to be sure IP changes are being addressed (assuming this is not a stretch layer-2 subnet where IPs can persist across prod and DR).
I am likely missing more. The basic gist is that we need a ton of experts rallying around our virtualization engineers to pull off a successful disaster recovery. Typical businesses that I interact with still allow 24-48 hours or more to bring up their environments in DR. In today's world, where virtualization is at minimum 95% of our environment, we should demand more, and we are capable of more: businesses could be back online in minutes with business operations fully restored. The old adage “time is money” applies directly to how IT departments help the business respond to failures.
With that being said, this post will describe how Nutanix customers can build their recovery plans into code! Because Nutanix controls the entire infrastructure stack, it has unique insight into the data plane, which allows for simple, performant, and consistent disaster recovery.
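To make the idea of a runbook as code concrete, here is a minimal sketch in plain Python. It is not a Nutanix API; the tier names and the `DEPENDS_ON` table are hypothetical stand-ins for a 3-tier application. It only shows how the "proper power-on order" from the list above can be written down once as data instead of living in people's heads.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical 3-tier application: each tier maps to the tiers it depends on.
DEPENDS_ON = {
    "database": set(),
    "app": {"database"},
    "web": {"app"},
}


def power_on_stages(depends_on):
    """Group tiers into ordered stages.

    Every tier in a stage only depends on tiers from earlier stages,
    so the tiers inside one stage could be powered on together.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    for number, tiers in enumerate(power_on_stages(DEPENDS_ON), start=1):
        print(f"Stage {number}: power on {', '.join(tiers)}")
```

Once the order is captured like this, nobody has to remember it during an outage, which is the same benefit the Recovery Plan built later in this post is meant to give.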
PREREQS: The following assumes that we have applications built and placed into Prism Central categories. To quickly stand up a 3-tier application I used Nutanix CALM to perform the entire application build-out and to apply Prism Central categories during the deployment. This means these VMs (if deployed after configuring protection policies) will immediately start replicating to the target DR site after CALM deploys the application. Look for a future post on CALM 🙂
1. The first thing we need is a Prism Central to manage the availability zone, protection policies, and recovery plans. In Prism Central, go to Availability Zones -> Connect to Availability Zone
2. Next we need to decide whether we will DR to the Xi Cloud (Nutanix managed public cloud service) or to another on-premises Nutanix cluster in our DR site. Select your desired replication target. (For this exercise I am using the Nutanix hosted POC lab, which will act as an on-premises target.)
Fill out your target Prism Central details.
Verify your remote site is Reachable.
3. Now we need to build our protection policies. First click ‘Create Protection Policy’.
Next we need to fill out all of the details for the Protection Policy:
- Recovery Location
- Recovery Point Objective RPO
- Retention Policy
- Categories associated to this Protection Policy
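As a rough illustration of what those four fields capture, the sketch below models a protection policy as plain Python data. Everything here is an assumption made for illustration: the class, its field names, the choice to express retention as a count of recovery points, and the category-matching rule are hypothetical and are not the Prism Central API or its exact semantics.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ProtectionPolicy:
    """Hypothetical model of the fields filled out in the dialog."""
    name: str
    recovery_location: str   # target availability zone
    rpo_minutes: int         # Recovery Point Objective
    retention_count: int     # recovery points to keep (assumed unit)
    categories: dict = field(default_factory=dict)  # category name -> value

    def validate(self):
        """Return a list of human-readable problems; empty means OK."""
        problems = []
        if not self.recovery_location:
            problems.append("recovery location is required")
        if self.rpo_minutes < 1:
            problems.append("RPO must be a positive number of minutes")
        if self.retention_count < 1:
            problems.append("retention must keep at least one recovery point")
        if not self.categories:
            problems.append("at least one category is required")
        return problems

    def protects(self, vm_categories):
        """Assumed rule: a VM is covered if it carries any policy category."""
        return any(
            vm_categories.get(key) == value
            for key, value in self.categories.items()
        )


if __name__ == "__main__":
    policy = ProtectionPolicy(
        name="three-tier-app",
        recovery_location="DR-Prism-Central",  # hypothetical name
        rpo_minutes=60,
        retention_count=3,
        categories={"AppType": "ThreeTier"},   # hypothetical category
    )
    print(policy.validate() or "policy looks complete")
    print(policy.protects({"AppType": "ThreeTier", "Environment": "Prod"}))
```

The point of the category field is that VMs are never listed by name: any VM tagged with a matching category is picked up by the policy, which is why the CALM-deployed VMs from the prerequisites start replicating without further work.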
Once saved we should see the below…
4. Now we need to create a Recovery Plan. First go to the Recovery Plan page in Prism Central and click Create New Recovery Plan.
NOTE: This is the DR as code piece where we actually define the DR process and runbook recovery into code via the Prism Central GUI.
Pick your Recovery Location
This will be done in 3 steps. First give a Name.
Next add your entities. This is either a single VM or a group of VMs that were placed in Prism Central categories.
Now we build out the Network Mapping.
Here is an example of what this means…
Map your Prod to Prod-DR and Test to Test-DR.
Running a “Test Failover” means I can build my entire production environment in DR inside of a predetermined “Test Network” where we can do a DR test while keeping Production online!
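The mapping logic can be sketched in a few lines of Python. This is a simplified model, not the Recovery Plan implementation: the `NetworkMapping` class and `target_network` function are hypothetical, and only the network names (Prod, Test, Prod-DR, Test-DR) come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkMapping:
    """Hypothetical model of one network mapping row in a recovery plan."""
    source_production: str
    source_test: str
    recovery_production: str
    recovery_test: str


def target_network(mapping, vm_network, test_failover):
    """Pick the DR-side network for a VM that lives on vm_network.

    A real failover lands on the recovery production network.
    A test failover lands on the isolated recovery test network,
    so the production VMs at the source can stay online.
    """
    if vm_network != mapping.source_production:
        raise ValueError(f"no mapping defined for network {vm_network!r}")
    return mapping.recovery_test if test_failover else mapping.recovery_production


if __name__ == "__main__":
    mapping = NetworkMapping(
        source_production="Prod",
        source_test="Test",
        recovery_production="Prod-DR",
        recovery_test="Test-DR",
    )
    print(target_network(mapping, "Prod", test_failover=False))
    print(target_network(mapping, "Prod", test_failover=True))
```

Raising an error for an unmapped network is a deliberate choice in this sketch: a VM with no defined target network is better caught when the plan is written than discovered during a failover.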
5. Once the Recovery Plan is saved, you should shortly see it appear on the target DR site (as seen below).
In a following post I will show how we perform a complete DR failover which is now completely written into code.