Metro Availability for AHV – Part 2 (Primary Site Failure Test)

In Part 2 we will walk through and validate an automatic failover using Metro Availability on AHV. Metro Availability combines two important technologies, described below.

Synchronous replication is used when applications require zero data loss and must stay consistent across two locations. Data is replicated as writes occur on the source VMs, and those writes are not acknowledged until they are committed on the target Nutanix cluster.
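To make the acknowledgement rule concrete, here is a minimal toy model in Python. The Cluster class and synchronous_write function are hypothetical names invented for illustration and are not Nutanix APIs; the sketch only shows that the caller receives an acknowledgement after both sides have committed.

```python
class Cluster:
    """Toy stand-in for a storage cluster (hypothetical, not a Nutanix API)."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.committed = []

    def commit(self, data):
        # An unreachable cluster cannot commit anything.
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.committed.append(data)


def synchronous_write(source, target, data):
    """Return an acknowledgement only after BOTH clusters have committed."""
    source.commit(data)
    target.commit(data)  # raises if the target is unreachable, so no ack is returned
    return "ack"
```

In this model a write that cannot reach the target is never acknowledged. How a real cluster reacts to an unreachable peer (pausing, timing out, or breaking replication) is outside the scope of the sketch.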

Metro Availability refers to the automation layer, which requires zero administrator interaction in the event of a primary site failure. It uses a witness (built into Prism Central) to monitor site health, then orchestrates preconfigured Recovery Plans to flip the environment to the target site.
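The witness idea can be sketched as a simple timer check: failover is declared only once the primary has been silent for a full detection window. The function name, the heartbeat timestamps, and the 30-second default (the value used in this test's Recovery Plan) are assumptions for illustration, not a description of Prism Central internals.

```python
def should_fail_over(last_heartbeat, now, detection_window=30.0):
    """Return True once the primary has been silent for the whole window.

    All arguments are in seconds; the names are hypothetical.
    """
    return (now - last_heartbeat) >= detection_window


# A primary last heard from at t=100 is not failed over at t=120,
# but is at t=130 with the default 30-second window.
early = should_fail_over(last_heartbeat=100.0, now=120.0)
late = should_fail_over(last_heartbeat=100.0, now=130.0)
```

Waiting out a full window before acting is what keeps a brief network blip from triggering an unnecessary site failover.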

The Test Environment

This test environment consists of the following:

Source Cluster – Nutanix 4x NX-3060-G5 running AOS 6.5.3.6

Target Cluster – Nutanix 4x NX-3060-G5 running AOS 6.5.2.5

Prism Central – Manages both clusters above, running PC version pc.2023.1.0.2

Application – Simple two-tier web application with a web server VM and a MySQL DB server VM. The source cluster is PHX-POC018.

Protection Policy – Set up for synchronous replication between the two clusters

Recovery Plan – Set to Automatic, with a failover detection of 30 seconds. Two stages: the SQL DB comes up first, then after a 60-second wait the web server comes up. Networks are not stretched (in production, stretched networks make failovers simpler because no script has to be injected to update the static IPs).
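The staged power-on order in the Recovery Plan can be modelled with a few lines of Python. This is only a sketch of the sequencing logic: the VM names, the PLAN structure, and the power_on callback are hypothetical, and the real plan is configured in Prism Central rather than in code.

```python
import time


def run_recovery_plan(stages, power_on, sleep=time.sleep):
    """Power on each stage in order, pausing after a stage if it defines a delay.

    stages   -- list of (vm_names, delay_after_seconds) tuples
    power_on -- callback invoked once per VM name (hypothetical hook)
    sleep    -- injectable so the sequence can be exercised without waiting
    """
    started = []
    for vm_names, delay_after in stages:
        for vm in vm_names:
            power_on(vm)
            started.append(vm)
        if delay_after > 0:
            sleep(delay_after)
    return started


# Mirrors the plan in this test: database first, 60-second wait, then web server.
PLAN = [
    (["mysql-db"], 60),  # hypothetical VM name
    (["web"], 0),        # hypothetical VM name
]
```

Calling run_recovery_plan(PLAN, power_on=print) would print the two VM names in order with a 60-second pause between them. The delay gives the database time to accept connections before the web tier starts.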

Recovery Plan Page 1 – environment details/Failure Execution Mode
Application Recovery Power on Sequence
Network Setup

Failover Test – Primary Site Failure

In the following steps I will force-stop the primary cluster by performing a hard power-off from each node’s IPMI (OOB management) and monitor the failover progress in Prism Central.

!WARNING! Don’t perform the following steps in production. 😅

Diagram of Site 1 Failure and corresponding HA event to Site 2

1. Force power off NX nodes from IPMI

Host Powered On
Host Powered Off

2. Back in Prism Central, we watch the tasks to see the Unplanned Failover get initiated automatically after the 30 seconds we defined in the Recovery Plan

After 30 seconds the Unplanned Failover triggered by the PC Witness Service started
Recovery Plan runs through all steps that were defined prior to the failure
Failover complete after only 2 minutes 30 seconds
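As a rough sanity check on that figure, the configured waits can be subtracted from the observed total. This assumes the 2 minutes 30 seconds is measured from the moment of failure and therefore includes the 30-second detection window, which the test output does not state explicitly; the function name is hypothetical.

```python
def remaining_budget(total_s, detection_s, stage_delay_s):
    """Seconds left for everything other than the configured waits."""
    return total_s - detection_s - stage_delay_s


observed_total = 2 * 60 + 30  # 150 seconds, from the failover task above
other_work = remaining_budget(observed_total, detection_s=30, stage_delay_s=60)
print(other_work)  # 60
```

Under that assumption, about a minute covers the actual recovery work, such as bringing the VMs up on the target cluster and applying the network changes.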

3. Confirm Application is up and online now running on the Target cluster

VMs now running on PHX-POC016
Web application is up and online running on the x.x.16.x network as defined in the Recovery Plan

4. After bringing the failed cluster nodes back online, we can see within a few minutes (once all CVM services start) that our VMs are synced again from PHX-POC016 to PHX-POC018

Conclusion

The above test shows that an entire site failover completed with no user interaction in 2 minutes and 30 seconds. All of this was managed in a single pane of glass through Prism Central using Nutanix’s own hypervisor, AHV. For companies requiring the lowest possible RPO/RTO for mission-critical applications, such as those running in manufacturing environments, Metro Availability on AHV will provide the highest level of resiliency for applications that cannot provide app-level consistency.
