Shrinking a Nutanix Cluster

 

In this post I will walk through how we would remove a node from an existing Nutanix cluster with a couple of clicks and monitor the progress of the node removal.

You may ask, “Why would I want to remove a node from an existing cluster?” One scenario is when you have multiple clusters and one of them needs resources, while another can give up a node and still be able to self-heal from a node failure. Whatever the reason, the ability to add and remove resources on the fly and without downtime is one of the primary reasons that the shift to a cloud consumption model is paying off for companies all across the globe.
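To make the "can this cluster give up a node?" question a bit more concrete, here is a minimal Python sketch of the reasoning. It is not a Nutanix tool or API. The function name, the capacity figures, and the assumption that every node offers the same usable capacity are all hypothetical; the only value taken from this post is the 3-node minimum mentioned in the notes further down.

```python
def can_remove_node(node_count: int,
                    used_gib: float,
                    usable_gib_per_node: float,
                    min_nodes: int = 3) -> bool:
    """Rough check: can the cluster lose one node by removal and still
    have room to rebuild after one further node failure?

    Assumes identical nodes; used_gib is the total physical data
    stored across the cluster, replicas included.
    """
    remaining = node_count - 1
    if remaining < min_nodes:
        return False
    # Capacity left if one of the remaining nodes then fails.
    capacity_after_failure = (remaining - 1) * usable_gib_per_node
    return used_gib <= capacity_after_failure


# Hypothetical figures, purely for illustration.
print(can_remove_node(node_count=4, used_gib=1000, usable_gib_per_node=2000))
print(can_remove_node(node_count=3, used_gib=1000, usable_gib_per_node=2000))
```

With these made-up numbers the 4-node cluster passes and the 3-node cluster does not, since removing a node would drop it below the minimum. A real decision should rest on what Prism and NCC report for your own cluster.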

Node Removal Process

1. First, go to the ‘Health’ page and run NCC to ensure the cluster is in a healthy state before starting the node removal process.

2. Next, go to the ‘Hardware’ view and select the ‘Table’ tab to view all of the hosts. Find the host you would like to remove and select it.

3. At this point you can click Remove Host. Notice that the node I selected for removal holds 246.94 GiB of data, which must be migrated off before the node removal can complete.

Review the warning message, which states that VMs will automatically be migrated off the node prior to removal. Click Remove.

4. The process now runs in the background, and you will see a number of tasks kicked off. To watch the data movement, go to Storage –> Table –> Storage Pool, select the storage pool, and scroll down slightly to monitor the Storage Pool Performance.

5. Once all of the tasks reach 100%, go back to Hardware –> Table and confirm that the cluster now has 3 nodes.

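If you would rather script the "wait until every task reaches 100%" part of steps 4 and 5 than watch the task list, the loop could look roughly like the Python sketch below. To be clear, `fetch_progress` is a hypothetical stand-in for whatever call you use to read task status. It is not a real Nutanix API, and the polling interval and retry limit are arbitrary assumptions.

```python
import time
from typing import Callable, List


def wait_for_removal(fetch_progress: Callable[[], List[int]],
                     interval_s: float = 30.0,
                     max_polls: int = 240) -> bool:
    """Poll until every task reports 100%, or give up after max_polls.

    fetch_progress is a hypothetical callable returning the percent
    complete of each removal-related task.
    """
    for _ in range(max_polls):
        percents = fetch_progress()
        if percents and all(p >= 100 for p in percents):
            return True
        time.sleep(interval_s)
    return False


# Simulated task progress so the sketch runs without a cluster.
_samples = iter([[10, 0], [60, 35], [100, 80], [100, 100]])
done = wait_for_removal(lambda: next(_samples), interval_s=0.0, max_polls=10)
print(done)
```

The simulated run finishes on the fourth poll. Against a real cluster you would swap the lambda for your own status lookup and keep a sensible interval so you are not hammering the cluster while it is busy moving data.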
A few things to note here:

  • This removal was performed on a POC unit with only 1 GbE connectivity.
  • The POC cluster had 4 nodes and I removed 1, leaving 3, which is the minimum required for a cluster to be fully operational and not in a degraded state.
  • You would see similar performance if a node failed completely. In that case, however, only the 3 surviving nodes would be working to rebuild the replica data copies. During a planned removal all 4 nodes take part, with the departing node helping to move its data onto the remaining 3, so throughput is higher.
  • Even with a 1 GbE network we were able to remove the node in about 30 minutes.
  • As with any true web-scale architecture, the more nodes we have, the quicker this removal process goes. With 8 nodes, it would in principle have completed roughly twice as fast.
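
To put rough numbers on the notes above, here is a small back-of-the-envelope sketch in Python. It assumes that all 246.94 GiB actually moved during the roughly 30 minutes, and that removal time is inversely proportional to node count. The second assumption is an idealized model behind the "twice as fast" estimate, not a measured result, and the function names are my own.

```python
DATA_GIB = 246.94       # data on the removed node (step 3)
MINUTES_MEASURED = 30   # approximate removal time on the POC
NODES_MEASURED = 4      # nodes in the POC cluster


def implied_rate_mib_s(data_gib: float, minutes: float) -> float:
    """Average aggregate data movement rate in MiB/s."""
    return data_gib * 1024 / (minutes * 60)


def ideal_minutes(nodes: int,
                  measured_minutes: float = MINUTES_MEASURED,
                  measured_nodes: int = NODES_MEASURED) -> float:
    """Idealized estimate: time is inversely proportional to node count."""
    return measured_minutes * measured_nodes / nodes


rate = implied_rate_mib_s(DATA_GIB, MINUTES_MEASURED)
print(f"Implied aggregate rate: {rate:.1f} MiB/s")
for n in (4, 8, 16):
    print(f"{n} nodes: about {ideal_minutes(n):.1f} minutes (idealized)")
```

Under those assumptions the average works out to about 140 MiB/s across the cluster, which is more than a single 1 GbE link can carry on its own and fits with several nodes moving data in parallel. Real clusters will not scale perfectly linearly, so treat the per-node-count figures as a best case.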