VMware vSAN - A Closer Look [Part 5 - Failure Events]

If you would like to read any of the other chapters in this blog series, see the links at the end of this post.

So far I’ve covered the various methods vSAN uses to protect data within a cluster and even across sites. In this post I will look at how vSAN handles failures and the processes involved in recovering data to ensure a virtual machine’s redundancy requirements are met.

Failure Events

vSAN places affected components into one of two states when a hardware failure is detected: absent or degraded.

  • Absent - vSAN will wait 60 minutes (the default repair delay) before attempting recovery of the affected objects and components.
  • Degraded - vSAN will immediately attempt to recover the affected objects and components where possible.
1.png
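
To make the two states concrete, here's a purely illustrative Python sketch of the decision logic described above. The state names mirror vSAN's terminology, but the function and its return strings are my own simplification, not vSAN code:

```python
from enum import Enum

REPAIR_DELAY_MINUTES = 60  # vSAN's default delay before rebuilding absent components

class ComponentState(Enum):
    DEGRADED = "degraded"  # device returned error codes: failure assumed permanent
    ABSENT = "absent"      # device vanished without errors: it may come back

def plan_recovery(state: ComponentState, minutes_since_failure: int) -> str:
    """Toy model of how vSAN reacts to a component failure."""
    if state is ComponentState.DEGRADED:
        # No point waiting: rebuild immediately if spare capacity exists.
        return "rebuild components on other hosts/disks now"
    # Absent components get a grace period in case the device returns.
    if minutes_since_failure < REPAIR_DELAY_MINUTES:
        return "wait: resynchronise if the device comes back"
    return "delay expired: rebuild components on other hosts/disks"

print(plan_recovery(ComponentState.DEGRADED, 0))   # rebuild now
print(plan_recovery(ComponentState.ABSENT, 10))    # still waiting
print(plan_recovery(ComponentState.ABSENT, 75))    # timer expired, rebuild
```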

If a disk within a host fails and error codes are detected by vSAN, all the affected components are marked as degraded and vSAN will attempt to create a new copy of the data, provided there are resources available to do this. If there are no resources available, it will wait until the failure has been resolved. During this period the virtual machines affected by the failure remain available and continue to run. If the cache device fails, vSAN marks the entire disk group as degraded.

If a disk fails without warning (no error codes detected), vSAN marks all the affected components located on the device as absent and a 60 minute delay timer is started. Consider the scenario where a disk is accidentally removed from a host: if the disk is placed back within this time window, the components are resynchronised and vSAN carries on as normal. This is preferred over the degraded state as it saves resources being consumed by an unnecessary rebuild. If the 60 minute delay timer expires, the components marked as absent are rebuilt on other hosts in the cluster if resources are available to do this. If a capacity device fails in this way, vSAN marks the components on that device as absent (note that with deduplication and compression enabled, a capacity device failure impacts the entire disk group).
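
The 60 minute timer is the host advanced setting VSAN.ClomRepairDelay (in minutes), which can be raised or lowered if the default doesn't suit your environment; it should be set to the same value on every host in the cluster. Below is a minimal pyVmomi sketch of changing it across all hosts. The vCenter address and credentials are placeholders, and on some versions the option expects a long rather than an int, so treat this as a starting point and test in a lab first:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Assumption: vCenter address and credentials below are lab placeholders.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.lab.local",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        # Raise the repair delay from 60 to 90 minutes on every host.
        host.configManager.advancedOption.UpdateOptions(
            changedValue=[vim.option.OptionValue(
                key="VSAN.ClomRepairDelay", value=90)])
        print(f"Set repair delay to 90 minutes on {host.name}")
    view.Destroy()
finally:
    Disconnect(si)
```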

2.png

If a host fails, all components residing on that host are marked as absent by vSAN and the default 60 minute timer starts. If the host comes back online within this time, the components are resynchronised and vSAN continues as normal. If the timer exceeds 60 minutes, the affected components and objects are rebuilt on other hosts in the cluster if there are resources to do this.
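
When a host drops out, it can be useful to query what the surviving hosts think of the cluster membership. Here is a small pyVmomi sketch using the per-host vSAN system, assuming a vim.HostSystem object obtained as in the previous snippet; the fields shown come from the vim.vsan.host.ClusterStatus object returned by QueryHostStatus:

```python
from pyVmomi import vim

def show_vsan_membership(host: vim.HostSystem) -> None:
    """Print one host's view of vSAN cluster membership and health."""
    status = host.configManager.vsanSystem.QueryHostStatus()
    print(f"{host.name}: role={status.nodeState.state}, health={status.health}")
    # memberUuid lists the node UUIDs this host currently sees in the
    # cluster; a missing entry is a hint that a host is isolated or down.
    print(f"  visible members: {len(status.memberUuid or [])}")
```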

Maintenance Mode and Self Healing

3.png

Personally, I’ve felt an area that has caused some confusion is what happens when a host is placed into maintenance mode. There are several options presented when attempting to place a host into maintenance mode, and the reason for doing so may influence the choice made. If it’s just a case of rebooting the host then you may wish to select the default choice, which is to ensure data is accessible from other hosts. In the example shown there is a risk associated with this decision in that 9 objects will become non-compliant. This doesn’t mean that the virtual machines will be inaccessible; it means they will no longer comply with the policy assigned to them, which in this case is to tolerate a single host failure. However, any virtual machine with a pFTT policy setting of 0 that has components located on this host will be marked as inaccessible, which would mean data loss as there are not sufficient copies of the data to maintain availability.

As a quick example, I’ll attempt to put a second host of my three host cluster into maintenance mode. This warns me that 9 objects will become inaccessible, so it’s probably not a good idea to go ahead with this.

4.png

If the host being placed into maintenance mode is to be removed from the cluster, the obvious choice is to migrate all the data to other hosts in the cluster. This leads to another design consideration around the minimum number of hosts to have in a cluster, but I’ll come back to this shortly.

A new feature in v6.6 is the option to run a pre-check if you need to remove a disk or disk group from the cluster. As with placing a host into maintenance mode, there are three options available.

5.png

Ensure Accessibility

This option ensures all virtual machines on this host will remain accessible if the host is shut down or removed from the cluster.

Full Data Migration

All data will be migrated from the chosen host to others in the cluster. As a large amount of data may be copied, this option will consume the most resources and take the longest to complete. However, it ensures all virtual machines remain compliant with their assigned storage policy.

No Data Migration

vSAN will not migrate any data from the selected host. This means some virtual machines may become inaccessible if the host is shut down or removed from the cluster. The recommendation is not to use this option except in specific circumstances, as there is a risk of data loss.
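
For completeness, these three choices map onto the vSphere API's vSAN decommission modes: ensureObjectAccessibility, evacuateAllData and noAction. A pyVmomi sketch of entering maintenance mode with an explicit mode (the helper function name is mine, and it assumes a connected session and a vim.HostSystem object):

```python
from pyVmomi import vim

def enter_maintenance(host: vim.HostSystem,
                      mode: str = "ensureObjectAccessibility") -> vim.Task:
    """Enter maintenance mode with an explicit vSAN data evacuation mode.

    mode maps to the UI options: 'ensureObjectAccessibility' (Ensure
    Accessibility), 'evacuateAllData' (Full Data Migration) and
    'noAction' (No Data Migration).
    """
    spec = vim.host.MaintenanceSpec(
        vsanMode=vim.vsan.host.DecommissionMode(objectAction=mode))
    # timeout=0 means wait indefinitely for the evacuation to complete.
    return host.EnterMaintenanceMode_Task(timeout=0, maintenanceSpec=spec)
```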

As you can see, vSAN makes intelligent decisions based on the information it has available to deal with failures and maintenance within the cluster, and it can rebuild data on another host or disk automatically to ensure virtual machines comply with their assigned policy. This leads to an important decision around the minimum number of hosts needed in a vSAN cluster.

Whilst 2 hosts plus a witness and 3 host clusters are fully supported and can maintain an accessible copy of the data during a failure, there is no scope for any degraded components to be rebuilt automatically or self-healed. It’s also worth considering the risks when performing maintenance on hosts, for example patching or upgrading: if a failure occurred while one host was already in maintenance mode, there would not be sufficient copies of the data to maintain availability. With this in mind I strongly recommend considering a four host cluster when deploying vSAN, as the sketch below illustrates.
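
As a quick worked example of that host-count arithmetic, here is a simple sketch of the standard RAID-1 sizing rule (my own helper, not an official formula):

```python
def min_hosts(ftt: int, self_healing: bool = True) -> int:
    """Minimum hosts for a RAID-1 (mirroring) vSAN cluster.

    Tolerating n failures needs n+1 data copies plus n witness
    components, i.e. 2n+1 fault domains. One extra host gives vSAN
    somewhere to rebuild after a failure (self-healing) and keeps
    the cluster protected while a host is in maintenance mode.
    """
    base = 2 * ftt + 1
    return base + 1 if self_healing else base

print(min_hosts(1, self_healing=False))  # 3 - supported minimum for FTT=1
print(min_hosts(1))                      # 4 - recommended, allows self-healing
```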

Next up is a review of some of the data reduction features found in vSAN and how they work.

More in the series