For those still sitting on the fence, contemplating whether or not to upgrade to vSphere 5, I thought I would share just one of the many upgraded components to help inform your decision. For those who have already deployed or tested vSphere 5 in a lab, you may have noticed that the High Availability (HA) user interface looks identical to vSphere 4.1. Under the hood, however, that couldn't be further from the truth: VMware has completely rewritten the HA architecture in vSphere 5. In this blog post, I will outline the major improvements to HA, which I find to be just one of the many compelling reasons to go forward with your migration plan to vSphere 5.0. For those who are considering HA, keep in mind that HA is still not a substitute for traditional clustering within the guest operating system, because it is not application aware. Instead, HA is an option for protecting the virtual machine itself.
With vSphere 4.1 and prior, HA utilized the Automated Availability Manager (AAM) agent. The AAM agent was completely dependent on the Virtual Center Agent Service (vpxa), which acted as a translator. vSphere 5 instead introduces the Fault Domain Manager (FDM) agent. The FDM agent allows HA to interact directly with the host management agent (hostd) and with vCenter, in turn providing improved resiliency over the deprecated AAM agent. Another great improvement is the removal of the DNS dependency: HA now relies strictly on IP addresses, which eliminates the previous 26-character hostname limitation. Just as important are the HA log files. These are now written through syslog, like the rest of the ESXi log files, and can be found in the /var/log folder; the HA log is named fdm.log.
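Since the FDM log now lands in /var/log alongside the other ESXi logs, it is easy to skim for HA events. Below is a minimal sketch in Python, assuming shell access to the host; the keyword list is my own guess at interesting strings, not an official message catalog.

```python
#!/usr/bin/env python
# Minimal sketch: skim /var/log/fdm.log on an ESXi host for HA-related events.
# The keywords below are assumptions for illustration, not official log messages.
KEYWORDS = ("election", "master", "isolation", "failover")

def scan_fdm_log(path="/var/log/fdm.log"):
    matches = []
    with open(path, "r", errors="replace") as log:
        for line in log:
            if any(word in line.lower() for word in KEYWORDS):
                matches.append(line.rstrip())
    return matches

if __name__ == "__main__":
    for entry in scan_fdm_log():
        print(entry)
```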
Gone is the old primary/secondary concept, which has been replaced by a master/slave model: one host is the master and all others are its slaves. I will add, though, that multiple master nodes are possible when multiple network partitions exist. An improved heartbeat process, which differs from vSphere 4.1, sends a signal every second from the master to each slave. Each slave, in turn, sends a signal back to the master reporting the current state of the VMs it is hosting, which also allows the slave to monitor the master itself. The slaves do not, however, exchange heartbeats amongst themselves.
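To make the topology concrete, here is a toy sketch of the flow described above: the master heartbeats each slave once per second, and each slave answers with the state of its VMs. The class and field names are purely illustrative; the real FDM protocol is internal to ESXi.

```python
# Toy illustration of the master-to-slave heartbeat flow; names are illustrative only.
import time

class Slave:
    def __init__(self, name, vms):
        self.name, self.vms = name, vms

    def on_heartbeat(self):
        # Each slave answers the master with the current state of the VMs it hosts.
        return {vm: "poweredOn" for vm in self.vms}

def master_loop(slaves, rounds=3):
    for _ in range(rounds):
        for slave in slaves:                         # the master contacts every slave...
            print(slave.name, slave.on_heartbeat())  # ...and records its reported VM states
        time.sleep(1)                                # one heartbeat cycle per second

master_loop([Slave("esx01", ["vm1"]), Slave("esx02", ["vm2", "vm3"])])
```

Note that the slaves only ever talk to the master, never to each other.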
The roles of the master host include monitoring the slave nodes and communicating their states to vCenter, as well as restarting guest VMs when a slave node fails. What if the master node fails? In that case, an automated election process is initiated once the slaves detect the failure. The election takes approximately 15 seconds and promotes a slave node to master, which assumes all of the roles of the failed master. The criterion the election process uses to choose a new master is simply the greatest number of datastores connected to a slave; if two or more hosts have an equal number of datastores connected, the slave with the highest managed object ID is selected. OK, so what if all hosts fail, say from a power outage? In that case, the first host to power up begins the election process and chooses itself as the master. Once the new master has been established, it restarts any VMs that are HA protected but not running. This eliminates the need to know which node was the master before the outage.
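The election criteria boil down to a simple comparison, something like the sketch below. The host names, managed object IDs, and datastore counts are made up for illustration, and the plain string comparison of managed object IDs is a simplification that happens to work for this sample data.

```python
# Sketch of the election criteria described above: the slave seeing the most datastores
# wins; ties go to the highest managed object ID. All values here are hypothetical.
candidates = [
    {"name": "esx01", "moid": "host-21", "datastores": 6},
    {"name": "esx02", "moid": "host-35", "datastores": 6},
    {"name": "esx03", "moid": "host-18", "datastores": 4},
]

def elect_master(hosts):
    # Compare by datastore count first, then by managed object ID as the tie-breaker.
    return max(hosts, key=lambda h: (h["datastores"], h["moid"]))

print(elect_master(candidates)["name"])   # esx02 wins the tie on the higher MOID
```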
A new addition to HA in vSphere 5 is the datastore heartbeat, which adds an additional layer of resiliency. Used when the master has lost network connectivity to a slave, datastore heartbeats can validate whether that slave has actually failed or is merely isolated or network partitioned. Unlike vSphere 4.1, where no such validation occurred, the datastore heartbeat reduces unnecessary VM restart attempts. It works by having vCenter use an algorithm to select two datastores per host by default. HA defines a heartbeat region on a designated VMFS datastore by creating a file named host-<number>-hb within the datastore for each host, and the master checks this file to validate that the host is still alive. The one caveat to the datastore heartbeat is that it offers no advantage when a physical NIC failure leaves both the network and the SAN inaccessible; this scenario can be avoided with NIC teaming.
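Conceptually, the check amounts to looking for the per-host heartbeat file on the selected heartbeat datastores. The sketch below illustrates the idea only; the datastore paths and host number are placeholders, and the real validation is performed internally by the master's FDM agent.

```python
# Illustrative sketch of the datastore heartbeat idea: look for the host-<number>-hb
# file on the heartbeat datastores. Paths and the host number are hypothetical.
import os

def host_has_datastore_heartbeat(host_number, heartbeat_datastores):
    filename = "host-%d-hb" % host_number
    for ds_path in heartbeat_datastores:
        if os.path.exists(os.path.join(ds_path, filename)):
            return True
    return False

# Example with two hypothetical heartbeat datastores mounted under /vmfs/volumes.
print(host_has_datastore_heartbeat(21, ["/vmfs/volumes/datastore1",
                                        "/vmfs/volumes/datastore2"]))
```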
Also included in vSphere 5 HA's improvements is the increase in the maximum number of concurrent host failures HA can tolerate, now 31 nodes in a 32-node cluster (every node except the master performing the restarts). That is effectively 100% of total cluster node capacity, as opposed to vSphere 4.1, where the maximum concurrent host HA failover was only 4 nodes. Additionally, HA in vSphere 5 allows a maximum of 512 VMs per host, whereas vSphere 4.1 allowed only 320 VMs; I believe that is a significant improvement worth noting. vSphere 5 also improves the HA admission control policies: you can now designate multiple failover hosts, and you can specify the percentages of CPU and memory reserved for failover capacity independently of each other.
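As an example, the percentage-based admission control policy can be set through the vSphere API. Here is a hedged sketch using the pyVmomi Python bindings; the vCenter address, credentials, and cluster name are placeholders, and certificate handling is omitted for brevity.

```python
# Sketch: reserve independent CPU and memory percentages for HA failover capacity
# via the vSphere API (pyVmomi). All connection details below are placeholders.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator", pwd="password")
try:
    content = si.RetrieveContent()
    # Find the cluster object by name with a simple container view walk.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster01")

    # Reserve 25% of cluster CPU and 25% of cluster memory for failover,
    # each percentage specified independently as described above.
    policy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
        cpuFailoverResourcesPercent=25,
        memoryFailoverResourcesPercent=25)
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(admissionControlPolicy=policy))
    cluster.ReconfigureComputeResource_Task(spec, modify=True)
finally:
    Disconnect(si)
```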
The significant enhancements to HA provide improved resiliency for clustering in the datacenter, which makes HA one of the most prominent improvements in vSphere 5.