Objective 4.3 is broken down as follows:
- Analyze and resolve DRS/HA faults
- Troubleshoot DRS/HA configuration issues
- Troubleshoot Virtual SAN/HA interoperability
- Resolve vMotion and storage vMotion issues
- Troubleshoot VMware Fault Tolerance
Analyze and resolve DRS/HA faults
As with the other troubleshooting modules, a lot of this comes down to experience. It's difficult to demonstrate these topics, so instead I will link the relevant VMware documentation and show what I can. The VMware vSphere Troubleshooting guide can be found here. Chapter 3 covers HA and DRS faults.
You can view the overall health of an HA-enabled cluster from the Web Client - Host and Clusters - Cluster - Monitor - vSphere HA - Summary
View Configuration Errors. From here you will be able to see if there are any HA agent errors for the host. In my example I have a host offline, and you can see the cluster cannot communicate with its HA agent.
The same health overview tabs can be used for DRS from Web Client - Host and Clusters - Cluster - Monitor - vSphere DRS
Troubleshoot DRS/HA configuration issues
VMware’s vSphere Troubleshooting guide can be found here and covers some scenarios that I won't repeat, but if you are sitting the exam, check it out.
One common issue when deploying HA is a failure of the initial HA configuration. Reasons for failure can include
- Host communication errors - troubleshoot network issues between hosts.
- Timeout errors - Possible causes include that the host crashed during the configuration task, the agent failed to start after being installed, or the agent was unable to initialize itself after starting up. Verify that vCenter Server is able to communicate with the host.
- Lack of resources - Free up approximately 75MB of disk space. If the failure is due to insufficient unreserved memory, free up memory on the host by either relocating virtual machines to another host or reducing their reservations. In either case, retry the vSphere HA configuration task after resolving the problem.
- Reboot pending - If an installation for a 5.0 or later host fails because a reboot is pending, reboot the host and retry the vSphere HA configuration task.
Another common issue, and one that could easily be simulated in the exam, is an HA agent network partition or an HA agent network isolation.
A network partition is reported if both of the following conditions are met
- The vSphere HA master host to which vCenter Server is connected is unable to communicate with the host by using the management (or Virtual SAN) network, but is able to communicate with that host by using the heartbeat datastores that have been selected for it.
- The host is not isolated.
To see which datastores are configured for heartbeating, go to Web Client - Host and Clusters - Cluster - Monitor - vSphere HA - Heartbeat
Make sure the hosts can communicate across the management network (or Virtual SAN network). To find the assigned VMkernel interface, go to Web Client - Host - Manage - Networking - VMkernel Adapters
When troubleshooting, use the vmkping command with the -I option to specify the VMkernel interface to send from.
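As a rough sketch of that check, the helper below pings a list of peer hosts from one VMkernel interface and reports which ones respond. The function name check_peers, the PING_CMD override, the interface vmk0 and the 192.0.2.x addresses are all my own placeholders, not anything from the VMware documentation, so substitute the values from your own cluster.

```shell
# Ping each peer from the given VMkernel interface and report OK/FAIL.
# PING_CMD is a hypothetical override so the loop can be dry-run off-host;
# on an ESXi host it defaults to vmkping.
# Tip: `esxcli network ip interface ipv4 get` lists the VMkernel
# interfaces and their IPv4 addresses on the host.
check_peers() {
  iface=$1
  shift
  failed=0
  for ip in "$@"; do
    if ${PING_CMD:-vmkping} -I "$iface" -c 3 "$ip" >/dev/null 2>&1; then
      echo "OK $ip"
    else
      echo "FAIL $ip"
      failed=1
    fi
  done
  return $failed
}

# Example (placeholder interface and addresses):
#   check_peers vmk0 192.0.2.11 192.0.2.12
```

A FAIL against a host that is still updating its heartbeat datastores would be consistent with the partition conditions described above.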
For HA related logs check out
- /var/log/fdm.log
- /var/log/vmkernel.log
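A quick way to narrow down a large log is a case-insensitive grep. This is only a sketch: the search terms below are my own guesses at useful keywords, not known fdm.log message formats, and ha_log_hits is a name I made up.

```shell
# Print lines mentioning isolation or partitioning from an HA log.
# $1 is the log file; it defaults to the fdm.log path listed above.
# The keywords are assumptions, adjust them to what you are chasing.
ha_log_hits() {
  grep -iE 'isolat|partition' "${1:-/var/log/fdm.log}"
}

# Example:
#   ha_log_hits                        # search /var/log/fdm.log
#   ha_log_hits /var/log/vmkernel.log  # search another log
```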
Troubleshoot Virtual SAN/HA interoperability
Virtual SAN has its own network. When Virtual SAN and vSphere HA are enabled for the same cluster, the HA interagent traffic flows over this storage network rather than the management network. The management network is used by vSphere HA only when Virtual SAN is disabled. vCenter Server chooses the appropriate network when vSphere HA is configured on a host. Virtual SAN can only be enabled when vSphere HA is disabled.
The following was taken from the VMware documentation that shows the networking differences.
If you change the Virtual SAN network configuration, the vSphere HA agents do not automatically pick up the new network settings. So to make changes to the Virtual SAN network, you must take the following steps in the vSphere Web Client.
- Disable Host Monitoring for the vSphere HA cluster.
- Make the Virtual SAN network changes.
- Right-click all hosts in the cluster and select Reconfigure for vSphere HA.
- Re-enable Host Monitoring for the vSphere HA cluster.
Resolve vMotion and storage vMotion issues
To troubleshoot vMotion, start with the network. Try a vmkping that specifies the vMotion VMkernel interface, and check that the correct TCP/IP stack is configured. If jumbo frames have been enabled, use vmkping to test that they are working.
>vmkping -I vmkernel_interface ip_address -d -s 8972
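The 8972 in that command is the 9000 byte jumbo MTU minus the 20 byte IPv4 header and the 8 byte ICMP header. The small sketch below works that out for any MTU and prints the matching vmkping command line without running it. The function names, vmk1 and 192.0.2.10 are placeholders of mine.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU - 20 (IPv4 header) - 8 (ICMP header).
icmp_payload() {
  echo $(( $1 - 20 - 8 ))
}

# Print (do not run) a vmkping command for the given interface, IP and MTU.
build_vmkping() {
  echo "vmkping -I $1 $2 -d -s $(icmp_payload "$3")"
}

build_vmkping vmk1 192.0.2.10 9000   # prints: vmkping -I vmk1 192.0.2.10 -d -s 8972
build_vmkping vmk1 192.0.2.10 1500   # prints: vmkping -I vmk1 192.0.2.10 -d -s 1472
```

If the 8972 byte test fails but a payload that fits in a standard 1500 byte MTU succeeds, something along the path (vmknic, vSwitch or physical switch) is probably not configured for jumbo frames.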
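Here -d sets the don't fragment bit and -s sets the ICMP payload size in bytes. Note that in my lab the first test, with the 8972 byte payload, fails, but if I drop the size to 1470 it works, indicating that jumbo frames are not working. Keep in mind the limitations for vMotion; the exam may well simulate these conditions and ask you to solve them.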
- The source and destination management network IP address families must match. You cannot migrate a virtual machine from a host that is registered to vCenter Server with an IPv4 address to a host that is registered with an IPv6 address.
- You cannot use migration with vMotion to migrate a virtual machine that uses a virtual device backed by a device that is not accessible on the destination host. For example, you cannot migrate a virtual machine with a CD drive backed by the physical CD drive on the source host. Disconnect these devices before you migrate the virtual machine.
- You cannot use migration with vMotion to migrate a virtual machine that uses a virtual device backed by a device on the client computer. Disconnect these devices before you migrate the virtual machine.
Some additional conditions apply to Storage vMotion
- Virtual machine disks must be in persistent mode or be raw device mappings (RDMs). For virtual compatibility mode RDMs, you can migrate the mapping file or convert to thick-provisioned or thin-provisioned disks during migration as long as the destination is not an NFS datastore. If you convert the mapping file, a new virtual disk is created and the contents of the mapped LUN are copied to this disk. For physical compatibility mode RDMs, you can migrate the mapping file only.
- Migration of virtual machines during VMware Tools installation is not supported.
- Because VMFS3 datastores do not support large capacity virtual disks, you cannot move virtual disks greater than 2TB from a VMFS5 datastore to a VMFS3 datastore.
- The host on which the virtual machine is running must have a license that includes Storage vMotion.
- The host on which the virtual machine is running must have access to both the source and target datastores.
Troubleshoot VMware Fault Tolerance
Troubleshooting VMs protected by FT can take different shapes. See chapter 2 of VMware’s vSphere Troubleshooting guide, which can be found here. For the exam, try to imagine which scenarios could be simulated in a test lab and concentrate on those. Some standout scenarios for me would be
- Secondary VM is powered on with FT enabled and no compatible hosts are available.
- Misconfiguration on the FT network leading to latency issues.
- Access to FT metadata datastore was lost.
- Enabling FT on a VM fails.
- FT failover fails due to partial storage and network failures or misconfiguration.