To properly size an HA failover cluster, a few things need to be determined. You need to know how many hosts are going to be in your cluster and how many host failures you want to tolerate (N+?), and it helps to have resource utilization data for your VMs to gauge fluctuation. Once we know this information, we can use a simple formula to determine the maximum utilization for each host that maintains the appropriate failover level.
Here is an example:
Let’s say we have 5 hosts in a DRS cluster and we want to be able to lose (1) host (N+1). We also want to have 10% overhead on each server to account for resource fluctuation. First, we need to take 10% off the top of all (5) servers, which leaves us with 90% usable resources on each host. Next, we need to account for the loss of (1) host. If a host is lost, we need to distribute its load across the remaining (4) hosts. To do this, we divide one host’s 90% of usable resources by the (4) remaining hosts (90 / 4 = 22.5). This tells us that each remaining host needs to absorb 22.5% of a host’s capacity.
Taking into account the original 10% overhead plus the 22.5% capacity needed for failover, we need to keep 32.5% of each host’s resources available, which means we can only utilize 67.5% of each host in the cluster to maintain an N+1 failover cluster with 10% overhead for resource fluctuation. The formula for this would be:
((100 - %overhead) * #host_failures) / (num_hosts - #host_failures) + %overhead = total % to hold in reserve on each ESX host
Example 1:
((100-10)*1)/(5-1)+10 = 32.5
(5-server cluster with 10% overhead allowing 1 host failure) 67.5% of each host usable
Example 2:
Failover of 1 host
((100-20)*1)/(8-1)+20 = 31.4
(8-server cluster with 20% overhead allowing for 1 host failure) 68.6% of each host usable
Failover of 2 hosts
((100-20)*2)/(8-2)+20 = 46.7
(8-server cluster with 20% overhead allowing for 2 host failures) 53.3% of each host usable
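If you want to try other cluster sizes, the formula above is easy to script. Here is a minimal Python sketch of it; the function names (reserve_per_host, usable_per_host) and the input checks are my own choices for illustration and are not part of any VMware tool or API.

```python
def reserve_per_host(num_hosts: int, host_failures: int, overhead_pct: float) -> float:
    """Percent of each host to hold in reserve, following the formula above."""
    # Assumed sanity checks: at least one host must survive the failures.
    if num_hosts <= 0 or host_failures < 0 or host_failures >= num_hosts:
        raise ValueError("need num_hosts > 0 and 0 <= host_failures < num_hosts")
    if not 0 <= overhead_pct < 100:
        raise ValueError("overhead_pct must be at least 0 and below 100")

    # Load carried by the failed hosts, spread evenly over the surviving hosts.
    failover_share = (100 - overhead_pct) * host_failures / (num_hosts - host_failures)
    return failover_share + overhead_pct


def usable_per_host(num_hosts: int, host_failures: int, overhead_pct: float) -> float:
    """Percent of each host that can be used for normal load.

    A negative result means the requested failure tolerance cannot be met
    with this many hosts at this overhead.
    """
    return 100 - reserve_per_host(num_hosts, host_failures, overhead_pct)


if __name__ == "__main__":
    # (hosts, failures tolerated, % overhead) for the scenarios in this post
    scenarios = [(5, 1, 10), (8, 1, 20), (8, 2, 20)]
    for hosts, failures, overhead in scenarios:
        reserve = reserve_per_host(hosts, failures, overhead)
        usable = usable_per_host(hosts, failures, overhead)
        print(f"{hosts} hosts, N+{failures}, {overhead}% overhead: "
              f"reserve {reserve:.1f}%, usable {usable:.1f}%")
```

Running it reproduces the reserve figures from the examples (32.5, 31.4 and 46.7), and you can add your own rows to the scenarios list to see how quickly the usable percentage drops as you tolerate more failures on a small cluster.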
Determining the %overhead can be tricky without a good capacity assessment, so be careful: if you don’t allocate enough overhead and you have host failures, performance can degrade and you could experience contention within the environment. I know some of the numbers seem dramatic, but redundancy comes with a cost no matter what form it takes.