VMware Virtual Center – Physical or Virtual?

Over the years there has been some controversy over this topic: should Virtual Center (vCenter) be physical or virtual? There is the argument that it should be physical to ensure consistent management of the virtual environment, and there is also the fact that Virtual Center requires a good amount of resources to handle logging and performance information.

I’m a big proponent of virtualizing Virtual Center. With the hardware available today there is no reason not to. Even in large environments that really tax the Virtual Center server, you can just throw more resources at it.

Many companies are virtualizing mission-critical applications to leverage VMware HA to protect them; how is Virtual Center any different? But what do you do if Virtual Center crashes? How do you find and restart it when DRS is enabled on the cluster?

You have a few options here.

  1. Override the DRS setting for the Virtual Center VM and set it to Manual. Now you will always know where your Virtual Center server is if you need to resolve issues with it.
  2. Utilize PowerShell to track the location of your virtual machines. I wrote an article that included a simple script to do this, which I will include on our downloads page for easy access; a minimal sketch follows this list.
  3. Run an isolated 2-node ESX cluster for infrastructure machines.
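To illustrate the first two options, here is a minimal PowerCLI sketch; the server name, credentials, VM name, and output path are placeholders for your environment:

    # Connect to Virtual Center (placeholder server name and credentials)
    Connect-VIServer -Server vcenter01.example.com -User admin -Password pass

    # Option 1: pin the Virtual Center VM to Manual so DRS never moves it
    Get-VM -Name "vcenter01" | Set-VM -DrsAutomationLevel Manual -Confirm:$false

    # Option 2: record which host every VM is running on, so you can find
    # Virtual Center even when Virtual Center itself is down
    Get-VM |
        Select-Object Name, @{Name = "Host"; Expression = { $_.VMHost.Name }} |
        Export-Csv -Path "C:\reports\vm-locations.csv" -NoTypeInformation

Schedule the second snippet to run periodically and you always have a recent map of where every VM lives, independent of Virtual Center being up.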

My last option warrants a little explaining. Why would you want to run a dedicated 2-node cluster just for infrastructure VMs? The real question is why wouldn’t you? Think about it: Virtual Center is a small part of the equation. VC and your ESX hosts depend on DNS, NTP, AD, and other services. What happens if you lose DNS? You lose your ability to manage your ESX hosts through VC if you follow best practice and add them by FQDN. If AD goes down you have much larger issues, but if your AD domain controllers are virtual and you somehow lose them both, that’s a problem, and one that could affect your ability to access Virtual Center. So why not build an isolated two-node cluster that houses your infrastructure servers? You’ll always know where they will be, you can use affinity rules to keep servers that back each other up on separate hosts, and you can always have a cold spare available for Virtual Center.
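For example, the anti-affinity rule to keep a pair of virtual domain controllers on separate hosts might look like this in PowerCLI (the cluster and VM names are examples):

    # Keep the two domain controllers apart; -KeepTogether $false makes
    # this a separation (anti-affinity) rule
    New-DrsRule -Cluster (Get-Cluster "Infra-Cluster") -Name "Separate-DCs" `
        -KeepTogether $false -VM (Get-VM "dc01", "dc02") -Enabled $true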

Obviously this is not a good option for small environments, but if you have 10, 30, 40, 80, or 100 ESX hosts and upwards of a few hundred VMs, I believe this is not only a great design solution but a much-needed one. If you are managing that many ESX hosts, it’s important to know for certain where the essential infrastructure virtual machines reside, because they affect a large part of your environment, if not all of it.

Business Continuity and Disaster Recovery with Virtualization

In recent years Business Continuity and Disaster Recovery have been big buzzwords. Companies small and large vowed to launch initiatives to implement one or both in their IT strategies. My question is: what happened? Why is it that I rarely see organizations that have implemented, or even have a plan to implement, Disaster Recovery?

Is it a lack of understanding? Is it that most companies believe it is too expensive or complicated to implement? Well, it doesn’t have to be either. Most companies that are undergoing virtualization initiatives already have half, if not more, of what they need to implement Disaster Recovery. The simple fact is that if you already have at least two data centers and are virtualizing, you are a prime candidate. Here are some common questions and my answers regarding this subject:

1.) Do I need to utilize SAN replication to implement Disaster Recovery in a virtualized environment?

No! There are other options to achieve Disaster Recovery without SAN replication. If you are running VMware you can utilize some of what you already have: VMware Consolidated Backup (VCB) in conjunction with VMware Converter can be used to implement Disaster Recovery. This wouldn’t be as elegant as SAN replication, but you could implement scheduled V2Vs of your virtual machines from one site to another, and it’s a very simple solution to implement.
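As a rough sketch of what a scheduled V2V copy could look like, the loop below calls vcbMounter for each protected VM. I’m writing the flags from memory of the VCB command line, so treat the exact switches as assumptions and verify them against your VCB documentation:

    # Export a full copy of each protected VM with vcbMounter (flag names
    # may vary by VCB version -- verify before relying on this)
    $vms = "app01", "app02", "db01"   # example source VMs
    foreach ($vm in $vms) {
        & "C:\Program Files\VMware\VMware Consolidated Backup Framework\vcbMounter.exe" `
            -h vcenter01.example.com -u backupsvc -p pass `
            -a "name:$vm" -r "E:\vcb-exports\$vm" -t fullvm
    }
    # Ship the exported folders to the DR site and import them there with
    # VMware Converter, on whatever schedule meets your recovery point needs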

What about the hardware, though? Where do we get the additional hardware? The answer is simple: reuse what you already have. Take those old servers you just freed up and put them to good use. Beef them up! Need more RAM? Tear RAM out of some and add it to others, and do the same with CPUs, to build a number of more powerful servers that you can use for DR. Granted, you may need more of the reused servers to host all the VMs, but at the end of the day you would have a disaster recovery plan.

2.) What if I can’t do SAN replication but want synchronous and asynchronous replication?

This can still be achieved using software-based replication in your virtual machines. Software like NSI Doubletake and RepliStor provides this functionality at a relatively low cost. With virtualization you can cut costs even more. With physical servers you traditionally needed a one-to-one mapping for replication, which required a license for each host. With virtualization you can take a many-to-one approach, cutting down on the licenses you need to replicate your data.

With this approach I would still use VCB or VMware Converter to make weekly copies of your virtual machine OS drives. You can then utilize one of the mentioned applications (Doubletake or RepliStor) for synchronous replication of your data volumes. You can achieve this and save licenses by installing, say, Doubletake on each of the source systems. Then you would create a virtual machine at the DR site, add a drive to it for each of the source systems’ data volumes, and replicate each source’s data to a different data volume on the destination VM. If you ever need to fail over, just dismount the volumes from the destination VM and attach each one to its respective VM that was created through VCB or VMware Converter, as in the sketch below.
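At failover time, the disk swap could look something like the following PowerCLI sketch; the VM names and disk label are placeholders, and you would dismount the volume inside the guest before detaching it:

    # Move a replicated data disk from the replication-target VM to the
    # VM recovered via VCB/Converter (names are examples)
    $disk = Get-HardDisk -VM (Get-VM "dr-repl-target") |
        Where-Object { $_.Name -eq "Hard disk 2" }
    $diskPath = $disk.Filename   # e.g. "[datastore1] dr-repl-target/data.vmdk"

    # Detach the disk (this does not delete the VMDK) and attach it to the
    # recovered VM
    Remove-HardDisk -HardDisk $disk -Confirm:$false
    New-HardDisk -VM (Get-VM "app01-recovered") -DiskPath $diskPath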

3.) These methods are great but what would it take to bring an environment back up using them?

That’s rather hard to say, because it depends on the size of your environment and how many VMs you are relocating to your DR site. If your environment is large and you have specific SLAs to adhere to regarding RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives), then you should consider SAN-to-SAN replication and something like VMware Site Recovery Manager (SRM), which does an outstanding job of handling this. SRM also allows you to run disaster recovery simulations to test the effectiveness of your DR strategy and determine whether you are meeting your SLAs.

If you are doing DR on the cheap, the real answer to this question is that you will be able to recover your systems a heck of a lot quicker than if you had to restore from backups or rebuild your systems.

4.) This is great but where do we begin?

Don’t know where to begin? The answer is easy: start small and grow into it. Find at least two servers that you can reuse, beef them up, determine a configuration for them, and deploy ESX to them. You need to have some infrastructure in place at your DR location to make DR work, so that is a good place to start. You need to add the following services at your DR location:

  • Active Directory Servers
  • DNS Servers
  • NTP Servers
  • Virtual Center Server

You may need to deploy additional servers for your specific environment, but I think you get the idea.

Next, pick a few development or test machines that you can replicate to the DR site. Develop a plan, schedule downtime, and perform a test failover to the remote site. Once you have worked out the kinks and have a written DR plan, determine your first phase of servers to incorporate into your DR site. Generally at this point you would want to pick some of your most valuable servers to ensure they are protected.

You can then break all the servers that need to be replicated into phases, determine the host requirements at the DR site, and develop a plan for each phase of your DR implementation. It would be a good idea to have a remote replication VM for every 20 or so source VMs. This really depends on the data change rate of your servers, but 20 is a good starting point.

This article is obviously not all-inclusive and is very high level, but hopefully it inspires some of you to start developing a DR strategy and at least start testing some of these solutions in your environment, because data is a terrible thing to waste.

VMware HA Cluster Sizing Considerations

To properly size an HA failover cluster there are a few things you need to determine. You need to know how many hosts are going to be in your cluster and how many hosts you want to be able to fail (N+?), and it helps to know resource utilization information about your VMs to gauge fluctuation. Once we know this information we can use a simple formula to determine the maximum utilization for each host that maintains the appropriate HA failover level.

Here is an example:

Let’s say we have 5 hosts in a DRS cluster and we want to be able to fail one host (N+1). We also want to have 10% overhead on each server to account for resource fluctuation. First we take 10% off the top of all 5 servers, which leaves us with 90% utilizable resources on each host. Next we account for the loss of one host. In the event that a host is lost, we need to distribute its load across the remaining 4 hosts. To do this we divide the failed host’s 90% of usable resources across the 4 remaining hosts, which means each remaining host must set aside 22.5% of its capacity to absorb that load.

Taking into account the original 10% overhead plus the 22.5% capacity needed for failover, we need to keep 32.5% of each host’s resources available, which means we can only utilize 67.5% of each host in the cluster to maintain an N+1 failover cluster with 10% overhead for resource fluctuation. The formula for this is:

((100 - %overhead) * #host_failures) / (#hosts - #host_failures) + %overhead = % of each host to keep in reserve

Example 1:

((100 - 10) * 1) / (5 - 1) + 10 = 32.5
(5-server cluster with 10% overhead allowing 1 host failure): 67.5% of each host usable


Example 2:

Failover of 1 host

((100 - 20) * 1) / (8 - 1) + 20 = 31.4
(8-server cluster with 20% overhead allowing 1 host failure): 68.6% of each host usable

Failover of 2 hosts

((100 - 20) * 2) / (8 - 2) + 20 = 46.6
(8-server cluster with 20% overhead allowing 2 host failures): 53.4% of each host usable
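If you want to script this check, here is a small PowerShell helper implementing the formula above (the function name is mine):

    function Get-HAReservedPercent {
        param(
            [int]$Hosts,        # hosts in the cluster
            [int]$Failures,     # host failures to tolerate
            [double]$Overhead   # % overhead reserved for fluctuation
        )
        # % of each host to keep free = failover share + fluctuation overhead
        ((100 - $Overhead) * $Failures) / ($Hosts - $Failures) + $Overhead
    }

    Get-HAReservedPercent -Hosts 5 -Failures 1 -Overhead 10   # 32.5  -> 67.5% usable
    Get-HAReservedPercent -Hosts 8 -Failures 1 -Overhead 20   # ~31.4 -> 68.6% usable
    Get-HAReservedPercent -Hosts 8 -Failures 2 -Overhead 20   # ~46.6 -> 53.4% usable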

Determining the %overhead can be tricky without a good capacity assessment, so be careful: if you don’t allocate enough overhead and you have host failures, performance can degrade and you could experience contention within the environment. I know some of the numbers seem dramatic, but redundancy comes with a cost, no matter what form that redundancy takes.