Caution: articles are written for technical, not grammatical, accuracy. If poor grammar offends you, proceed with caution ;-)
OK, I'll admit it: I am spoiled by the capabilities of vSphere. What other platform lets you schedule system updates that run unattended and without outages for the applications in use? I don't mean Windows patches; those still require a monthly reboot. I am talking about the hypervisor updates. VMware Update Manager coordinates all of this for you. Then along comes vShield Zones to break it all.
First, let me explain what I am trying to do. To simplify things, vShield Zones is a firewall for vSphere Virtual Machines. Rather than regurgitate how it works, take a look at Rodney's excellent post. A customer has decided to use vShield Zones to help with PCI compliance. The desire is that only certain VMs be allowed to communicate with certain other VMs over specific network ports, and that the traffic be audited. 'nuff said.
vShield Zones seems to be the perfect solution for this. It works almost seamlessly with vCenter and the underlying ESXi hosts. It uses hardened Linux virtual appliances (vShield Agents) to do the firewalling, and it provides a fairly nice management interface for creating firewall rules and distributing them to the vShield Agents. Best of all, IT'S FREE! At least for vSphere Advanced editions and above. Keep in mind that this is still a 1.x release and some things need to be worked out.
Now, on to the gotchas.
Gotcha #1 – Networking
When it comes to networking, the vShield Agent is designed to sit between a vSwitch that is externally connected via physical NICs (pNICs) and a vSwitch that is isolated from the outside world. The vShield Agent installation wizard will prompt you to select a vSwitch to protect. This is illustrated below. The red line indicates network traffic flow.

This works like a champ in this configuration: one vSwitch for management (which is naturally on an isolated network to begin with), one vSwitch for the VMs to connect to the vShield Agent, and one vSwitch to connect everything to the outside world. It can also be deployed with limited downtime. If you are lucky enough to have the Enterprise Plus edition, you may want to use a vNetwork Distributed Switch or even a Cisco Nexus 1000v. You will need to make some manual configuration changes to make this work, as outlined in the admin guide.
The gotcha is with blade servers or "pizza box" servers that have limited I/O slots. If all of the VM traffic must flow through the same physical NICs and you use a vSwitch, you need the vShield Agent to protect a port group rather than an entire vSwitch. You will need to create a vSwitch with a protected port group and connect it to the pNICs. Then you can install the vShield Agent. Once the vShield Agent is installed, go back to the vSwitch attached to the pNICs and add an unprotected port group. This is illustrated below. The red line is the protected traffic and the blue line is the unprotected traffic.

As you can see, there is an unprotected port group (ORIGINAL Network). This needs to be added to the vSwitch AFTER the vShield Agent is installed. If the ORIGINAL Network is already part of the vSwitch, it needs to be removed BEFORE installing the vShield Agent. To avoid an outage, disable DRS and manually vMotion all VMs off of the ESX/ESXi host before installing the vShield Agent and modifying the port groups.
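If you are working from the classic ESX service console, the port-group shuffle can be scripted with esxcfg-vswitch (on ESXi you would use the vicfg-vswitch equivalents in the vCLI). Here is a minimal sketch; the vSwitch, port group, and vmnic names are just examples:

# Before installing the vShield Agent: create the vSwitch with only the
# protected port group on it, and uplink it to the physical NIC.
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -A "PROTECTED Network" vSwitch1
esxcfg-vswitch -L vmnic1 vSwitch1

# If the unprotected port group already exists on this vSwitch, remove it
# before the vShield Agent install (after the VMs have been moved off).
esxcfg-vswitch -D "ORIGINAL Network" vSwitch1

# ...install the vShield Agent, choosing the protected port group...

# Once the install is done, add the unprotected port group back.
esxcfg-vswitch -A "ORIGINAL Network" vSwitch1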
Gotcha #2 – DRS/HA Settings
The vShield Agents attach to isolated vSwitches with no pNIC connection. As you should already know, using DRS and vMotion with VMs on an isolated vSwitch can cause connectivity between those VMs to fail. By default, you cannot vMotion a VM that is attached to an isolated vSwitch; you will need to enable this by editing the vpxd.cfg file. You will also need to disable HA and DRS for the vShield Agents so they stay on the hosts where they are installed. Both changes are well documented. Obviously, you will need to install a vShield Agent on every ESX/ESXi host in the cluster.
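For reference, the vpxd.cfg change boils down to a single setting that tells vCenter to skip the compatibility check that normally blocks vMotion of a VM sitting on a vSwitch with no physical uplink. I believe the stanza looks like the excerpt below, but double-check the admin guide for the exact placement and restart the vCenter Server service after editing:

<config>
  <migrate>
    <test>
      <CompatibleNetworks>
        <!-- "false" skips the isolated-network check during vMotion validation -->
        <VMOnVirtualIntranet>false</VMOnVirtualIntranet>
      </CompatibleNetworks>
    </test>
  </migrate>
</config>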
The gotcha here is that, with HA disabled for the vShield Agent, there is no facility for automatic startup. There is an automatic startup setting in the Startup/Shutdown section of the host configuration, but it has two problems. First, it is an all-or-nothing setting. Second, according to the Availability Guide:
“NOTE The Virtual Machine Startup and Shutdown (automatic startup) feature is disabled for all virtual machines residing on hosts that are in (or moved into) a VMware HA cluster. VMware recommends that you do not manually re-enable this setting for any of the virtual machines. Doing so could interfere with the actions of cluster features such as VMware HA or Fault Tolerance.”
So, if a host fails, HA will restart all protected VMs on different hosts. If the failed host comes back online, you risk DRS migrating protected VMs back onto it. Those VMs would then become disconnected because the vShield Agent will not start automatically. If a host fails, hope it fails hard enough that it doesn't restart.
Gotcha #3 – Maintenance Mode
At the beginning of this post, I mentioned how VMware Update Manager has spoiled me. VUM can be scheduled to patch VMs and hosts. When host patching is scheduled, VUM places one host in Maintenance Mode, which evacuates all VMs. Then it applies the scheduled patches, reboots, and exits Maintenance Mode, repeating this for each host in the cluster. This works great unless there are running VMs with DRS disabled, like the vShield Agent.
In the test environment, when a host was manually set to enter Maintenance Mode, it would stall at 2% without moving the test VMs. I am not sure of the order in which VMs are migrated off, but none were migrated in the test environment; this could vary in different installations. Here's the gotcha: you cannot power the vShield Agent off, because the protected VMs would become disconnected. You cannot migrate it to a different host, because that would cause a serious conflict and also disconnect the protected VMs. The only thing you can do is place the host in Maintenance Mode, MANUALLY (*GASP*) migrate all of the protected VMs, and then power the vShield Agent off. So much for automated patch management. We're back to the "oughts."
Conclusion
As I said, vShield Zones is a 1.x product. It's a great firewall, but it has a few gotchas that you need to consider, and the benefits may well outweigh the negatives. vSphere, however, is a 4.0 product, and some of this should be addressable by tweaking vCenter or host settings.
vShield Zones should be smart enough to allow us to select specific port groups to protect rather than an entire vSwitch. I guess whatever scripting is being done in the background will need to be changed for this. Maybe we need a Ghetto vShield?
One of the REALLY smart people at VMware should be able to tell us the “order of migration” when a host is placed in Maintenance Mode. Once that is determined, there is probably a configuration file somewhere that we could tweak to change it.
There should be a way to set up automatic startup and shutdown for individual VMs. The Startup/Shutdown settings were more or less deprecated once DRS was introduced; the only time they are useful is on a stand-alone host or in a non-DRS cluster. I guess the only workaround would be to add a script somewhere in rc.d or rc.local to start these VMs, but how can that be done with ESXi, and is it even supported on ESX or ESXi?
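Just to illustrate the idea (and to be clear, this is an unsupported hack, not a blessed procedure), on classic ESX you could append something like the following to /etc/rc.d/rc.local to power on the local vShield Agent at boot. The .vmx path here is made up for the example:

# Unsupported example: power on the local vShield Agent after the host boots.
VSHIELD_VMX="/vmfs/volumes/datastore1/vshield-agent-01/vshield-agent-01.vmx"

# Only start it if it is not already running.
if ! vmware-cmd "$VSHIELD_VMX" getstate | grep -q "= on"; then
    vmware-cmd "$VSHIELD_VMX" start
fi

On ESXi, the rough equivalent would be vim-cmd vmsvc/power.on with the VM ID from vim-cmd vmsvc/getallvms, but that still leaves the supportability question wide open.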
I brought these issues up with some VMware engineers and they assure me that they are working on this. Hopefully they will figure it out soon. I hate doing things manually. It seems like it is anti-cloud.
Dave,
I'm curious whether "Gotcha #1" only applies to the standard vSwitch. If I'm using vDS or Nexus 1000v, will I need to vMotion all VMs off the vSphere host before installing the vShield (for no outage)? I've got Nexus 1000V in my lab, maybe I'll try this out…
Thanks,
Brad
@Brad Hedlund
I have not tried it yet, but from what I can see in the Admin Guide, using a vNDS will require an outage or you will need to vMotion everything off before making the changes. I didn’t see where it actually SAYS that in the instructions, but at some point you need to assign the pNICs to the vNDS. With a 1000v, you actually create a Virtual Service Domain, so you may be able to do it without more than a momentary outage.
DRS/HA seems to be an after thought with other VMware products as well. Take Lab Manager for example.
-In the earlier releases, VM portability wasn’t supported.
-Things got better in Lab Manager 3, where DRS/HA were supported; however, not for configurations with VMs on an internal vSwitch (private network, network template, etc.)
-Now in Lab Manager 4, VMware advertises that the prior restrictions are lifted by utilizing vDS, even saying that DPM is a great use case for Lab Manager (in concept, it is). Unfortunately, when fencing or connecting virtual to physical networks, a utility VM is created on the host which cannot be VMotioned or managed, which in turn prevents things like Maintenance Mode, DRS, DPM, and patching from completing successfully. It's a nasty catch-22.
VMware eventually gets there, and they are still by far the innovation kings, but some days are frustrating from a "what were you thinking" standpoint. I think some of the technology is rushed to market due to the need to stay ahead of the competition: 1.0 GA is released while it still has beta functionality baked in.
Jas
@Jason Boche
Don’t get me started with Lab Manager. You need to be part admin, part magician and part bartender to get it backed up properly. Like you said “what were you thinking?”
Currently I’m doing a rollout of Nexus 1000v with vShield Zones and I have got to say this is the absolute best way to implement it.
Basically you define a construct called a "Virtual Service Domain," or VSD, on your Nexus 1000v and tie it to a pair of port-profiles that you assign to your vShield agent: an inside and an outside. On either end of this is what is called a "service port."
Then you add "virtual-service-domain foo" to your VM port-profiles, and anything bound for a different port group/VLAN/physical box gets funneled through the service port to be handled by the vShield.
Doing it this way means no vpxd hackery, and it also means it's easy to set up on a 2-NIC box (such as a UCS blade or any blade with 2x10gig).
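To make it a little more concrete, the port-profile side ends up looking roughly like this. The profile names and VLAN are made up, and the exact syntax may vary by 1000v release, so check the VSD documentation:

! Service port-profiles assigned to the vShield agent's inside/outside vNICs
port-profile vshield-inside
  vmware port-group
  switchport access vlan 100
  virtual-service-domain vsd-pci
  service-port inside default-action drop
  no shutdown
  state enabled
port-profile vshield-outside
  vmware port-group
  switchport access vlan 100
  virtual-service-domain vsd-pci
  service-port outside default-action drop
  no shutdown
  state enabled
! Member port-profile for the protected VMs; their traffic is funneled
! through the vShield's service ports
port-profile vm-protected
  vmware port-group
  switchport access vlan 100
  virtual-service-domain vsd-pci
  no shutdown
  state enabled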
I would love to sit down and chat with some people about how the VSD actually works (the mechanics under the hood). If anyone from Cisco on the Tasman campus is up for it, lunch is on me!
Thanks, Dave, for sharing this. It confirmed what I suspected all along. I think Gotcha #4 is upgrades: what happens to the VMs when you upgrade/patch the vShield Zones VM?
Thanks, Russell, for sharing that the Nexus addresses this. I'm still curious how it overcomes it, since the "actual" switch is still local on each ESX host.
e1@Singapore
The VSD basically takes anything bound for the physical wire, or for a different port-profile, and pushes it to the vShield agent. Since we aren't creating a wholly separate vSwitch to put the VMs on, nothing changes from vCenter's side of the house except that the vShield just appears to be working automatically. It basically provides a software solution by forwarding VSD member traffic to the vShield.
When you patch the agent/service VM, there is a brief outage with the current release (which I prefer to it failing open and passing traffic unfiltered).
Just an update:
As posted on http://www.virtualization.info/2010/03/on-vmware-vshield-zones-40-limitations.html
I did not know that vShield Zones was really a 4.x release, and I may not have made it clear enough here: only Gotcha #1 is vShield related. Everything else has to do with vCenter integration.
vShield is a "great firewall"? Are you kidding me? Look at any commercial firewall and tell me if vShield has 1% of the capabilities. You do not even have managed named objects, by the way (net/user/port, etc.). Do you even have any $50-firewall functionality in there?
@manythanks
Actually, I think it IS a great firewall for the price (FREE!), and you can manage it with pretty good granularity. This post was about the original vShield Zones that came with vSphere 4.0; the new vShield line that is currently GA is even better. But I am not a firewall "guru" and would defer to someone with more expertise in these things. I have not confirmed everything yet, but most of these "gotchas" are eliminated in the new product.
Dave