It looks like I may have won funding to go all the way to "Fault Tolerance" in my VMware configuration at work. Thanks to recent streamlining of our servers and applications, all of the current servers can run on a single hex-core i7 with some SSDs if necessary. I think this means I should be able to get away with three servers:
ESXi #1
ESXi #2
NAS/iSCSI host
Sounds like a fun project; I hope it ends up meeting the requirements. The Fault Tolerance (FT) feature has some limitations with regard to virtualized resources, so I hope that whatever you plan to virtualize and run in FT mode fits the requirements before you undertake this project. The list of requirements can be seen here and an FAQ here. More requirements and restriction information is available here. Basically, you'll need to see whether your intended guest OS is on the compatibility list, along with the physical hardware you're using. You'll also need to make sure the intended amount of RAM (64 GB max), vCPUs (only 1 vCPU supported), disks (16 max), and many other items fall within the limits listed in the last URL.
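If it helps, here is a rough sketch (plain Python, nothing VMware-specific) of the kind of pre-flight check you could run against each VM you plan to protect. The limits in it are the ones mentioned above, and the sample VMs are made up; verify the exact numbers for your vSphere version against the linked docs before relying on this.

```python
# Rough pre-flight check of planned FT VMs against the limits noted above
# (1 vCPU, 64 GB RAM max, 16 virtual disks). Verify the exact values for
# your vSphere version against the requirements/FAQ links.

FT_LIMITS = {
    "max_vcpus": 1,       # FT supports only a single vCPU per VM
    "max_ram_gb": 64,     # RAM ceiling mentioned above
    "max_disks": 16,      # virtual disk count limit
}

def check_ft_candidate(vm):
    """Return a list of reasons the VM would not qualify for FT (empty = OK)."""
    problems = []
    if vm["vcpus"] > FT_LIMITS["max_vcpus"]:
        problems.append(f"{vm['vcpus']} vCPUs configured, FT allows {FT_LIMITS['max_vcpus']}")
    if vm["ram_gb"] > FT_LIMITS["max_ram_gb"]:
        problems.append(f"{vm['ram_gb']} GB RAM configured, FT allows {FT_LIMITS['max_ram_gb']} GB")
    if vm["disks"] > FT_LIMITS["max_disks"]:
        problems.append(f"{vm['disks']} disks configured, FT allows {FT_LIMITS['max_disks']}")
    return problems

# Hypothetical VMs you might be planning to consolidate onto the hex-core hosts.
planned_vms = [
    {"name": "web01", "vcpus": 1, "ram_gb": 8,  "disks": 2},
    {"name": "db01",  "vcpus": 2, "ram_gb": 32, "disks": 4},  # 2 vCPUs -> not FT-eligible
]

for vm in planned_vms:
    issues = check_ft_candidate(vm)
    print(f"{vm['name']}: {'OK for FT' if not issues else '; '.join(issues)}")
```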
I was also reading about people running the iSCSI host as a VM on one of the machines, which brings me to my first question:
How can it be fault tolerant if there is only one storage machine? Surely this single point of failure negates much of the tolerance? Is failure of the storage system so much less likely that this is a common scenario? Can the iSCSI host also have fail-over redundancy somehow?
This one is tricky but certainly feasible. You can run a NAS device as a VM and share iSCSI targets. I haven't confirmed this first-hand, but in theory I don't see why it wouldn't work. However, I wouldn't recommend it.
There are two parts to answering this question. First, it wouldn't be completely fault tolerant in this situation, for the reasons you've suggested. This is why I wouldn't recommend a VM as your iSCSI/NFS storage device.
The second part applies to a physical storage array. At the enterprise level (which typically describes the kind of customer considering shelling out the money for an enterprise license), many arrays have multiple storage processors (SPs) intended to handle failures. Each SP has its own power supply, fans, and so on. In the array I manage at work, there are two storage processors. Each SP has two (often more) Fibre Channel connections, which run to two different FC HBAs inside the ESXi host, making sure to crisscross the paths. Something like this:
SPA 1 -> HBA 1 port 1
SPA 2 -> HBA 2 port 1
SPB 1 -> HBA 1 port 2
SPB 2 -> HBA 2 port 2
(normally there would be a couple of FC switches in this mix, but I left them out to keep the illustration simple)
Anyway, in the case of an SP failure (SPA in this example), multipathing has it covered: the array issues a LUN trespass from SPA to SPB, which continues I/O to the LUN(s) without downtime. ESXi is multipath-aware, and you can configure how it manages the HBA ports.
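To make the failover behaviour concrete, here is a toy model (plain Python, not an ESXi API) of the crisscrossed paths above: each LUN has four paths, and when SPA dies, I/O simply continues on the paths that land on SPB.

```python
# Toy model of the crisscrossed SP -> HBA paths above. Purely illustrative;
# real path selection is handled by ESXi's multipathing stack and by the
# array's own trespass logic.

PATHS = [
    {"sp": "SPA", "hba": "HBA1", "port": 1},
    {"sp": "SPA", "hba": "HBA2", "port": 1},
    {"sp": "SPB", "hba": "HBA1", "port": 2},
    {"sp": "SPB", "hba": "HBA2", "port": 2},
]

def usable_paths(failed_sps=(), failed_hbas=()):
    """Paths that survive a given set of SP and/or HBA failures."""
    return [p for p in PATHS
            if p["sp"] not in failed_sps and p["hba"] not in failed_hbas]

print("All healthy:", usable_paths())
# SPA fails: the LUN is trespassed to SPB and I/O continues on two paths.
print("SPA failed:", usable_paths(failed_sps=("SPA",)))
# SPA and HBA1 fail together: one path (SPB via HBA2) still remains.
print("SPA + HBA1 failed:", usable_paths(failed_sps=("SPA",), failed_hbas=("HBA1",)))
```

The point of the crisscross is that no single SP or HBA failure removes every path to the LUN.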
The storage array also comes with its own enormous battery for the case of a power failure. It will wait for a specified period of time before issuing abort commands, destage the data in cache back onto the disks, and then power down.
In larger environments, two storage arrays would be used, connected over redundant, separate paths, with additional mirroring technologies to allow for a complete array failure. The LUNs would be mirrored in full sync between the two separate storage arrays. That's how the storage side of this can be made fault tolerant, but at the cost of a large increase in price. Only the business can dictate how much downtime costs and what the rate per minute of an outage would cost the company. That's how some of these configurations get justified for our financial partners.
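If you ever go down the dual-array road, the essential behaviour of a full-sync mirror is just "don't acknowledge the write until both arrays have it." A minimal sketch of that idea in generic Python (not any particular vendor's replication product):

```python
# Minimal sketch of a synchronous (full-sync) LUN mirror: a write is only
# acknowledged once both arrays have committed it, so either array can serve
# the LUN on its own if the other fails. Generic illustration only.

class Array:
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, lba, data):
        if not self.online:
            raise IOError(f"{self.name} is offline")
        self.blocks[lba] = data

def mirrored_write(primary, secondary, lba, data):
    """Commit to both arrays before acknowledging the write."""
    primary.write(lba, data)
    secondary.write(lba, data)
    return "ack"   # only reached once both copies are consistent

array_a, array_b = Array("array-a"), Array("array-b")
print(mirrored_write(array_a, array_b, lba=42, data=b"important record"))

# If array-a later fails, array-b already holds an identical copy of block 42.
array_a.online = False
print(array_b.blocks[42])
```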
Long story short, you've identified the weakest link in your configuration as your storage device, and only you can make the call whether it's worth considering an FT configuration with this exception.
You will also want to make sure the networking component is redundant and sufficient for an FT implementation. Gigabit Ethernet is the minimum, and 10Gb is often recommended to keep the FT VMs in sync if there will be multiple FT implementations. This will also need to be a dedicated connection; if it cannot keep up, there will be performance issues. There is a nice PDF with lots of diagrams and a day in the life of an FT VM here.
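For ballparking that FT logging traffic, one rule of thumb I have seen in VMware's FT material is roughly (average disk reads in MB/s × 8 + average network receive in Mbps) × 1.2 per protected VM; treat the formula as my assumption and confirm it against the linked PDF, but it gives you a feel for whether a dedicated gigabit link will hold up. The workload numbers below are made up.

```python
# Back-of-the-envelope estimate of FT logging bandwidth per protected VM.
# The formula is a rule of thumb I have seen in VMware FT material (verify
# against the linked PDF); the workload figures are hypothetical.

def ft_logging_mbps(disk_reads_mbytes_per_s, net_rx_mbps, overhead=1.2):
    """Approximate FT logging traffic in Mbit/s for one FT VM."""
    return (disk_reads_mbytes_per_s * 8 + net_rx_mbps) * overhead

# Hypothetical workloads: (disk reads MB/s, network receive Mbps)
workloads = {"file server": (20, 50), "small db": (40, 10), "web app": (5, 80)}

total = 0
for name, (disk, net) in workloads.items():
    mbps = ft_logging_mbps(disk, net)
    total += mbps
    print(f"{name}: ~{mbps:.0f} Mbit/s of FT logging traffic")
print(f"total: ~{total:.0f} Mbit/s on the dedicated FT logging link")
```

With several FT VMs like these, the total creeps toward saturating a single gigabit link, which is why 10Gb gets recommended.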
Based on my understanding of the VMware licensing, fault tolerance requires vSphere Enterprise @ $3594/processor (license plus one year of support) and an instance of vCenter. As I am only running two ESXi machines, vCenter Foundation should work (supports up to 3 servers) @ $2140 (license plus one year of support). So that is $9328 in licenses, which I can handle. Is anyone familiar enough to back this up?
The licensing components aren't my strongest area, because they're not something we typically have to worry about. I know that's a poor excuse, and it certainly blinds me to the pain that real paying customers have to deal with.
Yes, you are correct that you will need at least the vSphere Enterprise license in order to get the FT feature enabled. Make sure that the amount of vRAM this license offers meets your needs, but also remember that in an FT configuration the mirrored (invisible) secondary running on the second ESXi host consumes the same amount of vRAM with respect to the limit on your license. For the utmost clarity, I would suggest calling VMware with your needs and getting a quote.
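Your math also checks out for two single-socket hosts (2 × $3594 + $2140 = $9328), and the vRAM doubling is easy to see with a quick back-of-the-envelope check. The vRAM entitlement figure below is a placeholder I made up for illustration, so verify what your edition and version actually grant.

```python
# Quick check of the licensing math from the question, plus an illustration of
# how an FT pair counts double against the licensed vRAM pool. Prices are the
# list prices quoted above; the vRAM entitlement is a PLACEHOLDER value, so
# confirm both with VMware for your edition and version.

vsphere_enterprise_per_cpu = 3594     # license + 1 year support, per processor
vcenter_foundation = 2140             # license + 1 year support, up to 3 hosts
cpu_licenses = 2                      # assumes each of the two hosts has one socket

total_cost = cpu_licenses * vsphere_enterprise_per_cpu + vcenter_foundation
print(f"license cost: ${total_cost}")            # $9328, matching the question

vram_per_license_gb = 64              # PLACEHOLDER entitlement, verify for your licenses
vram_pool_gb = cpu_licenses * vram_per_license_gb

ft_vm_ram_gb = 16                     # hypothetical FT-protected VM
consumed_gb = ft_vm_ram_gb * 2        # primary plus the mirrored secondary
print(f"one {ft_vm_ram_gb} GB FT VM uses {consumed_gb} GB of the {vram_pool_gb} GB vRAM pool")
```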
Keep in mind that you can set up and configure this entire environment using the unrestricted 60-day trial to ensure the configuration meets your needs before buying licenses.