In this article I will go through design options for vCenter Server and Platform Services Controller (PSC) in a two-site deployment. I made some assumptions to narrow the number of options for the given requirements.
Assumptions:
- Greenfield deployment; I will not take any migration scenarios into consideration
- No vSphere Storage Metro Cluster/Stretched Cluster
- VMware vSphere Enterprise Plus licence
- vSphere 6.5
Requirements:
- Design must cover two datacenters, close to each other
- vCenter Server & PSC outage time must be minimized, preferably no SPOF
Design considerations
Before going to the design options, we need to answer some questions. Among many others, we need to know:
- the size of the environment and how far it must be able to scale
- whether a load balancer can be used
- latency and throughput between the sites
Secondly, during the design we need to make some decisions:
vCenter Server: Windows vs Linux Appliance (VCSA)
This is pretty straightforward. The appliance version gives you native vCenter High Availability, native backup/restore and VUM included. It is easier to deploy and lighter, with a smaller attack surface. Besides that, VMware deprecated vCenter Server for Windows with the release of vSphere 6.7, and the next major version of vSphere will not include a Windows version at all. One narrow case comes to my mind where you could still consider a Windows deployment in a greenfield environment: if you plan to use OS agent monitoring, it will not work with VCSA, as the appliance is a black box. Still, I would use VCSA and monitor it using vROps or another agentless tool.
When it comes to the database, the PostgreSQL database bundled with VCSA 6.5 supports up to 2,000 hosts and 25,000 powered-on virtual machines (the 1,000 hosts and 10,000 virtual machines limit applied to VCSA 6.0), so an external database is not needed.
vCenter Server: High Availability
vSphere 6.5 introduced a new feature: vCenter Server High Availability. It deploys three appliances: an active, a passive and a witness node. The three nodes communicate over a private network and the active node continuously replicates its data to the passive node, but only the active node has an active management interface. During a failure, the passive node takes over all vCenter tasks and becomes active. The VMware whitepaper listed under Reference Materials describes vCenter HA in more detail, including the performance impact when it is enabled.
When designing for a multi-site environment we need to take into consideration the maximum latency between sites. vCenter HA assumes good network connectivity (less than 10 ms latency and high bandwidth) between the active and passive nodes in order to guarantee a zero recovery point objective (RPO). It is worth noting that vCenter HA supports both embedded and external Platform Services Controller, which will be important when we go through the design options.
PSC: external vs embedded deployment
Platform Services Controller can be installed in two deployment types: embedded into vCenter Server or as a separate instance. External deployment is required when you want to deploy more vCenter Servers and connect them via Enhanced Linked Mode (ELM). This type of deployment also gives you better scalability, as you can separate PSC and vCenter Server services. Be aware that in vSphere 6.7 you don’t need an external PSC to use ELM.
PSC: High Availability
It is possible to configure PSC in HA mode. This setup requires an external load balancer, either physical or virtual. The load balancer does not actually balance incoming requests by spreading the load across the backend PSC nodes. It may seem that it does, as all PSCs behind the load balancer are in multi-master replication mode, but the load balancer itself is configured with affinity to a single PSC node. From the vCenter Server’s point of view, only a single PSC is really active in servicing requests. Theoretically you could survive without a load balancer and repoint vCenter to the second PSC manually (or by script) when the first fails, but repointing the active node of a VCHA cluster to a different PSC is unsupported.
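The affinity behaviour described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not a load balancer configuration: the node names `psc-a` and `psc-b`, the function `pick_psc`, and the choice to stay on the surviving node after a failover (no automatic failback) are assumptions made for the example.

```python
from typing import Iterable, Optional, Sequence


def pick_psc(nodes: Sequence[str],
             healthy: Iterable[str],
             current: Optional[str] = None) -> Optional[str]:
    """Return the single PSC node that should receive all traffic.

    nodes   -- backend PSC nodes in order of preference (hypothetical names)
    healthy -- nodes that currently pass the health check
    current -- node the traffic is pinned to right now, if any
    """
    healthy_set = set(healthy)
    # Affinity: keep using the current node for as long as it is healthy.
    if current in healthy_set:
        return current
    # Otherwise fail over to the first healthy node in preference order.
    for node in nodes:
        if node in healthy_set:
            return node
    # No healthy PSC behind the load balancer.
    return None
```

With both nodes healthy, every request goes to `psc-a`. The second node only starts serving once `psc-a` fails its health check, which is why the setup improves availability rather than throughput.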
Here again we need to discuss latency. VMware recommends no more than 100 ms RTT between Platform Services Controllers spanning sites and no more than 10 ms RTT between PSCs within a site.
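To make the latency limits concrete, here is a small sketch that checks a set of measured inter-site round-trip times against the two cross-site thresholds mentioned in this article (less than 10 ms for vCenter HA, no more than 100 ms for PSCs spanning sites). The function name `evaluate_rtt` and the decision to judge by the worst sample are my assumptions; how you collect the samples (ping, monitoring tool) is left out.

```python
from typing import Dict, Sequence, Union

# Thresholds taken from the text of this article.
VCHA_MAX_RTT_MS = 10.0             # vCenter HA: must stay below this value
PSC_CROSS_SITE_MAX_RTT_MS = 100.0  # PSCs spanning sites: no higher than this


def evaluate_rtt(samples_ms: Sequence[float]) -> Dict[str, Union[float, bool]]:
    """Compare the worst measured inter-site RTT with both limits."""
    if not samples_ms:
        raise ValueError("at least one RTT sample is required")
    worst = max(samples_ms)  # assumption: design for the worst observed value
    return {
        "worst_rtt_ms": worst,
        "vcha_across_sites_ok": worst < VCHA_MAX_RTT_MS,
        "psc_across_sites_ok": worst <= PSC_CROSS_SITE_MAX_RTT_MS,
    }
```

For example, `evaluate_rtt([12.0, 40.0])` reports that PSC replication between the sites is within the recommendation while a vCenter HA cluster stretched across the sites is not.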
Enhanced Linked Mode
This is ideal for larger environments with multiple vCenter Servers, where there is a need for a single-pane-of-glass view into the environment. Enhanced Linked Mode lets you view and search across all linked vCenter Server systems and replicates roles, permissions, licenses, policies, and tags. It doesn’t bring high availability for vCenter services; it just simplifies management of multiple vCenters. Also, as mentioned before, an external PSC is needed in vSphere 6.5.
Option 1: single vCenter Server in HA
Let's assume that we want to design a highly scalable environment with highly available management services. An important limitation is that latency between sites must be lower than 10 ms. In that case we can install one vCenter Server in HA mode and connect ESXi hosts from both sites to it. The solution is resilient to the failure of a single component (vCenter or PSC), and that includes the scenario where the whole of site A is down.
RTO: less than 2 minutes through API clients, less than 4 minutes through UI clients (max 5 minutes).
Design element | Design decision | Justification | Implications |
VCSA or Windows | VCSA | Native HA. | No OS agent monitoring possible. |
Number of vCenter Servers | 1 | Simpler deployment. | All ESXi hosts connected to a single vCenter (cross-site). |
vCenter HA | Yes | Improves vCenter Server availability. | Latency between sites must be lower than 10 ms. |
External PSC | Yes | Improves PSC scalability. | Additional instance needed. |
PSC HA | Yes | Improves PSC availability. | External load balancer needed, preferably in cluster configuration. |
Enhanced Linked Mode | No | Just one vCenter used. | – |
IMPORTANT NOTE
I would like to point out a very significant implication of this design: it is not fault-tolerant to a failure of the whole of site B. In that case the passive and witness nodes go down, the active node declares itself isolated, and all services are stopped. A good recommendation here is to place the passive and witness nodes on different ESXi hosts and datastores, and even in different clusters when possible.
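The site-failure behaviour from this note can be reproduced with a tiny model. It is a simplified sketch under stated assumptions: only whole-site outages are modelled (no network partitions between sites), a node keeps or takes the active role only if it can still reach at least one other node, and the names `Node` and `serving_role` are made up for the illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Node:
    role: str  # "active", "passive" or "witness"
    site: str  # site where the node's VM runs


def serving_role(nodes: Iterable[Node], failed_sites: Set[str]) -> Optional[str]:
    """Return which node serves vCenter after the given sites fail, or None."""
    alive = {n.role for n in nodes if n.site not in failed_sites}
    # A single surviving node considers itself isolated and stops its services.
    if len(alive) < 2:
        return None
    # The active node keeps running while it can reach passive or witness.
    if "active" in alive:
        return "active"
    # Active is gone, passive and witness survive: the passive node takes over.
    return "passive"


# Layout from Option 1: active node in site A, passive and witness in site B.
option_1_layout = [Node("active", "A"), Node("passive", "B"), Node("witness", "B")]
print(serving_role(option_1_layout, {"A"}))  # passive takes over
print(serving_role(option_1_layout, {"B"}))  # None: active node is isolated
```

In this model, moving the witness to a hypothetical third location would let the cluster survive the loss of either site, provided the latency requirement is still met.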
Option 2: two vCenter Servers
Now let's discuss a different scenario, where latency between the sites is higher. In that case we cannot span a vCenter HA cluster between sites; instead we deploy two vCenter Servers in Enhanced Linked Mode with a single SSO domain. The solution is resilient to the failure of one PSC, and a vCenter Server failure impacts just one site. If a vCenter virtual machine breaks down (hardware, datastore issue, etc.), it can be restarted by vSphere HA.
RTO: around 20 minutes for vSphere HA to restart the VM after a virtual machine failure, depending on the setup.
Design element | Design decision | Justification | Implications |
VCSA or Windows | VCSA | Easier to deploy, lighter, with smaller attack surface. | No OS agent monitoring possible. |
Number of vCenter Servers | 2 | Separates fault domain to one site. | More difficult to manage; ELM can be introduced. |
vCenter HA | No | Although enabling vCenter HA for each site is possible, that would increase complexity. | Solution RTO will increase. vSphere HA can be enabled. |
External PSC | Yes | Required for ELM. | Additional instance needed. |
PSC HA | Yes | Improves PSC availability. | External load balancer needed, preferably in cluster configuration. |
Enhanced Linked Mode | Yes | Single-pane-of-glass management. | – |
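As a compact summary, the two decision tables can be written down as data and compared. The sketch below is only an illustration: the dictionary and function names are made up, and treating latency above 100 ms as "neither option fits" is my inference from the PSC latency recommendation, not something tested here.

```python
from typing import Dict, Optional, Tuple

# Design decisions copied from the two tables in this article.
OPTION_1: Dict[str, str] = {
    "VCSA or Windows": "VCSA",
    "Number of vCenter Servers": "1",
    "vCenter HA": "Yes",
    "External PSC": "Yes",
    "PSC HA": "Yes",
    "Enhanced Linked Mode": "No",
}
OPTION_2: Dict[str, str] = {
    "VCSA or Windows": "VCSA",
    "Number of vCenter Servers": "2",
    "vCenter HA": "No",
    "External PSC": "Yes",
    "PSC HA": "Yes",
    "Enhanced Linked Mode": "Yes",
}


def differing_decisions(a: Dict[str, str],
                        b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return the design elements on which two options disagree."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def choose_option(rtt_ms: float) -> Optional[str]:
    """Pick an option from the inter-site RTT (thresholds from the text)."""
    if rtt_ms < 10:
        return "Option 1"  # vCenter HA can span both sites
    if rtt_ms <= 100:
        return "Option 2"  # PSCs can still replicate across sites
    return None            # assumption: outside the recommended PSC latency


print(differing_decisions(OPTION_1, OPTION_2))
```

The options differ only in the number of vCenter Servers, the use of vCenter HA and the use of Enhanced Linked Mode; the PSC layer is the same in both.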
Reference Materials
VMware vCenter Server High Availability Performance and Best Practices
Great post series about vCenter HA
Enhanced Linked Mode (ELM) vs Hybrid Linked Mode (HLM)