Highly available VCSA & PSC 6.5 for two sites

In this article I will go through design options for vCenter Server and Platform Services Controller (PSC) in a two-site deployment. I made some assumptions to narrow down the number of options for the given requirements.

Assumptions:

  • Greenfield deployment, I will not take into consideration any migration scenarios
  • No vSphere Storage Metro Cluster/Stretched Cluster
  • VMware vSphere Enterprise Plus licence
  • vSphere 6.5

Requirements:

  • Design must cover two datacenters, close to each other
  • vCenter Server & PSC outage time must be minimized, preferably no SPOF

Design considerations

First of all, before going through the design options, we need to answer some questions. Among many others, we need to know:

  • the size of the environment and how far it must be able to scale
  • if load balancer can be used
  • latency and throughput between sites

Secondly, during the design we need to make some decisions:

vCenter Server: Windows vs Linux Appliance (VCSA)

This is pretty straightforward. The appliance version gives you native vCenter High Availability, native backup/restore and vSphere Update Manager (VUM) included. It is easier to deploy, lighter, and has a smaller attack surface. Besides that, VMware deprecated vCenter Server for Windows with the release of vSphere 6.7, and the next major version of vSphere will not include a Windows version at all. Only one narrow case comes to my mind where you could consider a Windows deployment in a greenfield environment: if you plan to use OS agent-based monitoring, it will not work on VCSA, as the appliance is a black box. Still, I would use VCSA and monitor it with vROps or another agentless tool.

When it comes to the database, the PostgreSQL database bundled with VCSA 6.5 supports up to 2,000 hosts and 25,000 powered-on virtual machines, so an external Oracle database is not needed in most cases.

vCenter Server: High Availability

vSphere 6.5 introduced a new feature: vCenter Server High Availability (vCenter HA). It deploys three appliances: an active, a passive and a witness node. The three nodes communicate over a private network and continuously replicate data, but only the active node has an active management interface. When the active node fails, the passive node takes over all vCenter tasks and becomes active. The VMware whitepaper listed under Reference Materials describes vCenter HA in more detail, including the performance impact of enabling it.

When designing for a multi-site environment we need to take the maximum latency between sites into consideration. vCenter HA assumes good network connectivity (latency below 10 ms and high bandwidth) between the active and passive nodes in order to guarantee a zero recovery point objective (RPO). It is worth noting that vCenter HA supports both embedded and external Platform Services Controller, which will be important when we go through the design options.
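The 10 ms assumption can be checked against measured round-trip times before committing to a vCenter HA cluster that spans sites. Below is a minimal sketch in Python, assuming the RTT samples (in milliseconds) have already been collected between the planned active and passive node locations. The function name and the choice to judge by the worst sample rather than the average are my own assumptions, not VMware guidance.

```python
VCHA_MAX_RTT_MS = 10.0  # threshold quoted in the paragraph above


def vcha_latency_ok(rtt_samples_ms, limit_ms=VCHA_MAX_RTT_MS):
    """Return True if every RTT sample is below the limit.

    Judging by the worst sample instead of the mean is a conservative
    assumption made for this sketch.
    """
    if not rtt_samples_ms:
        raise ValueError("at least one RTT sample is required")
    return max(rtt_samples_ms) < limit_ms


print(vcha_latency_ok([1.8, 2.1, 2.4]))   # True
print(vcha_latency_ok([4.0, 12.5, 3.9]))  # False
```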

PSC: external vs embedded deployment

The Platform Services Controller can be installed in two deployment types: embedded into vCenter Server or as a separate (external) instance. In vSphere 6.5 an external deployment is required when you want to deploy multiple vCenter Servers and connect them via Enhanced Linked Mode (ELM). This type of deployment also gives you better scalability, as you can separate PSC and vCenter Server services. Be aware that in vSphere 6.7 you don’t need an external PSC to use ELM.

PSC: High Availability

It is possible to configure PSC in HA mode. This setup requires an external load balancer, either physical or virtual. The load balancer does not actually balance incoming requests across the backend PSC nodes. It may seem that it does, as all PSCs behind the load balancer are in multi-master replication mode, but the load balancer itself is configured with affinity to just a single PSC node. From the vCenter Server’s point of view, only a single PSC is really active in servicing requests. Theoretically you could survive without a load balancer and repoint vCenter to the second PSC manually (or by script) when the first one fails, but repointing the active node of a vCenter HA (VCHA) cluster to a different PSC is unsupported.
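The affinity behaviour described above can be illustrated with a toy model. This is only a sketch of the idea (every request goes to one preferred PSC, and another node is used only when the preferred one is unhealthy); it is not how any particular load balancer implements persistence, and the node names are hypothetical.

```python
def pick_psc(nodes, healthy):
    """Return the first healthy PSC in priority order, or None.

    nodes:   PSC names in priority order (hypothetical names).
    healthy: set of names currently passing the health check.
    """
    for name in nodes:
        if name in healthy:
            return name
    return None


psc_nodes = ["psc-a", "psc-b"]  # hypothetical node names

# Both nodes healthy: every request still lands on the same node.
print([pick_psc(psc_nodes, {"psc-a", "psc-b"}) for _ in range(3)])

# Preferred node down: traffic moves to the second node.
print(pick_psc(psc_nodes, {"psc-b"}))
```

Note that in this simplified model traffic returns to the first node as soon as it is healthy again; a real load balancer may keep existing sessions on the second node.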

Here again we need to discuss latency. VMware recommends no more than 100 ms RTT between Platform Services Controllers spanning sites and no more than 10 ms RTT between PSCs within a site.
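Since the limit depends on whether two PSCs share a site, a small helper can apply the right threshold to each replication link. The sketch below assumes each link is described by the two site names and a measured RTT in milliseconds; the site names and sample values are made up for illustration.

```python
PSC_MAX_RTT_SAME_SITE_MS = 10.0    # PSCs within a site
PSC_MAX_RTT_CROSS_SITE_MS = 100.0  # PSCs spanning sites


def psc_link_ok(site_a, site_b, rtt_ms):
    """Check one PSC-to-PSC link against the limit that applies to it."""
    if site_a == site_b:
        limit = PSC_MAX_RTT_SAME_SITE_MS
    else:
        limit = PSC_MAX_RTT_CROSS_SITE_MS
    return rtt_ms <= limit


# Hypothetical measurements: (site of PSC 1, site of PSC 2, RTT in ms)
links = [
    ("site-a", "site-a", 0.6),
    ("site-a", "site-b", 35.0),
    ("site-a", "site-b", 140.0),
]
for a, b, rtt in links:
    print(a, b, rtt, psc_link_ok(a, b, rtt))
```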

Enhanced Linked Mode

This is ideal for larger environments with multiple vCenter Servers, where there is a need for a single-pane-of-glass view into the environment. Enhanced Linked Mode lets you view and search across all linked vCenter Server systems and replicate roles, permissions, licenses, policies, and tags. This doesn’t bring high availability to vCenter services; it just simplifies management of multiple vCenters. Also, as mentioned before, an external PSC is needed in vSphere 6.5.

Option 1: single vCenter Server in HA

Let’s assume that we want to design a highly scalable environment with highly available management services. An important limitation is that latency between sites must be lower than 10 ms. In that case we can install one vCenter Server in HA mode and connect ESXi hosts from both sites to it. The solution is resilient to the failure of a single component (vCenter or PSC), and that includes the scenario when the whole of site A is down.

RTO: less than 2 minutes through API clients, less than 4 minutes through UI clients (max 5 minutes).

| Design element | Design decision | Justification | Implications |
|---|---|---|---|
| VCSA or Windows | VCSA | Native HA. | No OS agent monitoring possible. |
| Number of vCenter Servers | 1 | Simpler deployment. | All ESXi hosts connected to single vCenter (cross site). |
| vCenter HA | Yes | Improves vCenter Server availability. | Latency between sites must be lower than 10 ms. |
| External PSC | Yes | Improves PSC scalability. | Additional instance needed. |
| PSC HA | Yes | Improves PSC availability. | External load balancer needed, preferably in cluster configuration. |
| Enhanced Linked Mode | No | Just one vCenter used. | |

VCSA and PSC deployment option 1

IMPORTANT NOTE

I would like to point out a very significant implication of this design. It is not fault-tolerant to a failure of the whole of site B. In that case the passive and witness nodes go down, the active node declares itself isolated, and all vCenter services are stopped. A good recommendation here is to place the passive and witness nodes on different ESXi hosts and datastores, maybe even different clusters when possible.
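The asymmetry between a site A and a site B failure can be reasoned through with a small model. The sketch below is a deliberately simplified version of the behaviour described in this article: a node keeps or takes the active role only if it can still reach at least one other cluster node, and all nodes in surviving sites are assumed to reach each other. The function and the placement dictionary are illustrative assumptions, not the actual vCenter HA implementation.

```python
def vcha_outcome(placement, failed_site):
    """Return which node serves vCenter after a whole-site failure.

    placement maps each role ("active", "passive", "witness") to a site.
    """
    alive = {role for role, site in placement.items() if site != failed_site}
    if "active" in alive:
        # The active node needs at least one partner to stay in the cluster.
        if alive & {"passive", "witness"}:
            return "active keeps serving"
        return "active isolated, services stopped"
    # Active node is gone: the passive node needs the witness to take over.
    if {"passive", "witness"} <= alive:
        return "passive takes over"
    return "no service"


# Placement used in option 1: active in site A, passive and witness in site B.
option_1 = {"active": "A", "passive": "B", "witness": "B"}

print(vcha_outcome(option_1, failed_site="A"))  # passive takes over
print(vcha_outcome(option_1, failed_site="B"))  # active isolated, services stopped
```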

Option 2: two vCenter Servers

Now let’s discuss a different scenario, in which latency between sites is higher. In that case we cannot span a vCenter HA cluster between sites; instead we will deploy two vCenter Servers in Enhanced Linked Mode with a single SSO domain. The solution is resilient to the failure of one PSC, and a vCenter Server failure will impact just one site. If the vCenter virtual machine breaks down (hardware, datastore issue, etc.), it can be restarted by vSphere HA.

RTO: for a virtual machine failure, around 20 minutes for vSphere HA to restart the VM, depending on setup.

| Design element | Design decision | Justification | Implications |
|---|---|---|---|
| VCSA or Windows | VCSA | Easier to deploy, lighter, with smaller attack surface. | No OS agent monitoring possible. |
| Number of vCenter Servers | 2 | Separates fault domain to one site. | More difficult to manage, ELM can be introduced. |
| vCenter HA | No | Although enabling vCenter HA for each site is possible, that would increase complexity. | Solution RTO will increase. vSphere HA can be enabled. |
| External PSC | Yes | Required for ELM. | Additional instance needed. |
| PSC HA | Yes | Improves PSC availability. | External load balancer needed, preferably in cluster configuration. |
| Enhanced Linked Mode | Yes | Single-pane-of-glass management. | |

VCSA and PSC deployment option 2
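The two options can be summarised as data, with inter-site latency as the deciding input. The sketch below is my own condensation of the two design tables in this article; it looks only at latency and ignores the other questions raised earlier (environment size, load balancer availability), so treat it as an illustration rather than a decision tool. All names in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    vcenter_servers: int
    vcenter_ha: bool
    external_psc: bool
    psc_ha: bool
    enhanced_linked_mode: bool


OPTION_1 = Design("Option 1: single vCenter Server in HA",
                  vcenter_servers=1, vcenter_ha=True, external_psc=True,
                  psc_ha=True, enhanced_linked_mode=False)
OPTION_2 = Design("Option 2: two vCenter Servers",
                  vcenter_servers=2, vcenter_ha=False, external_psc=True,
                  psc_ha=True, enhanced_linked_mode=True)


def choose_design(inter_site_rtt_ms, vcha_limit_ms=10.0):
    """Pick option 1 only when a stretched vCenter HA cluster is feasible."""
    if inter_site_rtt_ms < vcha_limit_ms:
        return OPTION_1
    return OPTION_2


print(choose_design(3.0).name)   # Option 1: single vCenter Server in HA
print(choose_design(25.0).name)  # Option 2: two vCenter Servers
```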

Reference Materials

VMware vCenter Server High Availability Performance and Best Practices
Great post series about vCenter HA
Enhanced Linked Mode (ELM) vs Hybrid Linked Mode (HLM)
