NSX Cluster Sizing – why you need 4 ESXi Hosts

While this article discusses ESXi clusters for the Federation Enterprise Hybrid Cloud 3.1, which contains what is called an NEI Pod, it is somewhat applicable to Edge Clusters in any NSX deployment (although VMware only recommends 3 hosts). I work on FEHC for EMC, so this is to explain why 4 hosts are mandated for the NEI Pod.

Whether it's called an NEI Pod, an Edge Cluster, or anything else, we are referring to dedicated hardware for running VMware NSX Edge Services Gateways (ESGs) and Distributed Logical Router (DLR) Control VMs. In FEHC these look like this:

Federation Enterprise Hybrid Cloud 3.1 Clusters

The issue is that many customers don’t want to waste a minimum of 4 blades just for the NEI Pod. The solution was a collapsed Management Pod that looked like this:

Federation Enterprise Hybrid Cloud 3.1 Collapsed Management

*Note: Blades here are for illustration; sizing is dependent on many factors.

This saves valuable hardware resources, but some customers still want to keep the NEI Pod separate, perhaps to add extra NICs, or to move from Cisco UCS B-Series to Cisco UCS C-Series for greater bandwidth. This always leads to the same discussion: there are only 3 NSX Controllers, so why are 4 ESXi hosts needed?

Let's look at how routing updates are sent from the DLR Control VMs to the DLR within each ESXi host.

The ESGs are peered with the DLR Control VMs, which send routing updates to the NSX Controllers; the controllers then push those routes to the DLR instance in each ESXi host.

Traffic flows normally
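To make that control-plane path concrete, here is a toy Python sketch of the propagation just described. It is a simple message-passing model, not VMware's actual implementation; the prefix, next hop, and host names are made up.

```python
# A toy model of the NSX control-plane path described above. This is
# illustrative only; prefix, next hop, and host names are made up.

def propagate_route_update(prefix, next_hop):
    """Trace a routing update from an ESG down to the DLR in each ESXi host."""
    hops = []
    # 1. The ESG advertises the prefix to the DLR Control VM over the routing peering.
    hops.append(("ESG -> DLR Control VM", prefix, next_hop))
    # 2. The DLR Control VM pushes the learned route to the NSX Controller cluster.
    hops.append(("DLR Control VM -> NSX Controllers", prefix, next_hop))
    # 3. The controllers distribute the route to the DLR instance in every host.
    for host in ("esxi-01", "esxi-02", "esxi-03", "esxi-04"):
        hops.append((f"NSX Controllers -> DLR on {host}", prefix, next_hop))
    return hops

for hop in propagate_route_update("10.10.0.0/24", "172.16.1.1"):
    print(*hop)
```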

Now consider a failure of the ESXi host containing one of our ESGs:

Host with ESG Fails

This means that traffic flowing over the failed ESG needs to be rerouted over a surviving ESG, which happens as depicted:

Host with NSX ESG Fails

All is fine here, but what would happen if an ESG and our active DLR Control VM were on the same host?

Host 1 Failed with NSX Edge and DLR Controller

This is more serious: the passive DLR Control VM first has to realise that the active DLR Control VM has failed, then become active itself, before it can send updates to the NSX Controllers. This adds to the time taken to reroute traffic over the remaining ESGs, as shown below:

NSX ESG and NSX DLR Control VM Failed
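To get a feel for why this second scenario hurts, here is an illustrative Python sketch comparing the two failures. Every timer value is a hypothetical placeholder rather than a VMware default; real convergence times depend on routing protocol timers, HA settings, and NSX version.

```python
# Illustrative comparison of the two failure scenarios. Every timer below is
# a hypothetical placeholder, NOT a VMware default.

ECMP_RECONVERGENCE = 3.0   # seconds for traffic to shift to surviving ESGs (assumed)

# Scenario 1: only the ESG's host fails. The active DLR Control VM is still
# running, so it simply withdraws the dead peer's routes.
esg_only = ECMP_RECONVERGENCE

# Scenario 2: the ESG and the ACTIVE DLR Control VM fail together.
HA_DETECTION    = 15.0  # passive Control VM realising the active one is gone (assumed)
HA_TAKEOVER     = 2.0   # passive Control VM becoming active (assumed)
CONTROLLER_PUSH = 1.0   # pushing updated routes via the NSX Controllers (assumed)

esg_and_dlr = HA_DETECTION + HA_TAKEOVER + CONTROLLER_PUSH + ECMP_RECONVERGENCE

print(f"ESG failure only:            ~{esg_only:.0f}s of disruption")
print(f"ESG + active DLR Control VM: ~{esg_and_dlr:.0f}s of disruption")
```

Whatever the actual numbers in a given environment, the point is that the second scenario stacks the Control VM failover on top of the normal ESG reconvergence.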

So that explains why you need 4 ESXi hosts, but we also need DRS anti-affinity rules (see the sketch after this list) to separate:

  • DLR Control VMs from our ESGs
  • NSX Controllers
  • NSX Load Balancers
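
As a sketch of how one of these rules could be created programmatically, here is a minimal pyVmomi example adding a VM anti-affinity rule. The vCenter address, credentials, cluster name, and VM names are all placeholders for whatever your environment uses.

```python
# A minimal pyVmomi sketch creating a DRS anti-affinity rule, here keeping a
# DLR Control VM and an ESG on different hosts. All names are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="changeme", sslContext=ctx)
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

cluster = find_by_name(vim.ClusterComputeResource, "NEI-Pod")
vms = [find_by_name(vim.VirtualMachine, n) for n in ("DLR-Control-VM-0", "ESG-0")]

# An anti-affinity rule forces DRS to keep the listed VMs on separate hosts.
rule = vim.cluster.AntiAffinityRuleSpec(name="Separate-DLR-Control-from-ESG",
                                        enabled=True, vm=vms)
spec = vim.cluster.ConfigSpecEx(rulesSpec=[vim.cluster.RuleSpec(operation="add",
                                                                info=rule)])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
# Repeat with the appropriate VM lists for the NSX Controllers and Load Balancers.

Disconnect(si)
```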

We can make optimal use of our hardware by reversing the layout for the Blue DLRs and ESGs versus the Green DLRs and ESGs, as shown in the diagram below.

Better NSX Device to ESXi Placement
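
As a rough illustration, here is one hypothetical placement consistent with that reversed layout (the diagram above is authoritative), together with a small check that no host pairs a colour's ESG with its own DLR Control VM. All host and VM names are illustrative.

```python
# A hypothetical placement consistent with the reversed layout; the diagram
# above is authoritative, and all names here are illustrative only.
placement = {
    "esxi-01": {"Blue-ESG-0", "Green-DLR-Control-Active"},
    "esxi-02": {"Blue-ESG-1", "Green-DLR-Control-Passive"},
    "esxi-03": {"Green-ESG-0", "Blue-DLR-Control-Active"},
    "esxi-04": {"Green-ESG-1", "Blue-DLR-Control-Passive"},
}

def colocation_violations(placement):
    """Yield hosts running both an ESG and a DLR Control VM of the same colour,
    which would recreate the double-failure scenario discussed above."""
    for host, vms in sorted(placement.items()):
        for colour in ("Blue", "Green"):
            has_esg = any(v.startswith(f"{colour}-ESG") for v in vms)
            has_dlr = any(v.startswith(f"{colour}-DLR") for v in vms)
            if has_esg and has_dlr:
                yield host, colour

violations = list(colocation_violations(placement))
print(violations or "No host pairs a colour's ESG with its own DLR Control VM")
```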

As always, if you have any comments or have spotted any mistakes, please leave a comment.
