This is Part 12 of the Nutanix XCP Deep-Dive, covering ESXi design considerations.
This will be a multi-part series, describing how to design, install, configure and troubleshoot an advanced Nutanix XCP solution from start to finish for vSphere, AHV and Hyper-V deployments:
- Nutanix XCP Deep-Dive – Part 1 – Overview
- Nutanix XCP Deep-Dive – Part 2 – Hardware Architecture
- Nutanix XCP Deep-Dive – Part 3 – Platform Installation
- Nutanix XCP Deep-Dive – Part 4 – Building a Nutanix SE Toolkit
- Nutanix XCP Deep-Dive – Part 5 – Installing ESXi Manually with Phoenix
- Nutanix XCP Deep-Dive – Part 6 – Installing ESXi with Foundation
- Nutanix XCP Deep-Dive – Part 7 – Installing AHV Manually
- Nutanix XCP Deep-Dive – Part 8 – Installing AHV with Foundation
- Nutanix XCP Deep-Dive – Part 9 – Installing Hyper-V Manually with Phoenix
- Nutanix XCP Deep-Dive – Part 10 – Installing Hyper-V with Foundation
- Nutanix XCP Deep-Dive – Part 11 – Benchmark Performance Testing
- Nutanix XCP Deep-Dive – Part 12 – ESXi Design Considerations
- Nutanix XCP Deep-Dive – Part 13 – AHV Design Considerations
- Nutanix XCP Deep-Dive – Part 14 – Hyper-V Design Considerations
- Nutanix XCP Deep-Dive – Part 15 – Data Center Facility Design Considerations
- Nutanix XCP Deep-Dive – Part 16 – The Risks
- Nutanix XCP Deep-Dive – Part 17 – CVM Autopathing with ESXi
- Nutanix XCP Deep-Dive – Part 18 – more to come as the series evolves (Cloud Connect to AWS and Azure, Prism Central, APIs, Metro, DR, etc.)
I have aggregated all of the design considerations I could find that need to be assessed in a Nutanix XCP architecture design with VMware vSphere/ESXi. Brevity and bullet points are used to keep the information concise and readable. If you want more information on a concept, use the NPX Link-O-Rama.
This post will be updated with additional information as part of the NPX Link-O-Rama. If you have content to contribute, post a comment below.
I have separated the design decisions into the areas specified by the NPX blueprint.
Business Goals
- What are the business goals of the solution?
Requirements/Constraints/Assumptions
- What are the requirements, constraints and assumptions of the solution?
Risks
A. Data Center Facility
Logical Design Decisions
- Single-site or Multi-site Data Center Facilities?
- Data Center type – “Bricks & Mortar”, Co-location, Pre-Fabricated or Performance Optimised Data Centers?
- Management & Control Plane will be separated from the Data Plane?
Physical Design Decisions
- Physical location(s) of Data Center Facilities?
- Distances between Data Centers?
- Type of Data Center Facilities?
- Power and Cooling requirements for the solution?
- Can the Data Center Facility handle high density infrastructure?
- Rack layouts for the solution?
B. Virtual Infrastructure Management
Logical Design Decisions
- Number of Pooled Compute, Network and Storage resources?
- What services are you delivering?
- Required availability levels of virtualisation management systems?
- 3rd party integrations: IT Service Management, Infrastructure Management systems, Enterprise services (DNS, LDAP, NTP, PKI, Syslog, SNMP Traps), Vendor Data collection
- Advanced Operations
- Hypervisor Workload Protection mechanisms?
- Hypervisor Workload Resource Balancing mechanisms?
Physical Design Decisions
- Hypervisor: ESXi and which version? (Hyper-V and AHV have been dropped to align with the Conceptual Model/Logical Design)
- Dedicated Management Cluster?
- Standalone or Linked-Mode vCenters?
- vCenter Server version and installation type?
- vCenter Server database?
- vCenter Server protection mechanism?
- vSphere components that will be used? Only use what you need.
- Host profiles?
- Update management of ESXi, VM Hardware version and VMtools?
- Consider using AOS One-Click upgrades for ESXi. They are validated by Nutanix before the JSON file is published.
- Antivirus integration via vShield? If yes, with vCNS vShield Manager or NSX Manager?
- vRealize Suite being used for advanced operations and cloud management?
- vRO for workflows?
- Enterprise Management solution to integrate with?
- Prism Central to aggregate clusters?
- Automated vendor support mechanisms? How will vCenter Support Assistant, Nutanix Pulse and Nutanix Remote Access be used?
- Any Service Desk or Change Management requirements that must be met?
- What are the vSphere HA and vSphere DRS requirements?
- Make sure you understand the mandatory Nutanix configuration settings for vSphere.
- DNS and NTP integration?
- Role Based Access Control and LDAP integration?
- Which vCenter Server User Interface for Administration and Operations?
- What vSphere and Nutanix licencing is required?
- 3rd party software licencing considerations? Per physical socket/core or vCPU? DRS VM-Host rules required?
C. Compute
Logical Design Decisions
- Traditional Monolithic Compute, Server-Side Flash Cache Acceleration with legacy infrastructure, Converged Infrastructure or Hyper-Converged Infrastructure? Obviously this must align with the Storage section.
- Minimum number of Hypervisor Hosts per Cluster?
- Host sizing: Scale Up or Scale Out?
- Homogeneous or Heterogeneous nodes?
- Number of Sockets per Host?
- Host Spanning for Failure Domains?
- Required CPU Capacity?
- Required Memory Capacity?
Physical Design Decisions
- HCI Vendor: Nutanix XCP, Dell XC or Lenovo (all other vendors/technologies have been dropped to align with the Logical Design/Conceptual Model)
- Processor type: Intel (AMD not supported by Nutanix)
- Intel CPU Features: VT-x supported, Hyper-threading, Turbo Boost, NUMA enabled?
- Cluster Hardware and Configuration?
- Inter-Mix rules are being followed?
- Number of vSphere clusters per Nutanix cluster?
- Nutanix family and model number?
- Number of CPU sockets per node?
- Model of Intel Processor, number of cores and GHz per core?
- GPU required?
- Host locations?
- Single Rack, Multi-Rack with striping?
- Cluster Availability requirements?
- Nutanix Redundancy Factor?
- Nutanix Availability Domains?
- Align compute availability with storage availability?
- Future expansion?
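
The CPU capacity, memory capacity and cluster availability questions above can be combined into a rough host count. This is only a minimal sketch: the function name, its parameters, the three-host minimum and the example figures are my own illustrative assumptions, and it ignores CVM overhead, so treat the output as a starting point for a proper sizing exercise.

```python
import math


def hosts_required(cpu_ghz_needed, ram_gb_needed, host_cpu_ghz, host_ram_gb,
                   failures_to_tolerate=1, min_cluster_size=3):
    """Estimate the host count for a cluster (hypothetical helper).

    The demand is covered for both CPU and RAM, spare hosts are added for
    the number of host failures to tolerate, and the result is never below
    min_cluster_size (assumed here to be 3).
    """
    hosts_for_cpu = math.ceil(cpu_ghz_needed / host_cpu_ghz)
    hosts_for_ram = math.ceil(ram_gb_needed / host_ram_gb)
    hosts = max(hosts_for_cpu, hosts_for_ram) + failures_to_tolerate
    return max(hosts, min_cluster_size)


# Example figures are made up: 400 GHz and 3000 GB of demand,
# hosts with 57.6 GHz and 512 GB each, N+1 availability.
print(hosts_required(400, 3000, 57.6, 512))
```

Whichever resource needs more hosts drives the result, which is also a quick way to see whether a design is CPU-bound or memory-bound.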
D. Storage
Logical Design Decisions
- Traditional Monolithic Storage, Server-Side Flash Cache Acceleration with legacy infrastructure, Converged Infrastructure or Hyper-Converged Infrastructure? Obviously this must align with the Compute section.
- Block-based or IP-based Storage Access?
- Homogeneous or Heterogeneous storage nodes?
- Automated storage management?
- RDM devices allowed?
- Hypervisor boot method? DAS, LUN or PXE?
- Thin or Thick provisioning for Back-end and VMs?
- Required storage resources (performance and capacity)?
- Storage replication?
Physical Design Decisions
- HCI Vendor: Nutanix XCP, Dell XC or Lenovo and AOS version? (all other vendors/technologies have been dropped to align with the Logical Design/Conceptual Model) Obviously this must align with the Compute section.
- Usable Storage Calculation, considering Storage Pools, Replication Factor, Usable Capacity and Usable Performance?
- Number of SSD and HDD drives per Node?
- Nutanix used to publish the Diagnostics results in the release notes of each NOS version, but has stopped doing this.
- Also consider Number of Containers, Free-Space Reservations, Deduplication, Compression, Erasure Coding and Acropolis Volumes API.
- Controller VM Sizing across the cluster?
- Capacity nodes required for existing or new clusters?
- Inter-Mix rules are being followed?
- The performance of each release is very subjective and the Diagnostics results are useful as an indicator and benchmark for basic verification.
- Storage performance should be properly validated during the Test Phase of the Implementation Plan.
- The Public version of the Nutanix Sizer Tool does not include storage performance, only capacity. Contact your Nutanix Partner for a cluster design that meets your required performance profile.
- Active Working Set required for each node?
- Self-Encrypting Disks? If yes, consider the KMS requirements.
- ESXi host boot must be from SATA-DOM (USB). This is a Nutanix constraint.
- Default Auto-Tiering (ILM) thresholds?
- VM DirectPath I/O and SR-IOV cannot be used. This is a Nutanix constraint.
- Datastores per Nutanix cluster? Ideally, go with one Datastore per vSphere Cluster.
- Storage DRS and SIOC being used? This is not required.
- Different VMDK shares being used?
- VAAI being used?
- VASA and VM storage profiles? VASA is not supported. VM storage profiles could be manually configured for a multi-container cluster with different settings.
- Asynchronous DR, Metro or Synchronous DR required? (mentioned again in Backup/Recovery and BC/DR sections)
- Future expansion?
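
For the Usable Storage Calculation above, a first-pass capacity estimate can be sketched as follows. The function name, the overhead fraction and the idea of holding back the capacity of one node for rebuilds are my own assumptions for illustration. The sketch ignores Deduplication, Compression and Erasure Coding, and it says nothing about performance, so it does not replace the Nutanix Sizer Tool.

```python
def usable_capacity_tib(nodes, raw_tib_per_node, replication_factor=2,
                        overhead_fraction=0.0, spare_nodes=1):
    """Rough usable capacity in TiB (hypothetical helper).

    spare_nodes is the node capacity held back so the cluster can rebuild
    after a failure; overhead_fraction is an assumed share of raw capacity
    lost to system overhead. Every write is stored replication_factor times.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if not 0 <= overhead_fraction < 1:
        raise ValueError("overhead_fraction must be in [0, 1)")
    if spare_nodes >= nodes:
        raise ValueError("spare_nodes must be smaller than nodes")
    raw_tib = (nodes - spare_nodes) * raw_tib_per_node
    return raw_tib * (1 - overhead_fraction) / replication_factor


# Made-up example: 4 nodes of 20 TiB raw, RF2, 10% overhead, N+1.
print(usable_capacity_tib(4, 20, replication_factor=2, overhead_fraction=0.1))
```

Running the same numbers with a replication factor of 3 shows quickly how much capacity the higher resiliency level costs.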
E. Network
Logical Design Decisions
- Legacy 3-Tier Switch, Collapsed Core or Clos-type Leaf/Spine?
- Clustered Physical or Standalone EoR/ToR Switches?
- Stretched or Per Rack VLANs?
- Functional traffic types separated with vSwitches or VLANs?
- Jumbo Frames?
- Quality of Service?
- Load Balancing?
- IP version?
- Inter-Data Center links, including RTT?
- Required Network Capacity?
- Single vNIC or Multi vNIC VMs allowed?
Physical Design Decisions
- Clos-type Leaf/Spine vendor selection for large installations?
- Blocking or non-blocking Data Center switch fabric?
- If blocking, what is the over-subscription ratio?
- What is the traffic path for North/South and East/West traffic?
- Where are the Layer 3 gateways for each IP Subnet?
- Any Dynamic Routing requirements?
- Is Multi-Cast required?
- End-to-End Jumbo Frames?
- Host interfaces: 1GbE and/or 10GbE? How many per node?
- LAGs or unbonded host interfaces?
- Management overlay required for KVM and IPMI?
- Physical LAN Performance?
- Host interface connectivity matrix?
- Metro Ethernet required between Data Centers?
- QoS, Network Control and vSphere Network I/O Control?
- Edge QoS enforced or End-to-End QoS?
- vSphere NIOC System and User-Defined Network Resource Pools?
- Multi-NIC vMotion?
- VLAN Pruning?
- Spanning Tree considerations?
- VM DirectPath I/O and SR-IOV cannot be used. This is a Nutanix constraint.
- TCP Offload enabled?
- VSS, VDS or Cisco Nexus 1000V?
- Separate vSwitches per Cluster or shared?
- Teaming and Load Balancing?
- VMkernel ports?
- Portgroups?
- VMware NSX-v required?
- Future Expansion?
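
The over-subscription question above comes down to simple arithmetic: host-facing bandwidth divided by uplink bandwidth on a leaf or ToR switch. A minimal sketch, with a hypothetical function name and made-up port counts:

```python
def oversubscription_ratio(downlink_ports, downlink_gbps,
                           uplink_ports, uplink_gbps):
    """Return host-facing bandwidth divided by uplink bandwidth.

    A result of 1.0 or less means the switch is non-blocking towards the
    spine; 3.0 means a 3:1 over-subscription ratio.
    """
    downlink_total = downlink_ports * downlink_gbps
    uplink_total = uplink_ports * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("uplink bandwidth must be greater than zero")
    return downlink_total / uplink_total


# Made-up example: 48 x 10GbE host ports and 4 x 40GbE uplinks.
ratio = oversubscription_ratio(48, 10, 4, 40)
print(f"{ratio:g}:1")
```

Only count the ports that are actually cabled, otherwise the ratio will look worse than the fabric really is.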
F. Backup/Recovery
Logical Design Decisions
- VM Image Backup Frequency?
- Application and Database Consistent Backup Frequency?
- Backup Restore Times?
- Physical Separation of Operational Data and Backup Data?
- Required Backup Resources?
- Required Backup and Restore Performance?
Physical Design Decisions
- VADP used?
- Backup/Recovery solution?
- Backup/Restore mechanism?
- VM-Centric Snapshots?
- Async DR Replication of VM-Centric Snapshots to remote cluster/cloud connect (AWS/Azure)?
- Backup frequency?
- Retention period?
- Backup capacity and performance?
- Fast restore of management cluster direct to host?
- Future expansion?
G. Virtual Machine
Logical Design Decisions
- Standard VM T-shirt sizes?
- VM CPU and RAM management mechanisms used?
- Location of VM files?
- Guest OS standardisation?
- 64-bit and 32-bit?
- Templates used?
Physical Design Decisions
- Standard VMs of what size?
- vApps and Resource Pools?
- VM files on shared storage?
- Standard vDisk setups per VM?
- Thin provisioned vDisks?
- Nutanix or vSphere Snapshots allowed?
- CBT enabled?
- 64-bit/32-bit Guest OS versions?
- vSCSI adapters?
- vNIC adapters?
- VM Hardware version?
- VMtools installed and version?
- VM Options?
- VM Templates?
- VM Template Repository?
- Mission-Critical/Business-Critical Application considerations?
H. Security
Logical Design Decisions
- Zones of Trust?
- Defence-in-Depth?
- Multi-Vendor?
- Physical separation requirements?
- Compliance standards?
- Virtualisation security requirements?
- Required Network Security Capacity?
Physical Design Decisions
- Physical and Virtual Network Zoning?
- Application-level, Network-level Firewalls?
- IDS and IPS?
- SSL and IP-Sec VPNs?
- Unified Threat Management?
- Vendor selection?
- VMware vCNS/NSX-v required?
- Anti-Virus? Endpoint Protection?
- Network Security Performance?
- Security Information & Event Management (SIEM)?
- Public Key Infrastructure (PKI)?
- Nutanix Cluster security? STIG?
- ESXi host security?
- Network security?
- Storage security?
- Backup security?
- VM security?
- Future Expansion?
I. BC/DR
Logical Design Decisions
- Protection Mechanisms?
- Manual or Automated Run-books?
- RPO, RTO, WRT and MTD of Mission-Critical, Business-Critical and Non-Critical applications?
- Global Site Load Balancers?
- DNS TTL for clients?
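
When collecting RPO, RTO, WRT and MTD per application tier, it helps to check that the figures are consistent with each other. The sketch below uses the common convention that RTO plus WRT must not exceed MTD. The class name, field names and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    """Recovery targets for one application tier, all in hours."""
    name: str
    rpo_hours: float
    rto_hours: float
    wrt_hours: float
    mtd_hours: float

    def fits_mtd(self) -> bool:
        # Convention assumed here: recovery time plus work recovery time
        # must fit inside the maximum tolerable downtime.
        return self.rto_hours + self.wrt_hours <= self.mtd_hours


# Made-up example tiers.
tiers = [
    RecoveryTargets("Mission-Critical", rpo_hours=0, rto_hours=1,
                    wrt_hours=1, mtd_hours=4),
    RecoveryTargets("Business-Critical", rpo_hours=4, rto_hours=6,
                    wrt_hours=4, mtd_hours=8),
]
for tier in tiers:
    status = "OK" if tier.fits_mtd() else "RTO + WRT exceeds MTD"
    print(f"{tier.name}: {status}")
```

A tier that fails the check needs either a faster recovery mechanism or a renegotiated MTD with the business.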
Physical Design Decisions
- DR Automation solution?
- VMware SRM?
- GSLB solution?
- Internal and External DNS servers?
- Metro or Synchronous DR to remote clusters?
- Multi-Site Application, Database or Message Queue clustering/replication?