This is Part 16 of the Nutanix XCP Deep-Dive, covering the common risks and pitfalls associated with moving to the Hyper-Converged Infrastructure platform of any vendor.
This will be a multi-part series, describing how to design, install, configure and troubleshoot an advanced Nutanix XCP solution from start to finish for vSphere, AHV and Hyper-V deployments:
- Nutanix XCP Deep-Dive – Part 1 – Overview
- Nutanix XCP Deep-Dive – Part 2 – Hardware Architecture
- Nutanix XCP Deep-Dive – Part 3 – Platform Installation
- Nutanix XCP Deep-Dive – Part 4 – Building a Nutanix SE Toolkit
- Nutanix XCP Deep-Dive – Part 5 – Installing ESXi Manually with Phoenix
- Nutanix XCP Deep-Dive – Part 6 – Installing ESXi with Foundation
- Nutanix XCP Deep-Dive – Part 7 – Installing AHV Manually
- Nutanix XCP Deep-Dive – Part 8 – Installing AHV with Foundation
- Nutanix XCP Deep-Dive – Part 9 – Installing Hyper-V Manually with Phoenix
- Nutanix XCP Deep-Dive – Part 10 – Installing Hyper-V with Foundation
- Nutanix XCP Deep-Dive – Part 11 – Benchmark Performance Testing
- Nutanix XCP Deep-Dive – Part 12 – ESXi Design Considerations
- Nutanix XCP Deep-Dive – Part 13 – AHV Design Considerations
- Nutanix XCP Deep-Dive – Part 14 – Hyper-V Design Considerations
- Nutanix XCP Deep-Dive – Part 15 – Data Center Facility Design Considerations
- Nutanix XCP Deep-Dive – Part 16 – The Risks
- Nutanix XCP Deep-Dive – Part 17 – CVM Autopathing with ESXi
- Nutanix XCP Deep-Dive – Part 18 – more to come as the series evolves (Cloud Connect to AWS and Azure, Prism Central, APIs, Metro, DR, etc.)
In my opinion, Hyper-Converged Infrastructure (HCI) is the future of Private and Hybrid Cloud infrastructure. If you are designing a greenfield data center and HCI is not on your list for serious consideration, then it should be.
With that being said, every technology has its pros and cons; nobody rides for free. The advantages of HCI are many and outweigh the disadvantages, but here is what you need to watch out for:
- People – If your server virtualisation, network and storage teams operate in silos and are constantly at war with each other, then HCI is probably not for you at this time. Successful HCI projects are built upon a very close collaboration between these teams. In fact, it makes more sense to merge these three teams into one “Enterprise Infrastructure” team. It is also very important to cross-skill these team members and let them evolve into “Enterprise Architects”, “Enterprise Administrators” and “Enterprise Operators”. However, make sure you keep your Backup/Recovery/Archive responsibilities separate (see RBAC point below).
- Data Center Facilities – A data center full of legacy, 3-tier infrastructure is not the same as one packed with HCI. The resource density ratio is around 4-8 to 1 depending upon your current legacy infrastructure. You need to design for 25+kW racks with a matching cooling system. If you use a traditional, legacy data center (designed for 5-8kW per rack), then you will have problems down the road (hot spots and power exhaustion).
- Switch Fabric – By moving to HCI, you need a scalable LAN fabric that provides non-blocking throughput for East-West traffic. Legacy network switch design (Core – Distribution – Aggregation – Access layers) is optimised for North-South traffic and is not going to cut it for large scale HCI. You may get away with it initially; however, you will need plans to migrate to a non-blocking leaf and spine switched LAN. HCI has made Fibre Channel infrastructure obsolete, but the same principles that drove SAN design now apply to your LAN with the move to IP storage.
- Controller VM – The storage processor of legacy storage arrays has now become a virtual appliance running on the host itself. Make sure your administration/operations staff, Standard Operating Procedures and monitoring systems understand its importance and give it the respect it deserves. The current version of NOS with ESXi still allows vSphere administrators to modify the CVM (Nutanix Acropolis does not allow this for CVMs with Nutanix KVM). For example, an untrained vSphere Administrator could power off all Nutanix Controller VMs and reduce their RAM from 24GB to 8GB to free up resources for new VMs, taking storage offline across the entire cluster.
- Role Based Access Control – When I consider failure scenarios for Business Continuity and Disaster Recovery, my nightmare risk is not a natural disaster, but the disgruntled rogue administrator who has all of the keys to the kingdom and takes out every system. With HCI and the “Enterprise Administrator”, this risk is compounded. So it is very important to separate the administration/operations responsibilities for operational data from those for backup/recovery/archive. This way, if either one is wiped out across all data centers, you still have the other to recover from. Apply this concept to physical data center security as well.
- Data Locality and the Working Set – “Data Locality” is the amount of local storage resources (capacity and performance) presented via the Controller VM to the Hypervisor for serving your virtual workloads. The “Working Set” is the active footprint (capacity and performance) of those virtual workloads. As an organisation (architects, administrators and operators), you need to make sure that the “Working Set” of your virtual machines fits well within the “Data Locality” of each node in your HCI solution. Nutanix XCP comes in many different models, so make sure you select the one that fits your needs.
- Processes and Procedures – Moving from legacy, 3-tier infrastructure to HCI is a big change, so do not underestimate or ignore the imperative to update all of your processes and procedures. HCI will simplify and improve your infrastructure, consequently simplifying your operational procedures, but you will need to change how you do things with respect to people, process and technology.
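
To make the Data Center Facilities and Switch Fabric points above a little more concrete, here is a rough sanity-check sketch in Python. It is only back-of-the-envelope arithmetic, not a sizing tool: the class and function names are made up for illustration, and the per-node wattage and port counts in the usage lines are placeholder assumptions that you should replace with figures from your own hardware specifications.

```python
from dataclasses import dataclass


@dataclass
class RackPlan:
    """Hypothetical helper: compares planned HCI power draw with the rack budget."""
    nodes: int              # HCI nodes planned for the rack
    watts_per_node: float   # ASSUMED peak draw per node; take it from your spec sheet
    rack_budget_kw: float   # power and cooling the facility can deliver per rack

    def demand_kw(self) -> float:
        # Total peak demand of the rack in kW
        return self.nodes * self.watts_per_node / 1000.0

    def fits(self) -> bool:
        # True when the facility can power and cool the planned rack
        return self.demand_kw() <= self.rack_budget_kw


def oversubscription(downlinks: int, downlink_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Leaf switch oversubscription: host-facing bandwidth / spine-facing bandwidth.

    A ratio of 1.0 or lower means the leaf is non-blocking for East-West traffic.
    """
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# Placeholder numbers, for illustration only
legacy_rack = RackPlan(nodes=32, watts_per_node=500, rack_budget_kw=8)
hci_rack = RackPlan(nodes=32, watts_per_node=500, rack_budget_kw=25)

print(legacy_rack.demand_kw(), legacy_rack.fits())  # demand vs. a legacy 8kW rack
print(hci_rack.demand_kw(), hci_rack.fits())        # demand vs. a 25kW rack
print(oversubscription(48, 10, 6, 40))              # 48 x 10GbE down, 6 x 40GbE up
```

With these placeholder figures, the same 32 nodes overrun a legacy 8kW rack but fit a 25kW one, and a leaf with 48 x 10GbE host ports and 6 x 40GbE uplinks is oversubscribed 2 to 1, so it is not non-blocking.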