A hyperconverged infrastructure (HCI) is made up of servers, storage media, storage software, and networking software and hardware. Of these components, the servers are often the most overlooked. The assumption is that a server is a server. In fact, many HCI vendors are proud to state that they use commodity, white-box servers. The truth is that using generic server technology may actually cost the customer more money than investing in quality server nodes.
The Role of a Node
A server node provides the computing power to the virtual machines that the node hosts. It also provides computing power to the storage software, which, depending on its features and the level of IO intensity, can consume a significant amount of the processor’s resources. In addition to computing power, the server node houses the storage capacity that it contributes to the HCI cluster. It also provides two forms of network connectivity. First, it connects the virtual machines to the network so they can access, or be available to, the rest of the data center. Second, it uses a network connection to distribute data to other nodes in the cluster.
As the environment scales and more workloads are added to the HCI cluster, each node’s resources are consumed until they are exhausted, forcing expansion. The typical HCI vendor’s answer to running out of resources is to just add another node, exploiting the scale-out nature of HCI.
The Problem with “just add a node” Thinking
While the “just add a node” philosophy sounds easy enough, there are several problems with it. First, there is the physical problem of fitting the new node into a space-constrained data center. In most cases, other servers need to be moved and network connections re-routed. Second, there is the problem of rebalancing the cluster’s compute and storage responsibilities. Administrators must decide which virtual machines to move to the newly available node and must also migrate the data for those virtual machines to the new node. While an erasure coding storage strategy removes the responsibility of data movement from IT, it adds the possibility of an IO storm while data is rebalanced between the nodes in the cluster.
Third, the node’s resources are typically locked in, meaning the customer gets a prescribed amount of storage and compute capacity per node. The existing nodes are typically limited in both compute and storage capacity. If the current cluster needs only more capacity or only more compute, the customer typically can’t upgrade a node in place and must add a complete node instead. Eventually the customer ends up with an HCI cluster with either too much compute power or too much storage capacity.
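The imbalance described above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming every node in the cluster has the same fixed number of cores and terabytes; the function name `size_cluster` and all of the figures are hypothetical and chosen only for illustration, not taken from any vendor's specifications.

```python
import math


def size_cluster(cores_needed, tb_needed, cores_per_node, tb_per_node):
    """Size a cluster of identical, fixed-configuration nodes.

    Returns (nodes, idle_cores, idle_tb). All inputs are illustrative
    assumptions, not real product specifications.
    """
    # Each resource on its own would require this many nodes.
    nodes_for_compute = math.ceil(cores_needed / cores_per_node)
    nodes_for_storage = math.ceil(tb_needed / tb_per_node)

    # With locked-in nodes, the scarcer resource dictates the node count.
    nodes = max(nodes_for_compute, nodes_for_storage)

    # Whatever the other resource does not use is paid for but left idle.
    idle_cores = nodes * cores_per_node - cores_needed
    idle_tb = nodes * tb_per_node - tb_needed
    return nodes, idle_cores, idle_tb


# A storage-heavy environment: 96 cores and 400 TB on 32-core, 20 TB nodes.
nodes, idle_cores, idle_tb = size_cluster(96, 400, 32, 20)
print(f"{nodes} nodes, {idle_cores} idle cores, {idle_tb} idle TB")
```

In this hypothetical case three nodes would satisfy the compute requirement, but capacity forces the cluster to 20 nodes, leaving 544 cores unused. A node that could add capacity in place would avoid most of those purchases.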
The Node Matters
In HCI, especially as it scales and adds more storage-intensive workloads, the node hardware matters. IT should look for several capabilities in their node hardware. The first is plenty of CPU performance to drive both the VMs and the storage software. Second, the node should scale capacity internally as well as externally so that each node can be used to its full potential. The goal of the HCI hardware should be to “just add a node” only when all of the existing nodes reach maximum capacity.
Lastly, and most importantly, the node should deliver maximum performance. Today, delivering maximum performance means using NVMe flash drives. The use of these drives in an HCI cluster means that each node can support a larger number and a greater variety of workloads. It also allows for the use of higher-density flash drives.
First-generation HCI targets specific use cases like VDI. It also uses rudimentary servers that run out of resources quickly, forcing IT to add nodes more frequently. And since a virtual machine must run intact on a single node, adding nodes for additional CPU power only works if there are other virtual machines to off-load to the new node.
IT professionals should look for server nodes that offer plenty of computing power along with high capacity and high storage performance. An NVMe-based system enables IT to stack many more VMs in a given node and to handle a more diverse workload set on that node without concern about storage IO bottlenecks.
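The single-node constraint on virtual machines can be shown with a small placement check. This is a simplified sketch that considers only free CPU cores; the node names, core counts, and the helper `first_node_that_fits` are hypothetical, and a real hypervisor scheduler weighs many more factors.

```python
def first_node_that_fits(vm_cores, free_cores_by_node):
    """Return the name of the first node with enough free cores, else None.

    A VM cannot be split across nodes, so free cores are never summed.
    """
    for name, free in free_cores_by_node.items():
        if free >= vm_cores:
            return name
    return None


# Hypothetical cluster: 15 cores are free in total, but no node has 8.
cluster = {"node-a": 6, "node-b": 4, "node-c": 5}
print(first_node_that_fits(8, cluster))

# Adding an empty 32-core node gives the 8-core VM somewhere to run.
cluster["node-d"] = 32
print(first_node_that_fits(8, cluster))

# A VM that needs more cores than any one node offers is not helped by
# adding more nodes of the same size.
print(first_node_that_fits(40, cluster))
```

The last case is where the per-node CPU capability matters: adding nodes does not help a virtual machine that has outgrown what a single node can supply.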