Hyperconverged Infrastructures (HCI) promise simplicity: simpler management and simpler scaling. One of the challenges, though, is that as an organization scales, it also starts to expect specific response times from specific applications. Priorities set in, and IT must assure the organization that the performance of a certain application will meet those expectations no matter what else is going on in the infrastructure. IT usually counts on Quality of Service (QoS) features to meet these expectations, but meaningful QoS is something that HCI has a hard time delivering.
The HCI Bottleneck – Compute Consumption
Virtualization was developed in part because CPU resources were underutilized. It enabled multiple applications to run, safely, on the same CPU. An HCI solution adds two major components to that virtualized environment: the storage software and the cluster manager.
The storage software provides all the features that IT has come to expect from storage solutions like media failure protection, snapshots, replication, deduplication and compression. Since each node in the hypervisor cluster will also be a node for the storage software, the software feature set of these nodes needs managing, which is the job of the HCI’s cluster manager. It makes sure the right data is on the right nodes to support both media failure protection and application mobility (migration of VMs between nodes).
While pre-virtualization servers had plenty of CPU resources available to them, post-virtualization servers (now nodes) do not. They are loaded down, running potentially dozens of virtual machines per node. Adding storage software and an additional layer of cluster management does not help matters. Consider also the increased expectations on the storage software. Functions like deduplication, compression and replication are all processor-intensive.
QoS Guarantee? Not that easy with HCI
Some HCI vendors promise QoS, but what they are really providing is prioritization. Instead of assuring that a particular application gets 20,000 IOPS, the software is only able to give the application a high-priority share of whatever IO performance is available at the time of its request. Application owners need greater specificity than this, especially considering that multiple applications may have high-priority needs. There is nothing within the HCI software that reserves IOPS performance for a specific application.
IT wants to be able to hard-allocate IO resources when it needs to. It wants and needs to be prepared to assure that application X will get Y level of performance no matter what else is happening in the cluster. HCI can only make sure the priority applications get a percentage of the performance that remains after all the hypervisor, storage software and cluster management processes have had their needs met. Essentially, HCI solutions, if they provide any QoS at all, provide only high, medium and low priority levels, not a specific IOPS assurance.
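The difference between prioritization and a hard allocation can be illustrated with a minimal sketch. The application names, weights and IOPS figures below are hypothetical, and the two functions are a simplified model of the two approaches rather than any vendor's actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    weight: int = 1         # hypothetical priority scale: high=3, medium=2, low=1
    reserved_iops: int = 0  # hard floor; 0 means no reservation


def priority_shares(apps, available_iops):
    """Priority-only QoS: split whatever is available in proportion to weight."""
    total_weight = sum(a.weight for a in apps)
    return {a.name: available_iops * a.weight // total_weight for a in apps}


def reserved_shares(apps, available_iops):
    """Reservation-based QoS: meet hard floors first, then split the rest by weight."""
    total_reserved = sum(a.reserved_iops for a in apps)
    if total_reserved > available_iops:
        # A real system would refuse the reservation at admission time.
        raise ValueError("reservations exceed available IOPS")
    leftover = available_iops - total_reserved
    total_weight = sum(a.weight for a in apps)
    return {
        a.name: a.reserved_iops + leftover * a.weight // total_weight
        for a in apps
    }


# Hypothetical workload mix: the database needs 20,000 IOPS.
apps = [
    App("db", weight=3, reserved_iops=20_000),
    App("web", weight=2),
    App("batch", weight=1),
]

# Assume only 30,000 IOPS remain after hypervisor, storage software
# and cluster management overhead on a busy day.
busy = 30_000
print(priority_shares(apps, busy)["db"])  # 15000: high priority, but below target
print(reserved_shares(apps, busy)["db"])  # 25000: the 20,000 floor is honored
```

With priority alone, the database's share shrinks whenever the pool shrinks. With a reservation, the floor holds and only the surplus is shared by priority.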
The final concern became apparent with the recent exposure of the Meltdown and Spectre CPU bugs. To protect the environment from these bugs, operating system and hypervisor vendors are releasing patches and updates to their core operating environments. Each of these updates impacts overall system performance and, specifically, some aspect of storage performance. Most hyperconverged architectures create a file system to aggregate the storage capacity from the various nodes in the hypervisor cluster. Each IO generates a syscall, and syscalls are directly impacted by the patches designed to protect against the two bugs.
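A rough back-of-the-envelope model shows why added per-syscall cost matters for storage. The figures below are hypothetical placeholders chosen for easy arithmetic, not measurements of any patch or platform, and the model assumes one syscall per IO with CPU time as the only limit.

```python
def iops_ceiling_per_core(cpu_us_per_io, extra_us_per_syscall=0.0):
    """Upper bound on IOPS one core can drive if each IO costs this much CPU time."""
    return 1_000_000 / (cpu_us_per_io + extra_us_per_syscall)


# Hypothetical: 20 microseconds of CPU per IO before patching,
# plus 2 microseconds of added overhead per syscall afterwards.
before = iops_ceiling_per_core(20.0)
after = iops_ceiling_per_core(20.0, extra_us_per_syscall=2.0)
loss_pct = (before - after) / before * 100

print(f"before: {before:,.0f} IOPS per core")
print(f"after:  {after:,.0f} IOPS per core")
print(f"loss:   {loss_pct:.1f}%")
```

Because the overhead is paid on every IO, even a small per-syscall penalty is multiplied across the whole storage workload, and it comes out of the same CPU budget the virtual machines are using.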
The HCI Performance Band-Aids
The reality is that hyperconverged architectures are a sophisticated set of interrelated software and hardware, and isolating these components to guarantee specific performance objectives is almost impossible. To work around these challenges, HCI vendors tend to suggest that customers buy far more CPU power than they actually need. Essentially, these vendors are telling customers to waste resources so that potential CPU resource conflicts will not impact them.
Another workaround is an all-flash HCI design. The problem is that this approach, while increasing the performance of all the applications it supports, also dramatically increases cost and is not necessarily the best platform for data retention. It is also overkill for the majority of data stored within the HCI design. Additionally, much of the performance of flash is lost in a hyperconverged design because data has to be written out across a very busy network, which adds latency.
The Answer to HCI Performance Challenges
The answer to HCI performance challenges can be found in the past: a separation of the storage and compute tiers, each getting its own dedicated processing power. HCI vendors claim dedicated tiers are too complicated. However, instead of throwing out an entire design because of complexity, maybe the answer is simply to simplify the components within it. In the two-tier architecture case, that means simplifying the storage systems within it.
Sponsored by Tintri