Data-intensive workloads like Elastic, Hadoop, Kafka and TensorFlow are unpredictable, making it very difficult to design flexible storage architectures to support them. In most cases, scale-out architectures use direct attached storage (DAS). While DAS delivers excellent performance to the workloads running on the server in which it is installed, the lack of a shared resource limits efficiency and eliminates any theoretical advantage it may have had. A traditional enterprise shared storage architecture, though, introduces high cost, complexity and latency that impact the most critical element of the data-intensive workload: performance.
Scale-Out: Clustered but Siloed
Most hyperscale applications exist in a cluster, but the nodes within the cluster don’t typically share data, at least not in real time; most of the time the hyperscale application replicates data across “n” nodes. The primary purpose of that replication is data protection, not workload mobility. Replication is a storage resource challenge, but the architecture is an even bigger problem for efficient CPU use.
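To make the siloing concrete, here is a minimal sketch in Python, not tied to any particular product, of the write-time replica placement this class of application typically performs. The node names, block IDs and replication factor of three are assumptions for illustration; the point is that each block’s replica set is fixed when it is written and exists for protection, not to make the data a shared, real-time resource.

```python
# Minimal sketch (hypothetical, not any specific product's implementation) of
# how a hyperscale application with DAS typically lays out data: each block is
# written to a fixed set of "n" replica nodes chosen at write time. The
# replica set protects against node loss; it does not make the data a shared,
# real-time resource for other nodes or workloads.

import random

REPLICATION_FACTOR = 3          # "n" copies; three is a common default
NODES = [f"node-{i}" for i in range(1, 9)]   # assumed eight-node cluster

def place_block(block_id: str) -> list[str]:
    """Pick a fixed replica set for a block at write time."""
    return random.sample(NODES, REPLICATION_FACTOR)

if __name__ == "__main__":
    placement = {f"block-{b}": place_block(f"block-{b}") for b in range(5)}
    for block, replicas in placement.items():
        print(block, "->", replicas)

    # Raw capacity consumed is a multiple of the logical data size: a 100 TB
    # data set needs roughly 300 TB of drive capacity spread across the nodes.
    logical_tb = 100
    print("raw capacity required:", logical_tb * REPLICATION_FACTOR, "TB")
```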
Computing power in hyperscale systems that use a DAS approach is locked to the storage attached to each node. Nodes are typically deployed with all drive bays filled with whatever is deemed the most appropriate storage technology at the time. This “lock-in” means that if the workload needs more compute performance, it can’t be added quickly without also adding storage that isn’t needed. The inability to add, remove and re-use nodes easily is becoming a greater concern as GPUs become more prevalent. While GPUs can significantly reduce processing time, they are much more expensive than typical CPUs. They need to be globally available so that multiple workloads can be moved to them, or better still, so that the GPUs can be brought to the data. The goal should be to bring CPU and GPU resources to the data so that data movement is minimized.
Counting on internal node storage also means that the capacity available to each application is limited. Organizations are forced to purchase larger-than-required servers to meet capacity demands, or to add nodes to the cluster and divide processing up even further. In either case, the data center floor space required for the scale-out cluster is significantly greater than if a common pool of storage could be established.
Using a DAS approach typically forces the organization to overprovision by a factor of three: each node must be configured with maximum compute, storage performance and storage capacity, instead of being allocated only the resources it actually needs when it needs them. Locking these resources together also means that upgrading any one of them forces an unnecessary upgrade of the other aspects of the node.
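A back-of-the-envelope sketch, again in Python and using purely hypothetical node counts and resource figures, shows why sizing every node for its own peaks inflates the bill compared with sizing a shared pool for the aggregate:

```python
# Illustrative arithmetic only, with assumed (hypothetical) numbers: compare
# sizing every DAS node for its peak compute, storage performance and capacity
# versus sizing a shared pool for the aggregate peaks of the workloads.

NODES = 12

# Per-node peak requirements if each resource must be provisioned locally.
per_node_peak = {"cores": 32, "iops": 200_000, "capacity_tb": 40}

# Aggregate peaks if the resources could be pooled and shared on demand.
# Peaks rarely line up across workloads, so the pool can be much smaller.
pooled_peak = {"cores": 256, "iops": 1_200_000, "capacity_tb": 220}

for resource, per_node in per_node_peak.items():
    das_total = per_node * NODES
    pooled = pooled_peak[resource]
    print(f"{resource:12s} DAS total: {das_total:>10,}  "
          f"pooled: {pooled:>10,}  overprovision: {das_total / pooled:.1f}x")
```

Because peak demands rarely coincide across workloads, a shared pool can be sized well below the sum of per-node maximums, and it avoids locking compute, performance and capacity upgrades to one another.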
Given the high performance of flash-based storage technology, most organizations will upgrade processor technology multiple times before needing to upgrade storage performance. The problem is that each time processors are upgraded, the DAS approach all but forces an upgrade of the drives as well.
Another challenge with the DAS architecture is the lack of rapid recovery from hardware failure. While most hyperscale software builds in replication, the replication targets are preset and require additional storage capacity and network bandwidth. If there is a node hardware failure, the application has to be restarted on another server and then reattach, across the network, to a surviving copy of its data.
If the failure is storage-related, which is more common, the workload still has to be moved to another predesignated replication partner even though the compute and networking components of the node are 100% operational. The same decisions about where each workload will run still have to be made.
The answer to these challenges is some form of shared storage. The problem is that traditional shared storage systems add significant latency and cost, and they don’t bring the resources to the data. Shared storage also doesn’t provide the orchestration needed to move nodes, or to point them at different workload data sets, based on need. In our next blog, Storage Switzerland will dive deeper into the challenges of using traditional shared storage infrastructure for hyperscale architectures.
In the meantime, register for our live 15-minute webinar “Composing Infrastructure for Elastic, Hadoop, Kafka and Splunk” and receive a copy of Storage Switzerland’s eBook “Is NVMe-oF Enough to Fix the Hyperscale Problem?”