The capacity of all-flash arrays has been on the rise for several years. Many vendors now claim petabytes (PB) of capacity in just a few rack units. While these arrays promise to dramatically reduce the footprint of the data center, they also create a much larger failure domain. If one of these systems goes down, the outage affects more data and more applications. IT planners need to decide if the assured gains in data center efficiency outweigh the potential risks.
What is a Failure Domain?
A failure domain is the scope of impact if a system goes down or goes offline. In the case of a storage system, it is how many applications and how much data are exposed to the failure. The assumption is that all of these applications can be brought back online and the data can be recovered. A failure domain therefore describes the scope of the interruption during the time it takes to bring applications and servers back online, not permanent data loss.
What is the All-Flash Failure Domain Problem?
At the heart of the all-flash failure domain problem is the capacity of the flash module. Today, 16TB drives are not uncommon in all-flash arrays, and several flash media providers promised 50TB+ capacities before 2018. In addition, most all-flash vendors include capacity optimization technology in their systems that compresses and deduplicates data. Conservatively, these data efficiency techniques achieve a 4:1 efficiency ratio, making a 50TB drive act as if it stores 200TB.
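The arithmetic behind that 200TB figure can be sketched in a few lines of Python. The 50TB drive size and the 4:1 ratio come from the paragraph above; the drive count per array is a hypothetical value chosen only for illustration, and the sketch ignores RAID, sparing and other overhead.

```python
def effective_capacity_tb(raw_tb: float, efficiency_ratio: float) -> float:
    """Effective capacity after compression and deduplication."""
    return raw_tb * efficiency_ratio


# Figures from the article: a 50TB drive at a 4:1 efficiency ratio.
per_drive_tb = effective_capacity_tb(50, 4)

# Hypothetical assumption: 24 drives in one array (not a figure from the article).
DRIVES_PER_ARRAY = 24
array_effective_pb = per_drive_tb * DRIVES_PER_ARRAY / 1000  # 1 PB = 1000 TB

print(f"Effective capacity per drive: {per_drive_tb:.0f} TB")
print(f"Effective capacity per array: {array_effective_pb:.1f} PB")
```

Under these assumptions a single array already holds several effective petabytes, and all of that data sits inside one failure domain.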
Assuming all-flash vendors adopt the latest high-capacity drives and continue to leverage data efficiency features, data center professionals could store dozens, maybe hundreds, of PBs per rack. And remember, this capacity is all-flash, so performance will still be in the hundreds of thousands of IOPS.
Best practices suggest the IT administrator would not format all of the available capacity as one gigantic volume. Instead, they would carve it into smaller logical partitions. But the rack and the storage system are still a single point of failure. If the storage system itself, or the networking or power going to the rack, were to encounter a problem, an entire data center’s storage and applications could go down.
Dealing with the Failure Domain Problem
There are several ways that IT can address the failure domain problem. The first is to trust the technology. The reliability of all-flash arrays is very high, much higher than that of hard disk arrays. It is also easier to predict when a flash drive will fail. But these assurances are not 100% and don’t take into account fluke accidents or human error.
The second option is to trust the data protection solution. Make sure that data is being replicated and backed up, and that it can be recovered quickly enough to minimize any outage that may occur. Many modern backup solutions can return an application to service within 10 minutes or so after a failure. But this rapid recovery assumes either that a secondary storage system is available to act as primary storage for a while, or that the backup storage hardware is adequate to host applications.
The third option is to limit the size of the failure domain. To do so, IT planners would not take the flash array to its full potential capacity. Capacity can be limited by using either fewer drives or smaller-capacity drives. If the flash array is scale-out in design, it means limiting the number of nodes and creating multiple clusters. With this approach, applications are split between the multiple systems or clusters, ideally spreading out the mission-critical workloads, so a failure of one system does not impact all the mission-critical applications.
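As a rough illustration of spreading mission-critical workloads, the Python sketch below assigns applications to clusters round-robin, placing the critical ones first so they end up evenly distributed. The application names, cluster names and the round-robin policy are all hypothetical; a real placement decision would also weigh capacity, performance and application dependencies.

```python
from collections import defaultdict


def spread_workloads(apps, cluster_names):
    """Assign apps to clusters round-robin, critical apps first."""
    # sorted() is stable, so critical apps keep their relative order
    # and land ahead of the non-critical ones.
    ordered = sorted(apps, key=lambda app: not app["critical"])
    placement = defaultdict(list)
    for i, app in enumerate(ordered):
        placement[cluster_names[i % len(cluster_names)]].append(app["name"])
    return dict(placement)


def critical_exposure(placement, apps):
    """Count the critical apps that go offline if each cluster fails."""
    critical = {app["name"] for app in apps if app["critical"]}
    return {
        cluster: sum(1 for name in names if name in critical)
        for cluster, names in placement.items()
    }


# Hypothetical inventory, for illustration only.
apps = [
    {"name": "erp", "critical": True},
    {"name": "wiki", "critical": False},
    {"name": "billing", "critical": True},
    {"name": "test-env", "critical": False},
    {"name": "crm", "critical": True},
    {"name": "email", "critical": True},
]

placement = spread_workloads(apps, ["cluster-a", "cluster-b"])
exposure = critical_exposure(placement, apps)
print(placement)
print(exposure)
```

With a single large system, one failure would take down all four critical applications in this inventory; split across two clusters, any single failure takes down two of them.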
Any of these options is viable, depending on the situation. So far, flash arrays have proven very trustworthy, and data protection solutions are improving, especially in their ability to recover rapidly. The ultimate assurance is to limit the capacity per system, or per node in a scale-out design.
If the capacity of the storage system is a concern to the organization, then the best approach is to start small, limiting the system size, and then allow the system to grow as confidence in the solution grows. At the same time, IT should continue to invest in data protection solutions that enable rapid recovery from a disaster.