One of the riskiest claims a vendor can make is “we’ve eliminated the need for backup,” and hyperconverged infrastructure (HCI) vendors make it frequently. Good data resiliency is not the same as good backup. In fact, some of the work vendors do to support this claim actually works against what data protection is meant to accomplish.
How Can HCI Solutions Claim Self-Protection?
HCI solutions are built on a group of servers, typically clustered together via a hypervisor. A software-defined storage (SDS) solution runs within that cluster. Part of the storage software’s responsibility is to distribute data across the cluster for resilience and mobility.
Today, most HCI solutions achieve this resiliency through a technique called erasure coding (EC). Erasure coding splits data into segments as it is written and distributes those segments across the servers (nodes) in the cluster. As part of this process, the software computes parity segments and places them on one or more additional nodes. The number of parity segments roughly equals the number of nodes that can fail before the HCI solution suffers an outage or data loss.
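To make the parity idea concrete, here is a minimal Python sketch of the simplest possible erasure code: k data segments plus a single XOR parity segment, similar in spirit to RAID 5. This is an illustrative assumption, not any vendor’s implementation; production storage software can use more general codes, such as Reed-Solomon, that support several parity segments. The functions `encode`, `rebuild`, and `decode` are hypothetical. With one parity segment, the stripe survives exactly one lost node, which matches the rule of thumb in the paragraph above.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal data segments and append one XOR parity segment.

    Each of the k + 1 segments would live on a different node.
    """
    seg_len = -(-len(data) // k)             # ceiling division
    padded = data.ljust(seg_len * k, b"\0")  # pad so all segments are equal length
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    return segments + [reduce(xor_bytes, segments)]


def rebuild(segments: list[bytes | None]) -> list[bytes]:
    """Recover at most one missing segment (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(segments) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost segment")
    repaired = list(segments)
    if missing:
        repaired[missing[0]] = reduce(xor_bytes, [s for s in segments if s is not None])
    return repaired


def decode(segments: list[bytes], original_len: int) -> bytes:
    """Join the data segments (everything but the trailing parity) and drop padding."""
    return b"".join(segments[:-1])[:original_len]


data = b"virtual machine disk block"
stripe = encode(data, k=4)  # 4 data segments + 1 parity segment = 5 nodes
stripe[2] = None            # one node fails
print(decode(rebuild(stripe), len(data)))  # b'virtual machine disk block'
```

Losing a second node in the same stripe raises `ValueError`, which is the single-parity equivalent of the outage or data loss described above.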
At this point in the protection process, no HCI vendor will claim to have eliminated the need for backup, or at least none should. What gives HCI vendors the guts to make the “eliminate backup” claim is what comes next. The storage software behind most HCI solutions can take a nearly unlimited number of snapshots, which gives the solution a basic point-in-time recovery capability. In theory, the customer can roll a virtual machine back to any point in time for which a snapshot exists. But even with all these snapshots, the solution is still vulnerable to a disaster that affects the entire system or site.
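The point-in-time capability, and its limits, can be sketched with a toy copy-on-write model: taking a snapshot copies nothing, and a block’s old contents are preserved only when that block is first overwritten after the snapshot. The `Volume` class is a hypothetical illustration, not any vendor’s snapshot design; real systems may use redirect-on-write or other schemes. Note that reading a snapshot still depends on the live volume for every block that has not changed.

```python
from __future__ import annotations


class Volume:
    """Toy block volume with copy-on-write snapshots (hypothetical, not a vendor design)."""

    def __init__(self, blocks: dict[int, bytes]):
        self.blocks = dict(blocks)  # live data: block number -> contents
        # One dict per snapshot, holding only the pre-overwrite contents of blocks
        # changed after that snapshot (None = block did not exist yet).
        self.snapshots: list[dict[int, bytes | None]] = []

    def snapshot(self) -> int:
        """Taking a snapshot copies nothing; it just starts a new preservation map."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block: int, data: bytes) -> None:
        # Copy-on-write: preserve the old contents the first time a block
        # is overwritten after the most recent snapshot.
        if self.snapshots and block not in self.snapshots[-1]:
            self.snapshots[-1][block] = self.blocks.get(block)
        self.blocks[block] = data

    def read_snapshot(self, snap_id: int) -> dict[int, bytes]:
        """Rebuild the point-in-time view from the live volume plus preserved blocks."""
        view = dict(self.blocks)
        # Newest to oldest, so the oldest preserved copy (closest to snap_id) wins.
        for preserved in reversed(self.snapshots[snap_id:]):
            for block, old in preserved.items():
                if old is None:
                    view.pop(block, None)
                else:
                    view[block] = old
        return view

    def rollback(self, snap_id: int) -> None:
        """Roll the live volume back to a snapshot and discard newer snapshots."""
        self.blocks = self.read_snapshot(snap_id)
        del self.snapshots[snap_id + 1:]
        self.snapshots[snap_id] = {}


vol = Volume({0: b"boot", 1: b"app-v1", 2: b"data-v1"})
s0 = vol.snapshot()
vol.write(1, b"app-v2")
s1 = vol.snapshot()
vol.write(2, b"data-v2")
print(vol.read_snapshot(s0))  # {0: b'boot', 1: b'app-v1', 2: b'data-v1'}

# Lose the live volume, standing in for a system or site disaster:
vol.blocks.clear()
print(vol.read_snapshot(s0))  # {2: b'data-v1', 1: b'app-v1'}; block 0 is gone
```

Only the blocks that copy-on-write happened to preserve survive the loss of the live volume, which is why snapshots on the same system do not protect against a disaster that takes out that system.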
The second HCI confidence booster is asynchronous replication of those snapshots. The HCI solution collates its snapshots and periodically sends the changes (the delta) between snapshots to another location. In many cases, the solution can also keep a separate, independent set of snapshots of the data once it lands at that location. At this point, many HCI solutions consider the data protected and make their claim of “eliminating backup.”
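Sending only the delta between snapshots can be sketched as a block-level comparison of two snapshot views, with just the changed and deleted blocks shipped to the remote copy. The functions `snapshot_delta` and `apply_delta` and the sample snapshots are hypothetical; real products typically track changed blocks as they are written rather than comparing full snapshots, and they add transport, ordering, and consistency handling that this sketch omits.

```python
from __future__ import annotations


def snapshot_delta(older: dict[int, bytes], newer: dict[int, bytes]) -> dict[int, bytes | None]:
    """Blocks that changed between two snapshots; None marks a block deleted in `newer`."""
    delta: dict[int, bytes | None] = {
        block: data for block, data in newer.items() if older.get(block) != data
    }
    for block in older.keys() - newer.keys():
        delta[block] = None
    return delta


def apply_delta(replica: dict[int, bytes], delta: dict[int, bytes | None]) -> None:
    """Bring the remote copy forward by applying one shipped delta."""
    for block, data in delta.items():
        if data is None:
            replica.pop(block, None)
        else:
            replica[block] = data


# Two point-in-time snapshots on the primary cluster.
snap_t0 = {0: b"boot", 1: b"app-v1", 2: b"data-v1"}
snap_t1 = {0: b"boot", 1: b"app-v2", 3: b"log"}

replica = dict(snap_t0)                   # remote site already holds snap_t0
delta = snapshot_delta(snap_t0, snap_t1)  # unchanged block 0 is not sent
apply_delta(replica, delta)
print(delta)  # {1: b'app-v2', 3: b'log', 2: None}
```

Because the replica is built entirely from deltas produced by the primary’s own storage software, it is a second copy made by the same software stack rather than an independent one.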
Some HCI solutions go a step further and claim they can create “zero capacity impact” backups on the cluster. The customer does make a copy of the data, but because the cluster deduplicates that copy against the original, it requires almost no additional storage space. In some ways this copy is better than a snapshot, since it does not depend on the primary data volume the way a snapshot does. However, the deduplicated copy depends entirely on the deduplication metadata staying intact, which leaves it with a vulnerability of its own.
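Why a deduplicated copy costs almost no capacity can be shown with a minimal content-addressed store: data is split into chunks, each unique chunk is stored once under its SHA-256 fingerprint, and each object is described by a recipe (its metadata) listing those fingerprints in order. The `DedupStore` class and the 4-byte chunk size are hypothetical choices for readability; real systems use much larger chunks, often of variable size.

```python
from __future__ import annotations

import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny so the example stays readable


class DedupStore:
    """Minimal content-addressed store: unique chunks plus per-object recipes."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}       # fingerprint -> chunk data
        self.recipes: dict[str, list[str]] = {}  # object name -> ordered fingerprints (metadata)

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            piece = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(fp, piece)  # each unique chunk is stored once
            recipe.append(fp)
        self.recipes[name] = recipe

    def copy(self, src: str, dst: str) -> None:
        """A deduplicated "copy" duplicates only the recipe, not the chunk data."""
        self.recipes[dst] = list(self.recipes[src])

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
store.put("vm1.vmdk", b"ABCDABCDEFGH")
before = store.stored_bytes()
store.copy("vm1.vmdk", "vm1-backup.vmdk")
print(before, store.stored_bytes())  # 8 8
```

The “backup” adds only a new recipe pointing at the same chunks, so the original and the copy share the same physical data and rely on the same metadata.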
Cluster + Erasure Coding + Snapshots + Replication = Backup?
HCI solutions certainly deliver a great deal of redundancy and data resiliency, but none of the protection steps they offer provides a true backup. A true backup is a secondary copy of data on completely separate storage infrastructure, preferably managed by different storage software and stored on a different type of media.
The Problems with HCI “Backup”
The first problem with “HCI backup” is that it protects only data within the HCI environment. Most data centers are not “all in” on HCI. To date, most HCI installations target specific projects, so in reality most of the data center sits outside the HCI environment.
The second problem with HCI backup is that it depends completely on the storage software; if the software has a problem, data is exposed. Deduplication is a very real example. Deduplication identifies redundant segments of data and stores each unique segment only once, which requires a fairly sophisticated metadata table to track which segments make up each piece of data. If that table becomes corrupted or fails and there is no separate copy of it, the data it describes is most likely lost and will have to be recreated.
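The consequence of losing that metadata table can be sketched with plain dictionaries: the unique chunks are still on disk, but without the table nothing records which chunks, in which order, make up each object. The names `chunks`, `metadata`, `restore`, and `table_copy` are hypothetical; `table_copy` stands in for the separate copy of the table whose absence the paragraph above warns about.

```python
import copy
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


# Unique chunks on disk, keyed by fingerprint.
chunks = {fingerprint(c): c for c in (b"ABCD", b"EFGH")}
# The metadata table: which chunks, in which order, make up each object.
metadata = {"vm1.vmdk": [fingerprint(b"ABCD"), fingerprint(b"ABCD"), fingerprint(b"EFGH")]}
# A separate copy of the table (hypothetical safeguard).
table_copy = copy.deepcopy(metadata)


def restore(name: str, table: dict) -> bytes:
    """Reassemble an object from its recipe in the given metadata table."""
    return b"".join(chunks[fp] for fp in table[name])


metadata.clear()  # simulate the primary table corrupting or failing

try:
    restore("vm1.vmdk", metadata)
except KeyError:
    # Every chunk is still on disk, but nothing records how to reassemble the file.
    print("vm1.vmdk cannot be reassembled from the primary table")

print(restore("vm1.vmdk", table_copy))  # b'ABCDABCDEFGH'
```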
The third problem is not unique to HCI; it applies to any solution or organization that counts on snapshots as its primary point of recovery. Most snapshots appear to the system as just another copy of a volume. The software that takes the snapshots provides no way to find specific data within a snapshot, and certainly no way to search for data across snapshots. This limitation becomes a real problem when, for example, an administrator is looking for a specific version of a specific file that may be in any one of hundreds or thousands of snapshots.
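The difference between snapshots and a searchable backup can be illustrated by contrasting two ways to find a specific version of a file: scanning every snapshot one by one, or querying a catalog that maps (path, content hash) to the snapshots containing that version. Backup applications commonly maintain such a catalog as they protect data. The sample data, `scan_for_version`, and `build_catalog` are hypothetical, and modeling snapshots as file-level dictionaries is a simplification, since real snapshots are block-level volume images that must be mounted before any file can be inspected.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Hypothetical snapshots: each maps file path -> file contents at that point in time.
snapshots = [
    {"/etc/app.conf": b"port=80", "/data/report.xlsx": b"q1"},
    {"/etc/app.conf": b"port=8080", "/data/report.xlsx": b"q1"},
    {"/etc/app.conf": b"port=8080", "/data/report.xlsx": b"q2"},
]


def scan_for_version(path: str, wanted: str) -> list[int]:
    """Snapshot-only approach: open every snapshot and compare the file's hash."""
    return [i for i, snap in enumerate(snapshots)
            if path in snap and digest(snap[path]) == wanted]


def build_catalog() -> dict[tuple[str, str], list[int]]:
    """Catalog approach: index (path, content hash) -> snapshot ids once, query many times."""
    catalog: dict[tuple[str, str], list[int]] = defaultdict(list)
    for i, snap in enumerate(snapshots):
        for path, data in snap.items():
            catalog[(path, digest(data))].append(i)
    return catalog


wanted = digest(b"port=8080")
print(scan_for_version("/etc/app.conf", wanted))   # [1, 2]
print(build_catalog()[("/etc/app.conf", wanted)])  # [1, 2]
```

Both lookups return the same answer, but the scan has to open every snapshot for every query, while a catalog built at protection time answers each query with a single lookup.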
A final problem with HCI as backup is that it requires the customer to stand up an HCI environment at both the primary data center and the secondary/DR site. Customers may prefer to replicate data to a central repository at the DR site and instantiate only certain applications during a disaster. And since most organizations are not fully committed to HCI, standing up a second cluster to support applications that likely will not be among the first recovered in a disaster is a waste of DR budget.
In the end, enterprise backup still makes the most sense for the majority of customers deciding how to protect their HCI environments. An organization that runs 100% on a single HCI infrastructure may get away with using HCI as backup, but most organizations will find the lack of copy diversity, the inability to find data, and the additional cost at the DR site too problematic to consider this option seriously.
Instead, customers should look for enterprise backup solutions that directly support their HCI platform. They can then use a single backup solution to protect the entire data center, rather than running a different data protection application for each unique environment.