Disk is now the primary backup target in most data centers. But because backup retention spans weeks, months and years, the cost of undeduplicated disk is untenable. Data deduplication, which compares one backup to the next and keeps only the changes at the byte or block level, greatly reduces the amount of disk required. However, there is more to selecting a disk backup appliance with data deduplication than the data efficiency feature alone.
Because data deduplication is so compute intensive, a key consideration is how fast the device can ingest data, especially in environments that have a very large baseline data set and/or a high change rate. If data deduplication occurs during the backup process, performance suffers. Furthermore, if all data is stored only in deduplicated form, then the speed of restores and VM boots can be impacted due to a time-consuming data reassembly (or “rehydration”) process.
Few processes in the data center stress the infrastructure like backups, whether nightly incrementals or weekly fulls. Deduplication will not have the same benefit on the initial full backup as it will on future ones. But even after the initial full backup is complete, an enterprise will want to make periodic additional fulls and will, of course, add new data continuously. While the nightly backup data set has less changed data, the overwhelming majority of data centers continue to do full backups of databases and email application environments each night. Since that nightly backup window is much smaller than the weekend window, completing the nightly backup is a greater challenge than completing the weekend full backup. As a result, it is critical that IT professionals understand the speed at which new data can be ingested by the appliance and the resulting impact on the backup window.
Big Backup Math
In 2016, many backup appliance vendors introduced updated systems that have usable capacities approaching 1PB, before factoring in deduplication. This means a single appliance could support a 1PB weekly full backup and still have room to store some number of weekly, monthly and yearly backup jobs (incrementals, fulls) of the same data set. However, some of these systems only support ingest rates of around 30TB to 60TB per hour. For initial data loads, deduplication won’t help with performance, since most of the data is net new to the system. Even at the top rate of 60TB per hour, a 1PB initial load would take nearly 17 hours to complete, and more than 33 hours at 30TB per hour, which is unacceptable for most enterprises.
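Working the quoted figures through directly shows the scale of the problem. This quick sketch uses the 1PB capacity and 30TB to 60TB per hour rates cited above; network and protocol overhead are ignored, so real-world times would be longer still.

```python
# Sanity check on the ingest math above. The 1PB usable capacity and
# 30-60TB/hr ingest rates are the figures quoted in the text; overhead
# is ignored, so real-world times would be longer.

CAPACITY_TB = 1000  # roughly 1PB usable, before deduplication

def ingest_hours(capacity_tb: float, rate_tb_per_hour: float) -> float:
    """Hours needed to land an initial full backup at a given ingest rate."""
    return capacity_tb / rate_tb_per_hour

for rate in (30, 60):
    print(f"{rate} TB/hr: {ingest_hours(CAPACITY_TB, rate):.1f} hours")
```

Even the best case leaves little room in a typical weekend window once real-world overhead is added.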
There are two related reasons why ingest speeds, in some systems, are increasingly falling behind the total capacity of the system. First, many of these systems are designed with a “scale-up” architecture and simply do not have enough processor and memory resources to keep up with the inline data deduplication occurring between the backup application and the disk storage. Second, as the data grows, the backup appliance requires even more processor and memory to deduplicate the increased data load, yet it scales only by adding disk capacity. The system therefore has a fixed, limited amount of processor, memory, and networking resources with which it must analyze and compare an ever-growing data set, increasing the time it takes to deduplicate the data and resulting in an ever-expanding backup window.
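A toy model makes the scale-up squeeze concrete. All figures below (40TB per hour of fixed deduplication throughput, 30% annual growth, a 200TB starting set) are illustrative assumptions, not vendor data:

```python
# Toy model of the scale-up problem described above: deduplication
# throughput is fixed (no CPU/RAM upgrades), but the protected data set
# grows, so the backup window stretches every year. All figures are
# illustrative assumptions, not vendor measurements.

FIXED_DEDUP_TB_PER_HOUR = 40   # assumed constant appliance throughput
GROWTH_RATE = 0.30             # assumed 30% annual data growth

data_tb = 200.0                # assumed starting full-backup size
for year in range(1, 6):
    window = data_tb / FIXED_DEDUP_TB_PER_HOUR
    print(f"Year {year}: {data_tb:6.1f} TB full needs a {window:4.1f} hour window")
    data_tb *= 1 + GROWTH_RATE
```

Under these assumptions the window nearly triples in five years, with no change to the appliance itself.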
The Anatomy of Ingestion
For data to be properly ingested by the backup appliance, data has to be written to the appliance’s disk volume and that write has to be acknowledged back to the backup host. In between the point that data is sent to the appliance and the appliance acknowledging that the write has been completed, a whole series of hash table lookups occur and these impact the rate at which the appliance can ingest data.
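Those lookups can be sketched as follows. Fixed 4KB chunks, SHA-256 fingerprints, and an in-memory index are simplifying assumptions; real appliances use variable-size chunking and far larger, disk-backed indexes.

```python
import hashlib

# Minimal sketch of the lookups described above: every incoming chunk is
# fingerprinted and checked against an index before anything is stored.
# Fixed 4KB chunks and an in-memory dict are simplifying assumptions.

CHUNK_SIZE = 4096
chunk_store = {}  # fingerprint -> unique chunk data

def ingest(stream: bytes) -> list:
    """Store only never-before-seen chunks; return the backup's recipe."""
    recipe = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:   # the lookup that gates ingest speed
            chunk_store[fp] = chunk
        recipe.append(fp)
    return recipe

first = ingest(b"A" * 8192)                  # two identical chunks
second = ingest(b"A" * 4096 + b"B" * 4096)   # one duplicate, one new chunk
print(f"{len(first) + len(second)} chunks ingested, {len(chunk_store)} stored")
```

Every chunk pays for a fingerprint computation and an index lookup before the write can be acknowledged, which is why the index must be fast.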
Since most disk backup appliances are attached to an IP network, the first step is to send the data across the network to the appliance. SMB (and, to a lesser extent, NFS) is not designed for the types of transfers that are typical of a backup job.
Several backup software vendors, and some hardware vendors, have proprietary backup protocols to address this problem. The challenge is that these customized protocols are specific to that backup vendor and very often require that the disk backup hardware vendor adopt them. For the backup software vendor, it is another way to lock their customers into a solution. While these optimized protocols are important, their value becomes less critical as we move into 10GbE and faster Ethernet bandwidth speeds.
The second step in the ingestion process, and more impactful to the speed at which the backup can be completed, is what the disk backup appliance does with the data after it is received. In some cases, the appliance will deduplicate the data as it comes in. This is known as “inline” deduplication and may impact the data ingestion rate.
Other solutions store the data on disk and then send the acknowledgment back to the backup host before performing data optimization.
“Adaptive” deduplication performs deduplication and replication as the data commits to disk, but in parallel with the backup process, while “post-process” deduplication performs deduplication and replication once all the backups are complete. Neither approach should impact the data ingestion rate. Appliances that deduplicate data immediately after it lands on disk do require adequate CPU and RAM resources; in either case, the appliance always gives priority to the inbound backups.
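The difference between the approaches comes down to when the acknowledgment is sent relative to deduplication. A toy timeline, with arbitrary per-unit durations chosen only to show the ordering:

```python
# Toy timeline of when the write acknowledgment reaches the backup host
# under each approach described above. The per-unit durations are
# arbitrary assumptions chosen only to show the ordering, not benchmarks.

WRITE_COST, DEDUP_COST = 1.0, 3.0  # assumed seconds per unit of data

def ack_time(mode: str) -> float:
    """Seconds until the backup host receives its write acknowledgment."""
    if mode == "inline":               # dedup sits in the write path
        return DEDUP_COST + WRITE_COST
    if mode in ("adaptive", "post-process"):
        return WRITE_COST              # ack on landing; dedup runs after, or in parallel
    raise ValueError(mode)

for mode in ("inline", "adaptive", "post-process"):
    print(f"{mode:>12}: ack after {ack_time(mode):.1f}s")
```

Only the inline path puts the deduplication cost between the host and its acknowledgment, which is why it is the approach that can slow ingestion.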
Vendors that perform inline deduplication need to decide how to deal with the resulting impact on data ingestion rates, and their customers need to live with those decisions. Vendors can do four things to lessen – and potentially eliminate – the impact of inline deduplication on the ingest process. First, they can make sure that their deduplication code is optimized as much as possible. Second, they can use more powerful processors so the deduplication comparison checks can complete faster. Third, they can use extra DRAM in their appliances so that the deduplication metadata can be stored in DRAM, allowing for faster comparisons. Lastly, they can deploy software on the backup application media server to perform some deduplication tasks using the media server CPU. Each of these “investments” on the part of vendors leads to a higher price for the consumer.
The reality is that most backup hardware solutions are not upgradeable, meaning extra CPU and memory cannot be added even if the customer wanted to. The solution, at least as far as the vendor is concerned, is to buy the next, more powerful box in the product portfolio, known as a “forklift upgrade”, and this approach creates numerous problems. First, all of the investment in deduplication knowledge is reset: new backups have to be run and data must be compared from scratch. Second, the old box has to be either retired or repurposed.
Deduplication Is Now a Feature
All the extra effort that an inline deduplication appliance requires to maintain an acceptable ingestion rate has to be called into question against the backdrop of modern data protection applications. When disk-based backup appliances first came to market, their primary (and in many cases only) feature was the ability to efficiently store data through capabilities like deduplication and compression. Legacy backup processes typically performed a full backup on a regular basis (weekly or monthly). Backup appliances, most of which were called deduplication appliances at the time, counted on these full backups, which have a lot of redundancy between them, to support claims of 15:1 or greater data efficiency.
Most modern applications do not do full backups as they did in the past. In fact, after the initial seed, some may never do another full backup again. This considerably reduces the effective deduplication efficiency rate to as low as 5:1 in some environments. Disk backup appliances are evolving to add features like scalability and application integration in addition to deduplication.
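The direction of that drop follows from simple arithmetic. The figures below (100TB data set, 2TB of unique change per week, 12-week retention, 1.5:1 residual local reduction) are illustrative assumptions, not benchmark results; real environments land somewhere in between depending on workload and schedule.

```python
# Toy model of why the headline deduplication ratio falls when the backup
# application stops sending weekly fulls. All figures are illustrative
# assumptions, not benchmark results.

FULL_TB, WEEKLY_CHANGE_TB, WEEKS, LOCAL_REDUCTION = 100.0, 2.0, 12, 1.5

unique_tb = FULL_TB + WEEKS * WEEKLY_CHANGE_TB    # data that must be kept
physical_tb = unique_tb / LOCAL_REDUCTION         # after compression/intra-dedup

legacy_logical = WEEKS * FULL_TB                      # a full copy sent every week
forever_logical = FULL_TB + WEEKS * WEEKLY_CHANGE_TB  # seed full, then changes only

print(f"legacy schedule:     {legacy_logical / physical_tb:.1f}:1")
print(f"incremental forever: {forever_logical / physical_tb:.1f}:1")
```

The physical storage is the same in both cases; what changes is how much redundant logical data the appliance ever sees, and therefore how impressive the ratio looks.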
Deduplication Interferes with Recovery
The opposite of ingestion is restoration, and deduplication impacts restores, VM boots, and recoveries in much the same way it interferes with backup. Like backups, restores, VM boots, and recoveries are evolving, and that makes the overhead of deduplication even more concerning. For small recoveries, the overhead of data rehydration is lost in the general overhead of the restore. But as the size of the recovery grows, the time required to rehydrate deduplicated data becomes increasingly apparent compared to a backup appliance that does not deduplicate, or that uses a two-stage deduplication process.
Of greater concern is deduplication overhead when using modern data protection applications that can perform “recovery in place”. With recovery in place, the data protection appliance creates a volume, and the application accesses that volume directly. This feature reduces recovery time objectives (RTOs) by making the volume available more quickly than transferring it across the network. Real-time deduplication complicates this process. First, most backup applications store backed-up VMs in a proprietary backup file or blob. The appliance deduplicates that backup file by comparing it to other backup files it has already stored. When a recovery-in-place request is made to a real-time deduplication appliance, the data must first be rehydrated and presented to the backup software. The backup software then creates a volume on the backup appliance and instantiates the VM there. But if that volume is deduplicated in real time, this (now production) volume is being deduplicated and compressed while simultaneously trying to respond to application or user requests. The result is very poor performance, often to the point of being unusable.
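The rehydration step can be sketched as follows. Fixed-size chunks and an in-memory store are simplifying assumptions, but the pattern, a fingerprint lookup and a chunk fetch for every restored block, is the cost described above:

```python
import hashlib

# Sketch of the rehydration cost described above: a restore (or a
# recovery-in-place mount) must walk the backup's recipe and fetch every
# chunk before any data can be served. Fixed-size chunks and an in-memory
# store are simplifying assumptions.

CHUNK = 4096
store = {}  # fingerprint -> chunk data

def backup(data: bytes) -> list:
    """Deduplicate data into the store, returning its chunk recipe."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        fp = hashlib.sha256(piece).hexdigest()
        store.setdefault(fp, piece)
        recipe.append(fp)
    return recipe

def rehydrate(recipe: list) -> bytes:
    """Reassemble the original stream; cost grows with restore size."""
    return b"".join(store[fp] for fp in recipe)

original = b"VM" * 6000     # a stand-in for a backed-up VM image
recipe = backup(original)
restored = rehydrate(recipe)
print(f"restored {len(restored)} bytes via {len(recipe)} chunk lookups")
```

A small file needs only a handful of lookups; a multi-terabyte VM needs millions of them, all before the recovered volume can respond at production speed.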
As datasets grow, traditional deduplication appliances will face an ongoing challenge in keeping ingestion rates in line with the capacities of the appliance. A single system can only support so much inbound data, and the rate at which that data can be received is further constrained by how the software manipulates and manages the data as it arrives. Big backup environments should look for a scale-out approach that adds compute, memory, and network bandwidth along with disk capacity. IT should also look for systems that intelligently apply deduplication at the right time.
Sponsored by ExaGrid