How Big Backup Impacts Deduplication

“Big Backup” is an enterprise challenge in which millions (or even billions) of files must be processed each night so they can be stored safely and cost-effectively on backup devices. These backup jobs not only involve large numbers of files or objects, they can also be very large in overall size. Big backup affects every aspect of the backup process, but especially the hardware that the backups land on.

Big Backup is Different

Big backup environments are already being optimized through upgrades to backup network infrastructure and backup software. These environments long ago switched to doing mostly incremental backups with an occasional full backup rollup. They are mature environments that can deliver a backup payload very quickly, and much of that backup data tends to be unique.

In these environments deduplication, while important, delivers less return on investment than in a smaller data center, and it can “get in the way” of the primary goals for big backup: fast backups and recoveries. Greater control over what gets deduplicated and what does not may be required in big backup environments.

This leads to a fundamentally different requirement for the backup device. Like most environments, these data centers are looking to leverage disk as at least the initial backup target and, in some cases, the only backup target. They are also typically on their third implementation of disk backup devices, having outgrown prior systems.

While they want to leverage deduplication, big backup environments need greater control and granularity over what gets deduplicated and when. They need the ability to make their own cost-benefit analyses of when to apply deduplication and when not to.

This combination of factors is driving the search for a hardware backup target that is better able to handle the big backup challenge. The backup manager must find a solution that will not, once again, need to be prematurely refreshed in a few years, and one that balances storage efficiency with backup performance. As SEPATON’s CTO Jeff Tofano discussed in a recent white paper, the backup manager needs to be armed with the right questions to make sure that the next backup system purchased is the right one.

The Impact of Storage Efficiency

Deduplication has become almost a checkbox feature in disk backup hardware, and many vendors hope that users assume all deduplication is the same. For example, many will try to avoid a discussion of how deduplication impacts performance. Several strategies exist to deduplicate data, and all have their pros and cons. While the performance impact of deduplication may not be noticeable to the application or user, it still exists. Through extra processing power, more efficient code, or lower prioritization of the process, vendors have hidden the negative effect that deduplication has on performance. Big backup often exposes these attempts to hide the performance impact.
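To see where that cost comes from, the sketch below is a minimal model of inline, hash-index deduplication written in Python. It assumes fixed 8 KB chunks and an in-memory index, and the names (ingest, fingerprint_index, CHUNK_SIZE) are purely illustrative; it is not any particular vendor’s implementation. The point is that every chunk pays for a fingerprint computation and an index lookup, whether or not it turns out to be a duplicate.

import hashlib

CHUNK_SIZE = 8 * 1024   # assumed fixed 8 KB chunks; real products often use variable-size chunking

fingerprint_index = {}  # fingerprint -> location of the stored chunk
stored_chunks = []      # stands in for the disk pool

def ingest(stream):
    """Deduplicate a backup stream inline, returning the 'recipe' needed to rebuild it."""
    recipe = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        fp = hashlib.sha256(chunk).hexdigest()  # fingerprint the chunk
        location = fingerprint_index.get(fp)    # index lookup happens on every chunk
        if location is None:                    # new data: store it and remember where
            location = len(stored_chunks)
            stored_chunks.append(chunk)
            fingerprint_index[fp] = location
        recipe.append(fp)
    return recipe

Run the same data through ingest a second time and no new chunks are stored, which is where the capacity savings come from; the hashing and lookup work, however, is done every time.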

What’s more interesting, and more important for the backup manager to understand, is the impact of deduplication over the course of time. As increasing amounts of data are injected into the backup process, the metadata (information about the data) that deduplication uses to track redundant data can become large and cumbersome. The disk backup hardware may become overwhelmed by the management of that metadata, and overall backup performance may begin to suffer.
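A rough back-of-envelope calculation shows why that metadata becomes a burden. The chunk size and index-entry size below are assumptions chosen for illustration, not any specific product’s figures:

# All sizes here are illustrative assumptions.
unique_data_tb  = 500          # unique data retained on the appliance after deduplication
avg_chunk_bytes = 8 * 1024     # assumed average chunk size
entry_bytes     = 64           # assumed size of one index entry (fingerprint plus location)

chunks   = unique_data_tb * 1024**4 / avg_chunk_bytes
index_gb = chunks * entry_bytes / 1024**3

print(f"{chunks / 1e9:.1f} billion index entries, ~{index_gb:,.0f} GB of metadata")
# Roughly 67 billion entries and about 4,000 GB of index for 500 TB of unique data,
# far more than fits in memory, so fingerprint lookups increasingly spill to disk.

Under those assumptions the index alone approaches the size of a small storage array, and every incoming backup stream has to consult it.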

Vendors have tried to avoid this problem by artificially limiting the size of the data sets stored on a device. In big backup, these artificial limits lead to backup appliance sprawl, where managers are left manually load balancing multiple backup targets. Even some backup devices with scale-out architectures limit the deduplication comparison to one node at a time. Doing so keeps the deduplication metadata size under control but reduces storage efficiency and complicates management.

Some vendors have even begun to use solid state drives to store the metadata tables so that lookup and compare times are reduced. While this can be viable in a primary storage environment, doing so for backup, which is under constant cost pressure, can be a problem.

The challenge with understanding the long-term impact of deduplication on the backup process is that it is difficult to simulate this condition in a typical 30-45 day evaluation cycle. It can take over a year for enough object data to accumulate in the deduplication metadata database for performance to be impacted. Unfortunately, once this happens, performance can degrade quickly.

This metadata tracking, and the way deduplicated data is scattered across the storage device, can also impact restoration times if deduplication is not properly implemented. When a file needs to be recovered, it must be reassembled from the multiple parts that deduplication divided it into and from the multiple areas of disk that store those parts. As the deduplication process ages, that dispersion gets wider and recovery performance may worsen.
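In simplified terms, and again as an illustrative sketch rather than a description of any particular product, a restore walks the file’s chunk recipe and fetches each chunk from wherever it happens to live; the more scattered those locations become, the more random the disk I/O a restore generates:

def restore(recipe, fingerprint_index, read_chunk):
    """Rebuild a file from its deduplicated chunks.

    recipe            -- the ordered fingerprints recorded at backup time
    fingerprint_index -- maps a fingerprint to the chunk's on-disk location
    read_chunk        -- reads one chunk from a given location
    """
    data = bytearray()
    for fp in recipe:
        location = fingerprint_index[fp]    # metadata lookup
        data.extend(read_chunk(location))   # potentially a random disk read
    return bytes(data)

The older and more fragmented the deduplication store, the less sequential those reads become, which is the mechanism behind the recovery slowdown described above.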

Big Backup Deduplication Requirements

The value of deduplication is undeniable, but those benefits should not come at the price of reduced backup or recovery performance. The goal in big backup architectures, first and foremost, is to meet backup and recovery objectives. As stated earlier, these environments have the backup network infrastructure and the client horsepower to deliver large backup payloads very quickly. In our next entry we will look at how disk-based backup appliances like SEPATON’s address big backup deduplication requirements.

SEPATON is a client of Storage Switzerland


George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
