Is Deduplication Useless on Archive Data?

Posted on December 7, 2015 by George Crump

One of the techniques that storage vendors use to reduce the cost of hard disk-based storage is deduplication. Deduplication is the elimination of redundant data across files. The technology is ideal for backup, since so much of the current copy of data is identical to the prior copy. The few extra seconds required to identify redundant data are worth the savings in disk capacity. Deduplication for primary storage is popular on all-flash arrays (AFAs). While the level of redundancy is not as great, the premium price of flash makes any capacity savings important. In addition, given the excess performance of AFAs, the deduplication feature can often be added without a noticeable performance impact. There is one process, though, where deduplication provides little value: archive. IT professionals need to measure costs differently when considering a storage destination for archive.
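To make "redundant data" concrete, here is a minimal sketch of how a deduplication ratio could be estimated. It assumes fixed-size chunks and SHA-256 fingerprints, which is a simplification for illustration only; commercial products generally use their own (often variable-size) chunking, and the function name and chunk size here are hypothetical.

```python
import hashlib


def dedup_ratio(blobs, chunk_size=4096):
    """Estimate logical bytes / unique bytes using fixed-size chunks.

    Illustrative only: chunk_size and the hashing scheme are assumptions,
    not a description of any vendor's algorithm.
    """
    seen = set()      # fingerprints of chunks already "stored"
    logical = 0       # bytes presented to the store
    unique = 0        # bytes that would actually be written
    for blob in blobs:
        for start in range(0, len(blob), chunk_size):
            chunk = blob[start:start + chunk_size]
            logical += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                unique += len(chunk)
    return logical / unique if unique else 1.0
```

Two identical backup copies yield a ratio of 2.0 under this model, while a set of files that share no chunks yields 1.0, which is the situation a well-run archive tends toward.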

Why Deduplication Fails in Archive

When correctly implemented, an archive should be where the last known good copy of data is stored. A file stored in the archive is typically unique, and often only its final version is kept. Because the archive storage target is not going to be flash, its cost per GB is low, and the redundancy rate would need to be fairly high for the capacity saved to justify deduplication. While there is some potential for redundancy between the various files stored in an archive, it should be relatively low.
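The link between cost per GB and the redundancy needed to justify deduplication can be sketched with a simple break-even model. This is an assumption for illustration, not a formula from any vendor: deduplication is treated as a single fixed overhead (licensing, compute, and so on), and it pays off once the capacity it saves is worth more than that overhead. All names and figures are hypothetical.

```python
def breakeven_dedup_ratio(logical_tb, cost_per_tb, dedup_overhead):
    """Return the deduplication ratio at which savings equal the overhead.

    Savings at ratio r are logical_tb * cost_per_tb * (1 - 1/r).
    Returns None when the overhead is at least the full raw storage cost,
    because no ratio can then recover it.
    """
    raw_cost = logical_tb * cost_per_tb
    if raw_cost <= 0 or dedup_overhead >= raw_cost:
        return None
    return 1.0 / (1.0 - dedup_overhead / raw_cost)
```

With a hypothetical overhead of 5,000 on 100 TB, media at 1,000 per TB breaks even at a ratio of roughly 1.05:1, while media at 100 per TB needs 2:1. The cheaper the storage, the more redundancy deduplication has to find.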

A second and perhaps a third copy of the archived data is sometimes created, but these copies are designed to maintain chain of custody and to recover archive data in the case of a disaster. In both cases, these copies must be separate, standalone data copies and can’t be part of a deduplication process.

A Different Storage Calculation

Because deduplication delivers little on archive data, storage costs for archive need to be calculated very differently. The only efficiency technique that will work is compression, and only in some cases. Without an effective efficiency technology, storage costs need to be based on raw capacities, not the effective capacities that some vendors feature in their brochures.

Basing storage costs on raw capacity is relatively easy, but it involves more than dividing the cost of the storage media by its capacity. The “system” cost needs to be calculated. For a disk-based archive, this means including the storage controller hardware and the disk shelves that surround the actual hard disk media. In some cases empty disk shelves can be more expensive than the drives that go in them. For tape libraries, the cost of the library and its drives needs to be calculated into the price.
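As a rough sketch of the "system" cost calculation described above, the function below adds the fixed hardware (controller and shelves for disk, library and drives for tape) to the media cost before dividing by raw capacity. The function name, parameters, and any figures used with it are hypothetical.

```python
def system_cost_per_tb(fixed_costs, media_count, media_unit_cost, media_capacity_tb):
    """Return total system cost divided by raw capacity in TB.

    fixed_costs is an iterable of non-media costs, for example the
    controller and shelves (disk) or the library and tape drives (tape).
    """
    raw_tb = media_count * media_capacity_tb
    if raw_tb <= 0:
        raise ValueError("raw capacity must be positive")
    total = sum(fixed_costs) + media_count * media_unit_cost
    return total / raw_tb
```

For example, 24 drives of 4 TB at a hypothetical 200 each look like 50 per TB on media alone, but adding a 15,000 controller and two 3,000 shelves puts the system at 268.75 per raw TB.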

Comparing Disk to Tape

Without the aid of deduplication, the comparison of disk and tape for long-term archiving is more interesting (and realistic). Both disk and tape will have large upfront costs associated with them: the cost of the storage controller for disk and the empty library for tape. A hard disk-based archive will add the cost of additional shelves, and each piece of disk media has a fair amount of technology on it. A tape archive will need to factor in the cost of buying enough tape drives to sustain writes to, and restores from, the archive, but the media has almost no technology associated with it and its cost per GB is relatively low.

Assuming that the upfront costs for a disk-based and a tape-based archive are similar, the cost to scale the disk archive, thanks to more expensive shelves and media, will outpace the cost to scale a tape-based archive.
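The scaling argument can be sketched with a simple model, assuming equal upfront costs and that capacity grows by adding media (plus shelves, in the disk case). It deliberately ignores extra tape drives, library expansion, power, and media refresh, and every name and price in it is a hypothetical placeholder.

```python
import math


def total_cost(capacity_tb, upfront, media_tb, media_cost,
               shelf_cost=0.0, media_per_shelf=1):
    """Return upfront cost plus the media (and shelves) needed for capacity_tb.

    Leave shelf_cost at 0 to model tape, where cartridges are added
    without a per-shelf enclosure in this simplified model.
    """
    units = math.ceil(capacity_tb / media_tb)
    shelves = math.ceil(units / media_per_shelf) if shelf_cost else 0
    return upfront + units * media_cost + shelves * shelf_cost
```

Comparing `total_cost(tb, 20000, 4, 200, shelf_cost=3000, media_per_shelf=12)` with `total_cost(tb, 20000, 6, 30)` at increasing values of `tb` shows the gap widening as the archive grows, which is the effect described above.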

Conclusion

The lack of redundant data makes deduplication almost useless when considering archive capacity costs. Raw capacity is more critical, and as these archives scale, a disk-only approach becomes harder to justify on cost. For optimal cost containment, a small disk front end that does not need to scale should be paired with a large, scalable tape back end.

Sponsored by Fujifilm Dternity, Powered by StrongBox

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.


3 comments on “Is Deduplication Useless on Archive Data?”

Louis Imershein says:

December 7, 2015 at 11:58 am

Two observations from back in my disk-based archiving days:

1) Archives are rarely implemented correctly. In many cases, it can be easier to dedupe than fix the archiving process.

2) The benefits of deduplication are extremely data dependent. You have to ask the question, “what are you archiving?”

Brad Johns says:

December 7, 2015 at 5:18 pm

George, I enjoyed your article on deduplication on archive data. As you mention, archive data may need to be kept for years, perhaps decades. One considerable problem with using deduplication over that length of time is that the algorithms used are proprietary and specific to the software/appliance being used. This creates a long-term dependency on the vendor that can be both risky (will the vendor still be around?) and possibly expensive (requires the purchase of newer versions of the deduplication solution).

Louis Imershein says:

December 11, 2015 at 2:58 pm

This is why hardware-only data reduction solutions don’t work for long-term storage. The clear solution to the problem Brad mentions is to be able to virtualize the data reduction software itself. That way it can be stored alongside the data – software-defined storage to the rescue!
