How and Why to Develop a Robust File Retention Strategy

Posted on December 4, 2019 by George Crump

Backup applications, generally speaking, store the data they protect for a given amount of time. Increasingly, the default setting for most organizations is forever. While we are not opposed to a forever retention strategy, long-term retention should not be part of a backup. Backup data should have a specific retention time, and most organizations set it far too long.

What Needs Retention?

Fundamentally, all data needs a retention time associated with it, but the dataset that is the most problematic is unstructured or file data. Databases, while important, are reasonably self-contained. Organizations quickly understand that having 365 versions of a database is both impractical and unnecessary. Unstructured data, by definition, is less contained, and different types of files have different retention requirements.

Why Manage Retention?

There are legal reasons to retain certain types of data for a given period. Organizations have to prove they are meeting these regulations or face fines. There are also organizational best practices that require a specific data retention period. A set retention policy that holds up to legal requirements and corporate best practices enables an organization to justifiably delete data after the retention time has passed. Doing so can protect the organization from future lawsuits and save the organization time in responding to e-discovery and regular recovery requests. For example, imagine if the organization didn’t have a set policy, or decided it would keep all data in backups for twenty years. Legally, an opposing organization or group can file an e-discovery request requiring the organization to deliver backup data that is twenty years old. Imagine trying to find and restore twenty-year-old data.

Another reason to maintain a set retention policy is to save money. Storage is inexpensive, but it isn’t free. It costs money to store data for seven years, and reducing the amount of data the backup is saving can significantly reduce the infrastructure investment.

Why Are Backup Retention Times So Long?

There are specific types of data that organizations need to retain within backups for a particular time. For example, seven years is typical for financial data, and the “life of the patient” is common in healthcare. However, as a percentage of total backup storage, these requirements should consume a relatively small amount of space, yet most organizations retain all backups for seven years or some other arbitrary number. They are treating all backup data the same, which leads to massive backup storage repositories.

The Backup Retention Problem

The primary reason for these long backup retention times is the backup application’s lack of a granular understanding of the data that it is protecting. Most backup solutions back up data as jobs and give all the data within a job the same retention time. There is no option to retain specific files within the backup job for one period and other files for a different period. Some applications have a sophisticated enough metadata database to know what files are in the job, but they can’t assign different retention policies to individual files. Instead, the application sets retention based on the age of that iteration of the job.
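To make the job-level behavior concrete, here is a minimal Python sketch. It is not taken from any real backup product: the `BackupJob` class, the `expired_jobs` helper, the job names, and the file names are all hypothetical. It only illustrates that when retention is keyed to the age of the job, every file inside the job expires (or survives) together.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupJob:
    """Hypothetical record of one run of a backup job."""
    name: str
    run_date: date
    files: list[str]


def expired_jobs(jobs: list[BackupJob], retention_days: int, today: date) -> list[BackupJob]:
    """Return jobs older than the retention window.

    Retention is decided per job run, never per file, so a short-lived
    document is kept exactly as long as the financial records beside it.
    """
    cutoff = today - timedelta(days=retention_days)
    return [job for job in jobs if job.run_date < cutoff]


# Illustrative data only: both runs contain the same mix of file types.
jobs = [
    BackupJob("nightly-2012-01-01", date(2012, 1, 1), ["ledger.xlsx", "lunch-menu.docx"]),
    BackupJob("nightly-2019-11-01", date(2019, 11, 1), ["ledger.xlsx", "lunch-menu.docx"]),
]

expired = expired_jobs(jobs, retention_days=7 * 365, today=date(2019, 12, 4))
expired_names = [job.name for job in expired]
```

In this sketch the only lever is `retention_days` for the whole job, which is why organizations tend to set it to the longest requirement that applies to any file in the job.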

Part of the “job focus” of backup applications came from their original dependence on tape technology. It was faster to batch up a bunch of small files and send them all at once to the tape device. Also, it is virtually impossible to set different retention policies for different data types on a single tape. In tape technology, the entire tape has to age before IT can discard it. Some applications can condense tapes by copying only the jobs with active retention to a new tape and then erasing the original tape.
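The tape condensing step described above can be sketched roughly as follows. This is an illustration under assumptions, not how any particular application implements it: `TapeJob`, `condense_tape`, and the sample job names and dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TapeJob:
    """Hypothetical record of one backup job written to a tape."""
    name: str
    expires_on: date


def condense_tape(tape_jobs: list[TapeJob], today: date) -> list[TapeJob]:
    """Return the jobs that still have active retention.

    These are the jobs that would be copied to a new tape; once the copy
    is complete, the original tape can be erased and reused.
    """
    return [job for job in tape_jobs if job.expires_on > today]


# Illustrative data only: one tape holding three jobs with different expiry dates.
old_tape = [
    TapeJob("finance-2013", date(2020, 6, 30)),
    TapeJob("scratch-2013", date(2014, 1, 1)),
    TapeJob("hr-2013", date(2021, 1, 1)),
]

new_tape = condense_tape(old_tape, today=date(2019, 12, 4))
new_tape_names = [job.name for job in new_tape]
```

Note that the unit being kept or discarded is still the whole job, so condensing reclaims tape capacity without adding any file-level control.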

Now that disk is the primary backup target, aging specific files should be possible. The problem is that most modern backup applications back up at an image level. The data within those images must remain intact, or the entire image is invalid. These applications also don’t tend to have a cross-job understanding of where files are. The administrator has to know which job contains the desired file before searching that job for it.

The irony of organizations selecting a retain-forever (or retain-for-a-long-time) strategy is that they often do so because of the lack of flexibility in their data protection product. That same data protection product often makes it hard, if not impossible, to find data within a seven-year-old data set.

The Archive Retention Problem

Another challenge is that most organizations don’t have an archiving solution that runs alongside their data protection solution. An archiving solution enables the organization to reduce its backup retention to a few months, dramatically reducing the infrastructure requirements of the backup process. The archive also generally has better search and granular data retention policy capabilities. Convincing organizations to create an archive strategy, though, is a sermon for another Sunday. The reality is that organizations need a way to improve their backup retention capabilities.

Solving the Backup Retention Problem

The critical area of concern is how backup stores unstructured data. Instead of performing file-unaware image backups, IT needs to look for solutions that can provide fast, file-by-file backups. The application then needs to store the backup data in such a way that retention policies can be applied based on file type or even to individual files. If the backup solution has this file-by-file visibility, it can also add content indexing and data classification. It, in essence, provides all of the attributes of an archive solution other than the actual migration of data, which tends to be the area of most profound concern among IT professionals. These file-aware data protection solutions could themselves add a migration capability, making them full-fledged archive solutions.
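As a rough sketch of what retention by file type could look like, the Python below maps file extensions to retention periods and evaluates each backed-up file on its own. Everything here is an assumption made for illustration: the names `RETENTION_DAYS_BY_SUFFIX`, `DEFAULT_RETENTION_DAYS`, `retention_days_for`, and `is_expired`, the paths, and especially the retention values, which are placeholders and not legal or regulatory guidance.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

# Hypothetical policy table: retention in days, keyed by file extension.
RETENTION_DAYS_BY_SUFFIX = {
    ".xlsx": 7 * 365,  # e.g. financial spreadsheets
    ".log": 30,        # e.g. application logs
}
DEFAULT_RETENTION_DAYS = 90  # assumed fallback for unclassified files


def retention_days_for(path: str) -> int:
    """Look up the retention period for a single file by its extension."""
    suffix = PurePosixPath(path).suffix.lower()
    return RETENTION_DAYS_BY_SUFFIX.get(suffix, DEFAULT_RETENTION_DAYS)


def is_expired(path: str, backed_up_on: date, today: date) -> bool:
    """Decide expiry per file version rather than per backup job."""
    return today > backed_up_on + timedelta(days=retention_days_for(path))


# Two files backed up on the same day age out at very different times.
today = date(2019, 12, 4)
ledger_expired = is_expired("finance/ledger.xlsx", date(2019, 1, 1), today)
log_expired = is_expired("app/server.log", date(2019, 1, 1), today)
```

A real product would more likely classify files by content or metadata than by extension alone. The point of the sketch is only that the retention decision moves from the job to the individual file.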

Learn More!

In our upcoming webinar, Storage Switzerland and Aparavi will discuss dealing with file retention challenges and their impact on meeting regulations like GDPR, CCPA, and HIPAA. To learn more, register for the event “Are You Treating Unstructured Data as a Second Class Citizen?”

As our thank you for watching the event you’ll also have access to our latest white paper “It’s Not IF your Backup is Using Cloud Storage it’s HOW,” in which we discuss how the same techniques that provide excellent file retention also enable optimal use of cloud storage.
