The Problem with Image Backup for Unstructured Data

Unstructured data presents two challenges to the typical backup process. First, it is often the largest data set, in terms of capacity, that the backup application needs to protect. Second, and potentially more problematic, is the sheer number of files involved. Unstructured data stores can hold hundreds of thousands, millions, and in some cases, billions of files. Finding the files that are new or have changed since the last backup can take more time than actually transferring those files to backup storage. The time it takes to scan unstructured data leads many backup vendors to use image backups, but image backups of unstructured data have problems of their own.
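To make the scanning cost concrete, here is a minimal sketch of a file-by-file change scan, assuming the backup application decides what to protect by comparing each file's modification time against the time of the last backup. The function name find_changed_files and its parameters are hypothetical and not taken from any particular product. The point is that every file must be visited, even when almost nothing has changed.

```python
import os
import tempfile


def find_changed_files(root, last_backup_time):
    """Walk every file under root and return (changed_paths, files_scanned).

    Illustrative assumption: a file counts as changed when its modification
    time is later than last_backup_time (seconds since the epoch).
    """
    changed = []
    scanned = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            scanned += 1  # every file is visited, changed or not
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime > last_backup_time:
                changed.append(path)
    return changed, scanned


if __name__ == "__main__":
    # Tiny demonstration on a throwaway directory.
    with tempfile.TemporaryDirectory() as demo_root:
        with open(os.path.join(demo_root, "report.txt"), "w") as handle:
            handle.write("example")
        changed_paths, total = find_changed_files(demo_root, last_backup_time=0)
        print(f"scanned {total} file(s), {len(changed_paths)} need backup")
```

The work in this loop grows with the total number of files, not with the number of changed files, which is why the scan can take longer than the data transfer on a file system holding millions or billions of entries.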

The Image Backup Advantage

Image backups operate a level below the file system. As a result, the number of files the file system stores does not affect them. The first image backup is a block-by-block copy of the volume. Subsequent image backups, assuming the backup software or operating system supports it, are block-level incremental (BLI) backups. BLI backups transfer only the blocks that change when a file is modified or when a user or application creates a new file. Both the full image backup and the BLI backup can complete in a fraction of the time that a file-by-file scan of the file system takes.
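The idea behind a block-level incremental can be sketched in a few lines. This is an illustration under stated assumptions, not how any specific product works: the volume is modeled as a bytes object, the block size is an assumed 4 KiB, and changed blocks are detected by comparing SHA-256 digests against those recorded at the previous backup. Real products usually rely on changed-block tracking from the operating system or hypervisor, which avoids re-reading the whole volume. The names block_hashes and changed_blocks are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this illustration


def block_hashes(volume, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of the volume."""
    return [
        hashlib.sha256(volume[offset:offset + block_size]).digest()
        for offset in range(0, len(volume), block_size)
    ]


def changed_blocks(previous_hashes, volume, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that differ from the
    previous backup, including blocks that did not exist before."""
    changed = {}
    for index, offset in enumerate(range(0, len(volume), block_size)):
        block = volume[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            changed[index] = block
    return changed


if __name__ == "__main__":
    original = bytes(BLOCK_SIZE * 4)       # four zero-filled blocks
    baseline = block_hashes(original)      # recorded at the full backup

    modified = bytearray(original)
    modified[5000] = 1                     # one byte changes inside block 1
    delta = changed_blocks(baseline, bytes(modified))
    print(sorted(delta))                   # indexes of blocks to transfer
```

Nothing in this sketch knows or cares how many files live on the volume. Only the blocks whose contents changed are selected for transfer, which is the property that makes image backups attractive for file-dense data sets.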

The Image Backup Problem

The problem with image backups of unstructured data is that these backups lose their granular understanding of the files they are protecting. While most image-based backup solutions do provide individual file recovery, the restoration must come from a known backup job. An administrator typically can’t scan for a specific file across multiple image-based backup jobs. Essentially, the only individual restore request that image-based backups handle well is a recovery from the most recent backup.

Another challenge with image backups is removing data from within the image, a capability that many believe is required by new data privacy legislation like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Among the extensive requirements in these regulations to protect and retain data are specific requirements to remove a consumer’s personal information at that consumer’s request, often referred to as “the right to be forgotten.”

Image backups need all the copied blocks of the volume to be available. Removing even one block from the backup corrupts it, and the entire backup becomes invalid. Some vendors argue that as long as the backup application removes data belonging to a “forgotten” user as it is restoring other data, then the application complies with the regulations. At this point, however, there is no case law to support that point of view, and for the most part, that point of view conflicts with the regulations.
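The dependency on every block can be shown with a small sketch. It assumes a simplified model in which a backup is a mapping from block index to block contents and a restore overlays each incremental, oldest first, on top of the full image. The function restore_volume and this data layout are hypothetical, chosen only to illustrate why deleting a single block in place is not an option.

```python
def restore_volume(full_backup, incrementals, block_count):
    """Rebuild a volume from a full image plus incrementals (oldest first).

    Each backup is modeled as {block_index: block_bytes}. The restore fails
    if any block of the volume is absent from the combined backups.
    """
    blocks = dict(full_backup)
    for delta in incrementals:
        blocks.update(delta)  # newer copies of a block replace older ones

    missing = [index for index in range(block_count) if index not in blocks]
    if missing:
        raise ValueError(f"cannot restore: missing blocks {missing}")
    return b"".join(blocks[index] for index in range(block_count))


if __name__ == "__main__":
    full = {0: b"aa", 1: b"bb", 2: b"cc"}
    incrementals = [{1: b"BB"}]
    print(restore_volume(full, incrementals, block_count=3))

    del full[2]  # simulate removing one block to "forget" some data
    try:
        restore_volume(full, incrementals, block_count=3)
    except ValueError as error:
        print(error)
```

In this model the image has no notion of which file, or which person, a block belongs to. Removing data therefore means either invalidating the backup or filtering the data out at restore time, which is the approach some vendors propose.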

What Should IT Professionals Do?

IT managers need to reconsider how they are protecting unstructured data. This data set is not only growing in size but also increasing in criticality to the organization. IT probably needs to invest either in advanced, high-speed file-by-file technology or in more aggressive use of archive technologies. In our next entry, we’ll discuss the pros and cons of file-by-file backup and how vendors can improve its performance and capabilities to make it a more viable option for unstructured data protection.

Storage Switzerland and Aparavi will hold an in-person presentation called “Are you Treating Unstructured Data like a Second Class Citizen.” You can register here. Attached to the presentation is an exclusive white paper entitled “It’s Not IF your Backup Software is Using Cloud Storage, It’s HOW!” In the paper, we cover another challenge with image-based unstructured data backup: inefficient use of cloud storage, which needlessly forces the growth of the on-premises data storage footprint. As soon as the presentation starts playing, you can download this valuable asset.


George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
