Unstructured data presents two challenges to the typical backup process. First, it is often the largest data set, in terms of capacity, that the backup application needs to protect. Second, and potentially more problematic, is the sheer number of files that the backup application needs to protect. Unstructured data stores can hold hundreds of thousands, millions, and in some cases, billions of files. Finding the files that are new or have changed since the last backup can take more time than actually transferring those files to backup storage. The time it takes to scan unstructured data leads many backup vendors to use image backups, but image backups of unstructured data have problems of their own.
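To see why the scan itself becomes the bottleneck, consider a minimal sketch of a file-by-file change scan. It is an illustration only, not how any particular backup product works: the function name `find_changed_files` and the use of modification time as the change signal are assumptions. The point is that every file must be visited and checked, even when only a handful have changed.

```python
import os

def find_changed_files(root, last_backup_time):
    """Return paths under root modified after last_backup_time (epoch seconds).

    Hypothetical helper: real backup software may also compare sizes,
    checksums, or file system change journals.
    """
    changed = []
    # os.walk visits every directory and file, so the cost grows with the
    # total file count, not with the number of files that actually changed.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_backup_time:
                    changed.append(path)
            except OSError:
                # The file disappeared or is unreadable; skip it in this sketch.
                continue
    return changed
```

With billions of files, the `os.stat` call on each one is what dominates the backup window, regardless of how little data ends up being transferred.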
The Image Backup Advantage
Image backups operate a level below the file system. As a result, these backups are not impacted by how many files the file system is storing. The first image backup is a block-by-block copy of the volume. Subsequent image backups, assuming the backup software or operating system supports it, are block-level incremental (BLI) backups. BLI backups only transfer the blocks that change when a file is modified or when a user or application creates a new file. Both the full image backup and the BLI backup can happen in a fraction of the time that a file-by-file scan of the file system takes.
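The idea behind a block-level incremental can be sketched in a few lines. This is a simplified illustration under stated assumptions: the volume is modeled as a byte string, the 4 KB `BLOCK_SIZE` is arbitrary, and changes are detected by comparing SHA-256 hashes of each block. Products typically rely on changed-block tracking supplied by the operating system or hypervisor rather than rehashing the volume, and the function names here are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch

def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash every block of the volume image; recorded at the time of a backup."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]

def changed_blocks(previous_hashes, data, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that differ from the last backup."""
    changed = {}
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # A block is transferred if it is new (the volume grew) or its content changed.
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            changed[index] = block
    return changed
```

Note that nothing in this sketch knows which file a block belongs to. The work depends on the number of blocks, not the number of files, which is both the speed advantage and the reason the backup has no file-level awareness.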
The Image Backup Problem
The problem with image backups of unstructured data is that these backups lose their granular understanding of the files they are protecting. While most image-based backup solutions do provide individual file recovery, the restoration must come from a known backup job. An administrator typically can’t scan for a specific file across multiple image-based backup jobs. Essentially, the only individual restore request that image-based backups handle well is a recovery from the most recent backup.
Another challenge with image backups is the removal of data from within the image, a capability that many believe is a requirement of new data privacy legislation like the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA). Among the extensive requirements to protect and retain data within these regulations are also specific requirements to remove a consumer’s personal information at the consumer’s request, often referred to as “the right to be forgotten.”
Image backups need all the copied blocks of the volume to be available. Removal of one block from that backup corrupts the file, and the entire backup becomes invalid. Some vendors argue that as long as the backup application removes data belonging to a “forgotten” user as it is restoring other data, then the application complies with the regulations. At this point, however, there is no case law to support that point of view, and for the most part, that point of view conflicts with the regulations.
What Should IT Professionals Do?
IT managers need to reconsider how they are protecting unstructured data. It is a data set that is not only growing in size but also increasing in criticality to the organization. IT probably needs to make investments in either advanced, high-speed file-by-file technology or in the more aggressive use of archive technologies. In our next entry, we’ll discuss the pros and cons of file-by-file backup and how vendors can improve its performance and capabilities to make it a more viable option for unstructured data protection.
Storage Switzerland and Aparavi will hold an in-person presentation called “Are you Treating Unstructured Data like a Second Class Citizen.” You can register here. Attached to the presentation is an exclusive white paper entitled “It’s Not IF your Backup Software is Using Cloud Storage, It’s HOW!” In the white paper, we cover another challenge with image-based unstructured data backup: inefficient use of cloud storage, which needlessly forces the growth of the on-premises data storage footprint. As soon as the presentation starts playing, you can download this valuable asset.