Unstructured data creates challenges for IT professionals looking to protect it, more because of the sheer number of individual files than the capacity it consumes. To get around the problem of massive file growth, many data protection software solutions have resorted to image backups, which ignore the files and back up the entire volume at the block level. The problem with the image backup approach, as we describe in our previous blog, is that image backups lose file granularity and understanding, which makes restoration and intelligent retention significantly more difficult.
The Problem with File-By-File Backup
Since the inception of file servers, the common method of protecting the data they store has been for the backup software to log into the file server and “walk” the file system. Walking the file system means the backup software “crawls” the file system file by file to identify which files are new or modified and need protecting. The walk can be so time consuming and resource intensive that it takes longer to complete than transferring the data to backup storage.
The unprecedented growth of unstructured data, both in terms of capacity and file count, continues to make the situation worse each year. It is also why most unstructured data stores are protected via an image backup, despite the significant challenges with recovery, retention and meeting data privacy regulations. These challenges are so problematic that file-by-file protection may prove a better option for organizations. To make the case for file-by-file backup, a data protection vendor needs to improve its product’s performance and use its granular understanding of the data it is protecting to deliver greater value than traditional backup.
Improving File-By-File Backup Performance
The primary challenge with most file-by-file backup techniques is that software developers created them at a time when the quantity of files was not a concern. As file counts grew, file system walk times also grew, to the point that the technique became unusable. There are several methods to improve file-by-file backup performance. Vendors can simply build better algorithms that move through a file system significantly faster than legacy approaches. They can also leverage a logging system that tracks when a file is created or modified. Then, when a backup needs to occur, the log has the file names and their exact locations, so no walk of the file system is required. Finally, they can use a snapshot-like approach that provides a file-granular method of copying only changed data.
In some cases the initial backup still takes longer than an equivalent image-based backup, but subsequent backups are almost as fast as the image approach. Vendors may use a combination of the above techniques to deliver the fastest possible backup without sacrificing the granular understanding of the data they are protecting. In the end the customer shouldn’t care which technique is used, as long as backup windows are met.
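To make the difference between walking the file system and consulting a change log concrete, here is a minimal Python sketch. It is not any vendor's implementation. The ChangeLog class is a hypothetical stand-in for whatever journaling mechanism a product might use, and the function names are assumptions made for illustration.

```python
import os
import tempfile


def walk_for_changes(root, last_backup_time):
    """Legacy approach: visit every file and compare its modification time."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime > last_backup_time:
                changed.append(path)
    return sorted(changed)


class ChangeLog:
    """Hypothetical change log: paths are recorded as files are written."""

    def __init__(self):
        self._paths = set()

    def record(self, path):
        self._paths.add(path)

    def drain(self):
        """Return the logged paths and reset the log for the next backup."""
        paths, self._paths = sorted(self._paths), set()
        return paths


def demo():
    """Both approaches should identify the same changed file."""
    with tempfile.TemporaryDirectory() as root:
        log = ChangeLog()
        old = os.path.join(root, "old.txt")
        new = os.path.join(root, "new.txt")
        for path in (old, new):
            with open(path, "w") as handle:
                handle.write("data")
        last_backup = 1_000_000_000.0
        # Pin modification times so one file predates the last backup.
        os.utime(old, (last_backup - 100, last_backup - 100))
        os.utime(new, (last_backup + 100, last_backup + 100))
        log.record(new)  # in a real system the file server would log this
        return walk_for_changes(root, last_backup) == log.drain()
```

The walk touches every file no matter how few have changed, while the log hands back only the changed paths, which is why the logged approach scales with the change rate rather than the total file count.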
The Value of Fast File-By-File Backup
Another requirement in making file-by-file backup the primary method of protecting unstructured data is to add value beyond simply creating a copy of the data on a secondary storage device. The advanced file-by-file solutions enable organizations to directly address modern concerns with unstructured data. For example, most organizations have a particular set of files that have special requirements and other files that have almost no value after the initial backup. The problem with image backups is that organizations can’t set different policies per file or file type. The image is one big blob, and the retention policy has to be set on the “blob” not on the files within that blob.
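As an illustration of per-file retention, the following Python sketch applies a different retention period depending on file type. The policy table, the patterns and the retention periods are all hypothetical. A real product would expose its own policy model.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from fnmatch import fnmatchcase


@dataclass
class BackedUpFile:
    path: str
    backed_up_on: date


# Hypothetical policy table: the first matching pattern wins.
RETENTION_DAYS = [
    ("*.pdf", 7 * 365),  # e.g. records that must be kept for years
    ("*.tmp", 1),        # little value after the initial backup
    ("*", 90),           # default for everything else
]


def retention_days(path):
    """Look up the retention period for a single file."""
    for pattern, days in RETENTION_DAYS:
        if fnmatchcase(path, pattern):
            return days
    return 0  # only reached if the table has no catch-all pattern


def expired(files, today):
    """Return the paths whose retention period has passed."""
    return [
        f.path
        for f in files
        if f.backed_up_on + timedelta(days=retention_days(f.path)) < today
    ]
```

With an image backup, the equivalent decision can only be made once for the whole blob. Here each file expires on its own schedule.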
File-by-file backup solutions can also add value by enabling organizations to better search for data within the backup set. Most image backup solutions can restore individual files within the image blob. These solutions, though, typically don’t provide search across jobs or blobs. A file-by-file solution can provide file-by-file search across multiple backups. It can indicate, for example, how many copies of the same file are present within the same backup. File-by-file solutions can go further and provide capabilities like data classification and contextual search. Imagine being able to find which data within your backups contains credit card numbers.
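The sketch below shows, in Python, roughly what search and classification across backup jobs could look like. It assumes a hypothetical in-memory catalog that maps each job ID to the files it holds and their text content. A real catalog would be indexed on disk. Candidate numbers are found with a regular expression and then filtered with the standard Luhn checksum, which reduces false positives but does not prove a number is a real card.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Standard Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def files_with_card_numbers(catalog):
    """Return (job_id, path) pairs whose content holds a Luhn-valid number."""
    hits = []
    for job_id, files in catalog.items():
        for path, text in files.items():
            for match in CARD_CANDIDATE.finditer(text):
                digits = re.sub(r"[ -]", "", match.group())
                if luhn_valid(digits):
                    hits.append((job_id, path))
                    break  # one hit is enough to flag the file
    return sorted(hits)


def jobs_containing(catalog, path):
    """Search every backup job for a given file path."""
    return sorted(job for job, files in catalog.items() if path in files)
```

Because the catalog is keyed by file rather than by image, the same loop answers both questions: where a file lives across jobs, and which files hold sensitive content.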
Finally, a file-by-file solution may also enable the organization to better meet data privacy regulations like the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which, upon a user’s request, require the deletion of that user’s data, including from backups. Most image solutions can’t adhere to these policies because deleting a file within an image corrupts the blob. The only option is a delete-on-restore capability, which most don’t have, and it is unclear if the governing bodies support this approach.
A file-by-file solution can have specific data removed from the backup set without impacting the integrity of the rest of the backup data set. It meets data privacy policies directly and doesn’t require crossed fingers, hoping that the legislating bodies agree that backups deserve special treatment. Early indications are that they see no differentiation between backup data and production data.
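A minimal Python sketch of why this works follows. It assumes a hypothetical backup set in which every file is stored as an independent entry with its own checksum. Removing one user's files then leaves every other entry verifiable. The storage layout and function names are illustrative only.

```python
import hashlib


def make_backup_set(files):
    """Store each file as an independent entry with its own checksum."""
    return {
        path: {"data": data, "sha256": hashlib.sha256(data).hexdigest()}
        for path, data in files.items()
    }


def purge(backup_set, should_delete):
    """Remove every entry whose path matches the predicate."""
    removed = [path for path in backup_set if should_delete(path)]
    for path in removed:
        del backup_set[path]
    return sorted(removed)


def verify(backup_set):
    """Check that every remaining entry still matches its checksum."""
    return all(
        hashlib.sha256(entry["data"]).hexdigest() == entry["sha256"]
        for entry in backup_set.values()
    )
```

For example, purging everything under a hypothetical users/alice/ path removes only those entries, and verify still passes for the rest of the set.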
File-by-file is the right way to protect unstructured data, but most organizations can’t afford to slow their backup processes to a crawl. Vendors need to rethink file-by-file backup strategies so that they are much faster than they used to be. If vendors can add value on top of that, like enhanced restore, retention, search and support for cloud storage, then file-by-file backups are not only the right method to protect unstructured data, they are the best method.