Unstructured data has always been a sore spot for the data protection process. The growth in the number of files that make up unstructured data sets, and the capacity those files consume, now threatens to break the data protection model completely. With every indicator suggesting that the growth of unstructured data will not only continue but accelerate, IT needs a new strategy so it can stay ahead of the problem.
Unlike most data protection conversations, the problem with unstructured data protection does not revolve around restore speeds. The time it takes to restore an individual file is roughly the same across products and storage types. Even the restoration speed of a single file from the cloud is generally no longer a cause for concern.
The problem with unstructured data protection is everything else: making frequent backups, retaining and organizing the protected copies, and finding the exact copy needed for restoration. Solving this problem correctly sets an organization up for success not only with data protection but with all the other uses of a backup set: retention, restoration, and archive.
Why Traditional Data Protection Solutions Fall Short
Traditional data protection solutions typically back up unstructured data by scanning, or “walking,” the file system directory structure, indexing that information and looking for files that have changed since the last backup. If a file has changed, the solution copies it to backup storage.
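The core of that file-walk pass can be sketched in a few lines. This is a minimal illustration, assuming the modification time is the change indicator (many products also compare archive bits or checksums):

```python
import os


def find_changed_files(root, last_backup_time):
    """Walk the directory tree and yield files modified since the
    previous backup -- the heart of a file-walk incremental."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_time:
                yield path
```

Note that even when nothing has changed, this pass still has to visit every directory and stat every file, which is why the walk itself becomes the bottleneck at scale.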
The advantage of the file walk approach is that the backup system has specific knowledge of each individual file, and each version of that file, that it is protecting. The problem is that those files and versions are contained within a backup job, and retention and compliance policies can only be as granular as the job. If the organization wants to remove an individual instance of a file, it has to remove the entire job, along with any other files that job contains. Conversely, if the organization wants to retain certain files longer than the job's default policy allows, it cannot meet that requirement either.
The only potential workaround is to create special jobs for each file type or retention requirement, which means multiple jobs walking the file system, and that approach is not viable at scale. The ideal way to handle this problem is to classify data based on tags, which can be set automatically or manually. Jobs can then be set to back up only files of a certain classification and apply specific retention policies within each job.
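As a sketch of what tag-driven selection looks like, the snippet below uses a hypothetical policy table mapping classification tags to retention periods; the tag names and day counts are illustrative, not taken from any product:

```python
# Hypothetical policy table: classification tag -> retention period (days).
RETENTION_DAYS = {"finance": 2555, "hr": 1825, "general": 90}


def select_for_job(catalog, classification):
    """Given a catalog of (path, tag) pairs, select only the files that
    match this job's classification and attach that tag's retention,
    so retention is per-file rather than per-job."""
    days = RETENTION_DAYS[classification]
    return [(path, days) for path, tag in catalog if tag == classification]
```

The point of the design is that retention travels with the classification, not with the job that happened to capture the file.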
Why Modern Backup Solutions Still Fall Short
Modern backup solutions have addressed the unstructured data backup problem by performing some form of image-based, block-level incremental (BLI) backup. A BLI backup is much faster because it does not interface with the file system; instead, it operates below it, looking only for blocks that have changed since the last backup and copying those blocks to the backup device. Even though the backup is image based, most modern solutions can still provide file-level restores by transparently mounting the volume on the backup device and interfacing with it.
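Real BLI products typically rely on a changed-block-tracking driver below the file system; as a stand-in, the sketch below hashes fixed-size blocks against the previous backup's hash map to find what must be copied:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed block size for the sketch; real products vary


def incremental_blocks(volume, prev_hashes):
    """Return the changed (index, block) pairs plus the new hash map.
    Hashing every block stands in for the changed-block-tracking
    drivers real BLI products use below the file system."""
    changed, new_hashes = [], {}
    block_count = (len(volume) + BLOCK_SIZE - 1) // BLOCK_SIZE
    for idx in range(block_count):
        block = volume[idx * BLOCK_SIZE:(idx + 1) * BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        new_hashes[idx] = digest
        if prev_hashes.get(idx) != digest:
            changed.append((idx, block))
    return changed, new_hashes
```

Notice what the structure gives up: the incremental knows which blocks changed, but nothing about which files those blocks belong to, which is why file-level retention is lost.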
Image- or block-based backups present several problems. First, these solutions can only maintain a finite number of incremental backups before either performing another full backup or running a consolidation job, and both of those efforts take time. Additionally, these solutions provide even worse granularity for setting a specific retention policy on file data; essentially, they cannot do it at all. Organizations need to implement another solution to meet their compliance and retention demands.
The Frequency Problem
Another challenge with both the file system walk method and the block image method is the frequency with which the solution can protect data, given the time each pass requires and the limitation on the number of incrementals. New threats like ransomware can strike at any moment, and unstructured data is the prime target. Given the risk, once-a-night backups of unstructured data are no longer acceptable.
The Secondary Storage Problem
Another challenge facing unstructured data protection is the secondary storage requirement. The secondary storage system has to maintain at least one copy of the data on primary storage, and in almost all cases it stores at least two copies. In reality, most organizations find that their backup storage is 5 to 10X the size of primary storage. The ratio can be even worse if organizations are making additional copies for other purposes such as retention or archive.
While secondary storage systems have capabilities like compression and deduplication to alleviate some of this capacity requirement, there is no question that it remains a major issue. The cost of these secondary storage systems is, of course, a real concern, but a bigger concern is the data center floor space they consume. The cost of storage may be continually declining; the cost of a new data center is continually rising.
The Lack of an Exit Strategy
The final problem is that neither of these unstructured data protection methods provides a means of escape. The problem will only get worse as unstructured data grows, and unless the data protection solution can lay the foundation for archiving old data off primary storage, IT will be like a hamster on a wheel, never getting ahead of the problem.
What IT Needs
IT needs a new way to handle the protection of unstructured data. First, unstructured data protection needs to return to its more granular roots. Image backups were a band-aid to solve a performance problem but sacrificed any means of compliance and retention. As both compliance and retention become more critical, lack of those capabilities is no longer acceptable.
Of course, the granular understanding of the files IT is protecting cannot result in week-long backup jobs either. What is needed is an agent-like solution that can monitor the file system and make copies of changing files at specific, narrow intervals. The solution should make these copies to secondary storage or to the cloud, but it should also encrypt those files itself so that they are not exposed by an accidental or deliberate breach of the cloud account.
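The two halves of that idea, a frequent monitoring pass and self-encryption before upload, can be sketched as follows. This is illustrative only: the monitoring here is a simple snapshot diff between polls, and the encryption is a one-time pad stand-in; a production agent would hook file system events and use an authenticated cipher (e.g. AES-GCM) with managed keys.

```python
import os
import secrets


def snapshot(root):
    """Record each file's modification time -- one poll of the agent."""
    state = {}
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            state[path] = os.path.getmtime(path)
    return state


def changed_between(prev, curr):
    """Files that are new or modified since the previous poll."""
    return [p for p, mtime in curr.items() if prev.get(p) != mtime]


def encrypt_for_upload(data: bytes):
    """Demonstration only: a one-time pad (XOR with a random key the
    same length as the data). A production agent would use an
    authenticated cipher such as AES-GCM with managed keys."""
    key = secrets.token_bytes(len(data))
    ciphertext = bytes(a ^ b for a, b in zip(data, key))
    return key, ciphertext
```

Run at a narrow interval, each poll copies and encrypts only the files `changed_between` reports, which is what makes near-continuous protection practical without week-long jobs.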
This type of solution also lays the foundation for archive. Any archive process must start with creating a known good copy of data on a secondary storage device. Once that copy is in place, IT can remove old data, either manually or programmatically, with the comfort of knowing it is stored safely on less expensive storage.
Both legacy and so-called modern solutions for unstructured data protection have run into a perfect storm. Not only are the number of files and their capacity requirements growing, but the demands to ensure data retention or removal based on regulations are becoming more prevalent. Unstructured data protection can no longer remotely access the file system; it must be on the file system, able to interact with it and make copies of changed data more frequently. Unstructured data protection storage also needs to be more native, so that individual policies can be set and data repurposed.
Sponsored by Aparavi