The General Data Protection Regulation (GDPR) forces organizations to evolve from a data protection mindset to a data management mindset. IT can no longer let backups store data on secondary storage as giant blobs of ones and zeros. The process that transfers data must also possess an intimate understanding of the data it is storing.
The Backup Problem
Most organizations count on their backup process not only to recover from data corruption or hardware failure but also for data retention. The problem is that, because of the growth of unstructured data, especially in the number of files, most backup solutions now back up and store that data as images. For legacy backup products, an image-based backup of a large file system with hundreds of thousands of files is actually faster than a file-by-file backup. While individual file recoveries from a particular backup job are possible, searches across backup jobs are difficult.
Even if the files are backed up file-by-file, most backup solutions have relatively rudimentary metadata databases. They typically provide only a file name and media location. Legacy backup products also store data by backup job, with all data backed up during that job’s execution stored together regardless of data type. While searching across a file-by-file backup is possible, removing files from within a job while keeping that job viable is not.
The Archive Problem
An alternative is to not use backup for retention and to store backup jobs for only a few days. Retention is then handled by an archive. Most archives store each file discretely, so finding and removing files is more straightforward. The challenge with this alternative approach is that it requires two time-consuming passes across the file system. Data is also stored twice, once by each process, and in most cases in two separate storage systems. The second pass, performed by the archive software, is not optimized for performance the way the backup pass is. The result is that the archive pass takes even longer to complete.
Additionally, unless data is aggressively removed from primary storage, which many organizations are not willing to do, the archive approach is more expensive and more time-consuming than traditional backups.
Solving GDPR by Integrating Backup and Archive
A more logical approach, since data needs to be both protected and retained, is to integrate the two processes into a data management solution. The transport component of the solution performs a file-by-file backup of the environment, but uses a journaling-like approach so that after the first backup job is complete, subsequent transfers of new or changed data complete quickly.
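To make the idea of quick subsequent transfers concrete, here is a minimal Python sketch of change detection against a stored journal. It is an illustration only, not any product's method: the function name `changed_since_last_run` and the journal layout (path mapped to modification time and size) are assumptions. Note that the sketch still walks the directory tree and compares file metadata; a true journaling approach would instead consume change records from the file system and avoid the walk.

```python
import os


def changed_since_last_run(root, journal):
    """Return files under root that are new or modified since the last call.

    journal is a hypothetical dict mapping path -> (mtime_ns, size).
    It is updated in place so the next call only reports later changes.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            signature = (stat.st_mtime_ns, stat.st_size)
            # A missing or different signature means the file must be transferred.
            if journal.get(path) != signature:
                changed.append(path)
                journal[path] = signature
    return sorted(changed)
```

The first call with an empty journal returns every file, which corresponds to the initial full backup. Later calls return only what has changed, which is why those transfers are short.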
Data is then stored not by job but logically, by file. The solution tracks file versions and builds a rich metadata index of all the files it is maintaining. The software could optionally remove files from primary storage if the organization so chooses. Its data structure also makes it easier for the solution to tier data to the cloud so that on-premises secondary storage doesn’t exceed data center capacity.
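A per-file, versioned metadata index of this kind might look like the following sketch, which uses Python's built-in `sqlite3` module. The table layout, the `data_subject` column, and the function names are all assumptions made for illustration; the article does not specify a schema.

```python
import hashlib
import sqlite3


def open_index():
    """Create an in-memory index with one row per stored file version."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE file_versions (
               path TEXT NOT NULL,
               version INTEGER NOT NULL,
               sha256 TEXT NOT NULL,
               size INTEGER NOT NULL,
               data_subject TEXT,
               PRIMARY KEY (path, version)
           )"""
    )
    return conn


def record_version(conn, path, content, data_subject=None):
    """Index a new version of path and return its version number."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM file_versions WHERE path = ?",
        (path,),
    ).fetchone()
    version = row[0] + 1
    conn.execute(
        "INSERT INTO file_versions VALUES (?, ?, ?, ?, ?)",
        (path, version, hashlib.sha256(content).hexdigest(), len(content), data_subject),
    )
    return version


def versions_for_subject(conn, data_subject):
    """Return (path, version) pairs tagged with the given data subject."""
    rows = conn.execute(
        "SELECT path, version FROM file_versions "
        "WHERE data_subject = ? ORDER BY path, version",
        (data_subject,),
    )
    return rows.fetchall()
```

Because each version is its own row rather than part of a job container, a query such as `versions_for_subject(conn, "john.smith")` can locate every stored copy related to one person without opening any backup job.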
An integrated data management approach means that GDPR’s right to be forgotten requests are easily executed. Removing John Smith’s data from the secondary data store is as easy as removing it from primary storage. In fact, because the data management software has a journal of what is on primary storage, it can remove data from both primary and secondary storage in a single pass. The software could then log the transaction as proof that John Smith’s data has been removed.
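The single-pass removal described above can be sketched in a few lines of Python. Everything here is hypothetical: the stores are modeled as plain dictionaries, and `erase_subject`, the index layout, and the audit record fields are assumptions chosen for illustration rather than a real product interface.

```python
from datetime import datetime, timezone


def erase_subject(subject, index, primary, secondary, audit_log):
    """Remove every file belonging to subject from both stores in one pass.

    index:     dict mapping subject -> set of file paths (assumed layout)
    primary:   dict mapping path -> current file content
    secondary: dict mapping path -> list of stored versions
    audit_log: list that receives one record per erasure request
    """
    paths = sorted(index.pop(subject, set()))
    for path in paths:
        # pop with a default tolerates files already gone from either store.
        primary.pop(path, None)
        secondary.pop(path, None)
    audit_log.append(
        {
            "subject": subject,
            "paths": paths,
            "erased_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return paths
```

The key design point is that one index drives the deletion in both places, so there is no separate search of backup jobs, and the audit record is written as part of the same operation.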
Secondary data, stored granularly, has value beyond GDPR. For example, ransomware files often sit idle for weeks prior to execution. During that idle time they are backed up. An integrated protection and data management solution could leverage threat lists to scan the secondary storage repositories and remove any malware files that have made their way into them, ending ransomware attack loops before they begin.
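As a rough sketch of such a scan, the Python below compares the SHA-256 digest of each stored file against a threat list and removes matches. The repository is modeled as a dictionary, and both `purge_known_malware` and the idea that the threat list is a set of SHA-256 digests are assumptions for illustration; real threat feeds and detection methods vary.

```python
import hashlib


def purge_known_malware(repository, threat_hashes):
    """Delete stored files whose SHA-256 digest appears in threat_hashes.

    repository:    dict mapping path -> file content as bytes (assumed layout)
    threat_hashes: set of lowercase hex SHA-256 digests (assumed feed format)
    Returns the sorted list of removed paths.
    """
    removed = []
    # Iterate over a snapshot of the keys because entries are deleted in the loop.
    for path in list(repository):
        digest = hashlib.sha256(repository[path]).hexdigest()
        if digest in threat_hashes:
            del repository[path]
            removed.append(path)
    return sorted(removed)
```

This kind of targeted removal only works when files are stored individually. In an image-based or job-based backup, the infected file cannot be deleted without affecting the rest of the job.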
Conclusion
Today, there are new demands on data and new threats to it. IT professionals need to rethink the processes that protect and retain the information their organizations create and store. Backup and archive have always been considered separate processes, and while the mantra “backup is not archive” is truer than ever, the concept of “archive can be backup” deserves consideration.