One aspect of the European Union’s (EU) General Data Protection Regulation (GDPR), and of similar regulations like the California Consumer Privacy Act (CCPA), is the “right to be forgotten.” Simply stated, this means that a user or customer of an organization has the right to ask that organization to stop storing their data. While removing data from primary storage is relatively easy, this aspect of these regulations causes a particular problem when it comes to secondary storage such as backups.
Right to Be Forgotten and Backups
The right to be forgotten is particularly troublesome in relation to backups. Most removal requests will involve unstructured data like documents and images. The total capacity of unstructured data sets, as well as the large number of files they contain, leads many backup software developers to back up these data stores as images instead of as individual files. The problem with image backups is that the software loses file-level granularity across backup jobs, meaning that searching all backups for “John Smith’s” data is almost impossible.
Even if this data is backed up file by file, most backup applications are still job based. Removing data from within a job is a very rare capability, and in most cases retention policies have to be set at the job level. A right to be forgotten request would therefore require deleting the entire job, which invalidates the backup and potentially breaks other retention requirements.
Workarounds for Right to Be Forgotten and Backup
There are several proposals for working around the requirements of the right to be forgotten. They all hope that backup data is somehow excluded from the requirement because it is not in production, or in a usable format, until it is restored. This hope is unproven thus far.
Even if backup data is excluded from consideration, backup software vendors still have work to do. Most are promising to deliver a “delete on restore” capability. Delete on restore requires an organization to keep a list of people who have asked to be forgotten, and it is unclear whether keeping such a list is itself compliant. During a restore, the backup software checks whether the data it is restoring belongs to a user on the list. If it does, the software restores that data to a “null” device, essentially making sure that the user’s data never comes back into production. It is unclear what impact checking every file being restored will have on restore performance, but it is reasonable to assume the impact will be significant. It is important to note that, at this moment, no vendor provides this capability and adding it won’t be an easy development effort.
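The delete on restore check described above can be sketched in a few lines. This is only an illustration, not any vendor's implementation: the `BackupEntry` type, the `owner_id` field, and the `delete_on_restore` function are hypothetical names, and the sketch assumes the backup catalog can attribute every file to an owner, which image-based backups typically cannot.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class BackupEntry:
    """One file inside a backup job (hypothetical catalog record)."""
    path: str
    owner_id: str  # assumption: the catalog knows who each file belongs to
    data: bytes


def delete_on_restore(
    entries: Iterable[BackupEntry],
    forgotten_owner_ids: Set[str],
) -> Tuple[List[BackupEntry], List[str]]:
    """Split a restore stream into entries to write and paths to discard.

    Entries owned by a forgotten user are never returned for writing,
    which plays the role of the "null" device in the text.
    """
    to_restore: List[BackupEntry] = []
    discarded_paths: List[str] = []
    for entry in entries:
        # This membership test runs once per file, which is where the
        # restore-time overhead mentioned above would come from.
        if entry.owner_id in forgotten_owner_ids:
            discarded_paths.append(entry.path)
        else:
            to_restore.append(entry)
    return to_restore, discarded_paths
```

Note that the sketch only works because the forgotten list exists and is consulted on every restore, which is exactly the open compliance question.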
Another alternative is to restore all data to a quarantined area, then remove all data belonging to users requesting to be forgotten before moving the data back into production. This method is more readily available today but is full of concerns. First, it assumes it is acceptable to keep a list of users requesting to be forgotten. Second, it assumes that it is acceptable to restore all data to a quarantined area. Neither assumption is proven acceptable at this point.
Once the restore to the quarantined area is complete, this method also assumes that the organization has the tools to scan the restored data and find what should be removed. Finally, it means that every restore becomes a two-step process: IT restores the data to the quarantined area and then has to move it again into production. This doubles the time to restore even without factoring in the time to scan the data, which could easily triple restore times.
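As a rough illustration of the second step, the following sketch scans a quarantine directory, deletes files that belong to forgotten users, and moves everything else into production. The function name `scrub_and_promote` and the `owner_of` callback are hypothetical; how an organization actually maps a file to its owner is the hard part and is simply assumed here.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Set


def scrub_and_promote(
    quarantine: Path,
    production: Path,
    owner_of: Callable[[Path], str],  # assumption: some tool can name the owner
    forgotten: Set[str],
) -> List[Path]:
    """Delete forgotten users' files from quarantine, move the rest to production.

    Returns the relative paths of the files that were removed.
    """
    removed: List[Path] = []
    # Materialize the file list first so the tree is not modified mid-walk.
    files = sorted(p for p in quarantine.rglob("*") if p.is_file())
    for src in files:
        rel = src.relative_to(quarantine)
        if owner_of(src) in forgotten:
            src.unlink()  # this data never reaches production
            removed.append(rel)
        else:
            dst = production / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))  # second data movement
    return removed
```

Every file is touched twice, once by the restore into quarantine and once by the move or delete here, which is the source of the doubled restore time.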
Another alternative is to maintain backups for only a very short period of time, five days for example. Data retention then becomes the responsibility of either production storage or a separate archive process. Retaining data via production storage means never allowing the deletion of production data and possibly maintaining infinite version tracking capabilities, driving production storage capacities (and spending) to record levels.
The other option, archiving everything, means that the organization needs to implement an archiving solution. Storage Switzerland finds that most organizations do not have a formal archiving process in place today. Most organizations use backup as their archive, which won’t work for the reasons described above. This approach also requires archiving all data, not just old data. That requirement means the archive solution will need to scan the environment almost as frequently as the backup solution, which impacts overall performance.
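The short-retention idea amounts to expiring whole backup jobs once they pass a fixed age. A minimal sketch follows, assuming a five-day window as in the example above; `BackupJob` and `split_by_retention` are illustrative names, and treating a job exactly at the cutoff as still retained is an arbitrary choice made for the sketch.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple, Tuple


class BackupJob(NamedTuple):
    """A completed backup job (hypothetical record)."""
    job_id: str
    completed_at: datetime


def split_by_retention(
    jobs: Iterable[BackupJob],
    now: datetime,
    retention_days: int = 5,  # the five-day example from the text
) -> Tuple[List[BackupJob], List[BackupJob]]:
    """Return (jobs to keep, jobs to expire) for a fixed retention window."""
    cutoff = now - timedelta(days=retention_days)
    keep: List[BackupJob] = []
    expire: List[BackupJob] = []
    for job in jobs:
        # Assumption: a job completed exactly at the cutoff is still kept.
        (keep if job.completed_at >= cutoff else expire).append(job)
    return keep, expire
```

Under this policy, data deleted from production today would no longer exist in any backup once the window has passed, without the backup software ever needing to remove individual files from a job.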
Solving the Right to Be Forgotten Problem
The solution to the right to be forgotten problem is multi-faceted. First, backups of unstructured data need to be done file by file, not as images. Vendors need to develop technology that enables file-by-file backup without greatly impacting the time it takes to protect data. Second, backup and archive need to integrate into data management. In this model, backup becomes the method by which data is transferred, but archive is the manner in which it is managed. The data management software then provides the ability to search for and remove data directly from the archive/backup copy.
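To make the search-and-remove idea concrete, here is a minimal sketch of a file-level catalog that spans backup jobs. It is an assumption about how such data management software might be organized, not a description of a shipping product; `FileCatalog` and its methods are hypothetical, and a real system would also have to delete the underlying stored data, not just the index entries.

```python
from typing import Dict, List, Tuple


class FileCatalog:
    """In-memory index of which files, owned by whom, sit in which backup job."""

    def __init__(self) -> None:
        # job_id -> {file path -> owner_id}
        self._jobs: Dict[str, Dict[str, str]] = {}

    def record(self, job_id: str, path: str, owner_id: str) -> None:
        """Register one file captured by a file-by-file backup job."""
        self._jobs.setdefault(job_id, {})[path] = owner_id

    def search(self, owner_id: str) -> List[Tuple[str, str]]:
        """Find every (job_id, path) belonging to one person, across all jobs."""
        return sorted(
            (job_id, path)
            for job_id, files in self._jobs.items()
            for path, owner in files.items()
            if owner == owner_id
        )

    def forget(self, owner_id: str) -> List[Tuple[str, str]]:
        """Remove one person's files from every job, leaving the jobs intact."""
        hits = self.search(owner_id)
        for job_id, path in hits:
            del self._jobs[job_id][path]
        return hits

    def files_in_job(self, job_id: str) -> List[str]:
        """List the paths still held for a given job."""
        return sorted(self._jobs.get(job_id, {}))
```

The point of the design is that `forget` works at file granularity: other users' files in the same job, and the job's own retention policy, are left untouched.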
Conclusion
The right to be forgotten creates numerous challenges for an organization looking to comply with GDPR and the similar regulations appearing at the US state level. These regulations are coming at a time when the capacity, quantity and overall value of unstructured data are at an all-time high. Legacy data protection solutions are too focused on backup, and legacy archive solutions require too many resources to constantly scan for new or modified files. The answer is to integrate backup and archive into a single intelligent data management solution, so that the archive process leverages the backup process for data transfer and then manages the data and fulfills right to be forgotten requests.