For decades, data center best practice was to keep backup and archive separate. “Backup is not archive” was the mantra. In reality, most data centers ignored the mantra and used their backup process for all of their data retention. Today, however, as data centers deal with unprecedented growth in their unstructured data sets, it may be time to change the mantra to “archive is better than backup.” Organizations need to take an archive-first approach to data protection.
Is Backup Dead?
Backup is a process that copies data from primary storage to some form of secondary storage. As unstructured data continues to grow, organizations have changed their backup software to perform an image backup of the entire file system instead of a file-by-file backup. Subsequent backups only need to back up changed blocks, further lowering backup times.
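To make the changed-block idea concrete, here is a minimal sketch in Python. It assumes a fixed block size and detects changes by comparing per-block hashes against a baseline. The names `BLOCK_SIZE`, `block_hashes` and `changed_blocks` are hypothetical, and real products generally track changed blocks through snapshots or drivers rather than re-hashing the whole volume, so treat this as an illustration only.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, purely illustrative


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each fixed-size block of a volume image."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(baseline: list[str], current: list[str]) -> list[int]:
    """Return indices of blocks that differ from the baseline or are new."""
    return [
        i for i, digest in enumerate(current)
        if i >= len(baseline) or baseline[i] != digest
    ]


if __name__ == "__main__":
    old_image = b"a" * 8192
    new_image = b"a" * 4096 + b"b" * 4096 + b"c" * 100
    # Only the second block (modified) and third block (new) need copying.
    print(changed_blocks(block_hashes(old_image), block_hashes(new_image)))
```

Note that the sketch knows nothing about files. It only knows block positions, which is exactly why an image backup struggles to answer questions about individual files.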
The problem with the image approach is that, while vendors provide individual file restores, the administrator needs to know exactly which backup set contains the file they are looking for. The software doesn’t provide a way for organizations to search for specific files across backup sets.
The lack of granular file knowledge creates challenges for organizations looking to maintain compliance with data privacy legislation like the European Union’s General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA). Without knowledge of specific files, complying with the “right to be forgotten” sections of the legislation is very difficult.
Another challenge is that image-based backups are difficult to archive, which makes it difficult to lower secondary storage costs. The baseline image must always be available so that subsequent changed blocks can be compared against it. Most software solutions using block-level incremental backup also have a limit on how many iterations away from the baseline image they can track before performance is impacted.
The Archive Alternative
An alternative is to copy all unstructured data to an archive and let archive software manage the data. Most archive solutions have a very specific understanding of every file, and every version of every file, in the archive. Finding and removing data in response to a “right to be forgotten” request is straightforward. Archive solutions can also manage where the archived data is stored, and most support multiple storage tiers so that data moves to less expensive tiers, and becomes less expensive to store, as it ages. These solutions can also remove old data from primary storage, lowering primary storage costs.
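As a rough illustration of age-based tiering, the Python sketch below picks a storage tier from the time since a file was last accessed. The tier names, the age thresholds, and the `ArchivedFile` and `choose_tier` names are all assumptions made for this example. They are not taken from any particular archive product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ArchivedFile:
    path: str
    last_accessed: datetime


# Hypothetical policy: (maximum age, tier name), most expensive tier first.
TIER_POLICY = [
    (timedelta(days=90), "disk"),
    (timedelta(days=365), "object-storage"),
]
COLDEST_TIER = "tape"  # anything older than every threshold lands here


def choose_tier(archived: ArchivedFile, now: datetime) -> str:
    """Return the cheapest tier the file's age qualifies it for."""
    age = now - archived.last_accessed
    for max_age, tier in TIER_POLICY:
        if age <= max_age:
            return tier
    return COLDEST_TIER


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    report = ArchivedFile("/projects/report.docx", datetime(2023, 6, 1))
    print(choose_tier(report, now))  # 214 days old, so "object-storage"
```

Because the policy works on individual file records, the same records can drive searches and deletions, which is what makes “right to be forgotten” requests manageable.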
The Archive Problem
The problem with the archive is how to get ALL data to it, quickly and consistently. Many archive solutions don’t have a transfer mechanism: they count on the administrator to manually move data to them. Some archive solutions do have an automatic file transfer capability, but they were not designed to transfer files en masse like backup solutions are. Also, archive solutions typically have no communication path to the backup process. For example, the archive can’t confirm that the backup process has X number of copies of a file prior to removing that file from primary storage.
Integrating Backup and Archive
The solution is to integrate backup and archive. Vendors need to create solutions that perform high-speed, file-by-file backups so data is protected. Then they need to add an archive function that builds a rich metadata index of the data under management so that it is easy to search for data within the repository. The archive function also enables the movement of data across secondary repositories and, eventually, off primary storage, driving down costs. IT can move old data from primary storage with the confidence of knowing it is protected X number of times.
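A minimal sketch of how such an integrated index might behave is shown below, again in Python. It assumes each file-level backup registers its copy in a shared metadata index, which can then answer searches and confirm that the required number of copies exists before a file is removed from primary storage. `MetadataIndex`, `FileRecord`, the repository names and the `required_copies` parameter are hypothetical and stand in for whatever a real product would provide.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    path: str
    owner: str
    # Names of the secondary repositories that hold a copy of this file.
    copies: set[str] = field(default_factory=set)


class MetadataIndex:
    def __init__(self, required_copies: int = 2):
        self.required_copies = required_copies  # the "X" copies policy
        self.records: dict[str, FileRecord] = {}

    def record_copy(self, path: str, owner: str, repository: str) -> None:
        """Called by the backup process after each file-level copy."""
        record = self.records.setdefault(path, FileRecord(path, owner))
        record.copies.add(repository)

    def search(self, owner: str) -> list[str]:
        """Find every indexed file belonging to one owner."""
        return sorted(p for p, r in self.records.items() if r.owner == owner)

    def safe_to_remove_from_primary(self, path: str) -> bool:
        """True only when enough distinct copies are known to exist."""
        record = self.records.get(path)
        return record is not None and len(record.copies) >= self.required_copies


if __name__ == "__main__":
    index = MetadataIndex(required_copies=2)
    index.record_copy("/home/alice/plan.xlsx", "alice", "disk-pool")
    print(index.safe_to_remove_from_primary("/home/alice/plan.xlsx"))  # one copy so far
    index.record_copy("/home/alice/plan.xlsx", "alice", "cloud")
    print(index.safe_to_remove_from_primary("/home/alice/plan.xlsx"))  # two copies now
```

The design point is that one index serves both roles: the backup side writes to it, and the archive side reads from it before moving or deleting anything.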
The integration of backup and archive enables the organization to protect its data, comply with data privacy regulations, and reduce its investment in both primary and secondary storage.