Backup is the process of making copies of an organization’s data so that, if the data is corrupted, erased or becomes unavailable for any reason, IT can use the copies to restore it. The primary goal of backup should be to copy the most recent versions of all the organization’s data, and the goal of most recoveries should be to recover the most recent version of a file. The objective of an archive is the exact opposite: to preserve specific information for a specified period of time, often driven by regulations or corporate governance.
Backup is Not Archive
The reality is that most IT organizations look to their backup solution to be their archive solution. To some extent, modern backup architectures can accomplish that task, thanks to disk backup appliances and increasingly powerful backup software. The problem is that forgoing a formal archive strategy makes the retention of data far more difficult and, ironically, makes the backup process far more complex.
The key difference is how backups and archives manage data. Backups are deliberately designed to make as many copies of data as possible, while an archive is supposed to make one copy of data, plus a protection copy. Ideally, if data has not changed in a given period of time, the archive migrates it off the production storage system, thus freeing up storage capacity and reducing the number of files that the backup process needs to protect.
Another difference is how the data that backup or archive has under management is presented back to the user. Most backup solutions organize data by the date the copy was made and the job that performed the protection. Archive solutions don’t work on a job basis; instead, they archive a file when certain conditions are met, the simplest example being that the data has not changed in X period of time.
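To make the condition-based approach concrete, here is a minimal Python sketch of the simplest rule described above: select files that have not changed in X days. The function name find_archive_candidates and the min_age_days parameter are hypothetical, chosen for illustration only; a real archive product would apply its own policy engine and would likely consider more than the modification time.

```python
import os
import time
from pathlib import Path


def find_archive_candidates(root, min_age_days, now=None):
    """Return files under root not modified in the last min_age_days days.

    Illustrative only: uses the filesystem modification time as a stand-in
    for "the data has not changed in X period of time".
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds in a day
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_mtime < cutoff:
                candidates.append(path)
    return sorted(candidates)
```

Note that nothing here depends on a backup job or a backup date: the file's own state decides whether it qualifies, which is the distinction the paragraph above draws.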
Backup Loves Archive
If the organization creates an archive process that removes unchanging data from the backup process (or preferably from the primary store altogether), then the amount of data that needs protection drops significantly. Considering that most organizations treat less than 15 percent of their total data set as active, this represents a dramatic reduction in the amount of data that the backup process needs to copy and manage. As an aside, it also means the organization should need to buy less primary storage.
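A quick worked example shows the scale of that reduction. The sketch below is plain arithmetic in Python; the 100 TB total is an assumed figure for illustration, and the 15 percent active share is the upper bound quoted above, so the real backup footprint could be smaller still.

```python
def backup_footprint(total_tb, active_fraction):
    """Split a data set into the part backup must protect and the part
    an archive could take over. Both inputs are illustrative assumptions."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    active_tb = total_tb * active_fraction
    archivable_tb = total_tb - active_tb
    return active_tb, archivable_tb


# Assumed 100 TB data set with 15 percent of it active
active, archivable = backup_footprint(100, 0.15)
print(f"backup protects about {active:.0f} TB, archive takes about {archivable:.0f} TB")
```

Under those assumptions, the backup process handles roughly 15 TB instead of 100 TB, and the remaining 85 TB or so no longer has to be copied on every backup cycle.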
Merging Backup and Archive
While purists want backup and archive to be totally separate processes, one cannot deny that there is some redundancy in tasks, namely the creation of the copies of data and the storage that the data consumes. Ideally a vendor should be able to merge the two processes. Essentially the backup software creates the copy and then, as the data meets archive parameters (not changing in X backups), it hands off the management to the archive process. The archive process could then move or copy the data to a dedicated archive store. It could also, at that point, create more sophisticated metadata about that data to facilitate a more powerful search in the future.
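The hand-off described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular vendor implements it: the BackupCatalog class, its unchanged_threshold field (the "X backups" in the text) and the use of a SHA-256 digest to detect change are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BackupCatalog:
    """Tracks how many consecutive backups each file has gone unchanged."""
    unchanged_threshold: int  # hypothetical "X backups" archive parameter
    last_digest: dict = field(default_factory=dict)
    unchanged_runs: dict = field(default_factory=dict)
    archived: set = field(default_factory=set)

    def record_backup(self, files):
        """Process one backup run; files maps path -> content bytes.

        Returns the paths handed off to the archive process in this run.
        """
        handed_off = []
        for path, content in files.items():
            if path in self.archived:
                continue  # already under archive management
            digest = hashlib.sha256(content).hexdigest()
            if self.last_digest.get(path) == digest:
                self.unchanged_runs[path] = self.unchanged_runs.get(path, 0) + 1
            else:
                # New or changed file: restart the unchanged counter
                self.unchanged_runs[path] = 0
                self.last_digest[path] = digest
            if self.unchanged_runs[path] >= self.unchanged_threshold:
                self.archived.add(path)
                handed_off.append(path)
        return handed_off
```

In this sketch the backup run does the copying and change detection once, and the archive side only receives the list of paths that crossed the threshold. That is the redundancy the merged approach removes: the data is read once rather than scanned separately by two products.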
Backup and archive should be two separate processes in the data center and all data centers will benefit from having a strong archive process. Doing so should reduce the amount of data needing protection and potentially reduce primary storage costs. But there is no reason that the redundant components of these two processes could not merge to ease network congestion and increase data efficiency.
Sponsored by Commvault