The case for archiving is compelling. Most organizations don’t access data once it is older than 90 days. In most cases, inactive data can consume 85% or more of the organization’s primary storage capacity. Moving inactive data to secondary storage or to the cloud, instead of continuing to buy more primary storage, can save an organization a significant amount of money. It can also lower data protection costs and limit the organization’s exposure to malware and other cyber-attacks. With all that archiving has going for it, why doesn’t every organization have an aggressive data archiving strategy?
There are many reasons why organizations don’t establish a formal data archiving strategy. The list is so long that many archiving vendors have given up calling the process archiving and instead call it data management or cloud storage enablement. But if an organization is moving inactive data to improve retention and lower costs, it IS archiving. Which features vendors add to their archive solutions is a matter of debate, but at its core, saving money is almost always the number one motivation.
The Stub File Problem
A key reason that organizations don’t want to move data off primary storage is the fear that the moment they move files that have not been used for years, those files will suddenly become active. It is the “Murphy’s Law” of archiving. To get around this problem, most archive solutions create a mechanism to automatically recall files that the solution moved. One of the most common mechanisms is stub files. Stub files are small, 1 KB files that are placed in a file’s original location and point to the file’s new location. They seem like a good idea, but they cause so many problems that they can be an archive project killer.
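To make the mechanism concrete, here is a minimal sketch of stub-based archiving and recall. It is an illustration only: the JSON stub layout, the `STUB_MARKER` value, and the function names are assumptions made for this example, not any vendor’s format, and real products typically hook recall into the filesystem rather than into a helper function.

```python
import json
import shutil
from pathlib import Path

STUB_MARKER = "ARCHIVE-STUB-V1"  # hypothetical marker, not a real vendor format


def archive_with_stub(path: Path, archive_dir: Path) -> Path:
    """Move `path` into `archive_dir` and leave a small stub in its place."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / path.name  # name collisions are ignored in this sketch
    shutil.move(str(path), str(target))
    # The stub records only where the real file now lives.
    path.write_text(json.dumps({"marker": STUB_MARKER, "location": str(target)}))
    return target


def read_stub(path: Path):
    """Return the archived location if `path` is a stub, otherwise None."""
    try:
        data = json.loads(path.read_text())
    except (ValueError, OSError):
        return None  # not JSON or not readable as text, so not one of our stubs
    if isinstance(data, dict) and data.get("marker") == STUB_MARKER:
        return Path(data["location"])
    return None


def open_with_recall(path: Path) -> bytes:
    """Read a file, recalling it from the archive first if it is a stub."""
    location = read_stub(path)
    if location is not None:
        path.unlink()                          # remove the stub
        shutil.move(str(location), str(path))  # bring the real file back
    return path.read_bytes()
```

In this sketch the stub is the only record of where the file went, and any caller that reads through `open_with_recall` brings the file back to primary storage, whether or not a person actually wanted it.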
The first problem with stub files is that they are vulnerable to users. Users may go years without ever cleaning up their home directories, but put stub files in those directories and instantly the user becomes a file management disciple. The problem is that once a stub file is deleted, finding the original file becomes a very tedious manual process.
Another challenge with stub files is how other processes interact with them. A backup solution during a backup, or an anti-virus solution during a scan, may accidentally trigger a recall of the original file by opening the stub file.
Even if the users, backup and anti-virus applications can learn to ignore the stub files, there are other issues with their use. The biggest is that a stub file is still a file, which means it takes up space on the filesystem and needs managing. Many data centers are forced to buy additional NAS or file servers not because they’ve reached their capacity limits but because the current filesystem can’t support the number of files stored. A stub-based archive won’t alleviate a high file count issue. Also, even if the backup and anti-virus applications can learn to ignore stub files, they still need to process each stub file in order to ignore it. In actuality, these programs don’t learn to ignore the files; instead, they learn not to operate on them as they examine them. Examining the stub files, however, still takes time.
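The file count point can be checked with simple arithmetic. The sketch below assumes each archived file is replaced by a 1 KB stub; the `FileInfo` type and the `after_stub_archive` function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass

STUB_SIZE = 1024  # bytes; assumes the small 1 KB stub described earlier


@dataclass
class FileInfo:
    size: int       # bytes on primary storage
    inactive: bool  # True if the file qualifies for archiving


def after_stub_archive(files):
    """Return (file_count, bytes_used) on primary storage after stubbing."""
    count = len(files)  # every archived file leaves a stub behind
    used = sum(STUB_SIZE if f.inactive else f.size for f in files)
    return count, used
```

For 100 files of 10 MB each, 85 of them inactive, the function reports a large drop in bytes used while the file count stays at 100, so a filesystem that is limited by file count gets no relief.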
Metadata Management – The Stub File Solution
The solution to the stub file problem is to manage metadata centrally. Metadata is the data about the data. It includes information like date created, date modified, date last accessed, data owner and data location. A stub-based archive solution also uses metadata, but only to decide whether it should move a file and create a stub file. Centralizing metadata means keeping an updated copy of the metadata on a dedicated appliance or on a switch-like device that sits inline on the network. Once the metadata is centralized, the device routes users and applications through it for file access and points them to the actual location of the file. The file can move multiple times, and the user never knows it. Most importantly, no stub files are required, making the archive process much more straightforward and more reliable.
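The idea can be sketched as a small in-memory catalog. Everything here is an assumption made for illustration: the `MetadataCatalog` and `FileRecord` names, the `"primary"` and `"cloud"` location labels, and the 90-day default, which is borrowed from the inactivity figure cited earlier. A real appliance would also move the data itself; this sketch only updates the catalog entry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    owner: str
    last_accessed: datetime
    location: str  # where the data currently lives, e.g. "primary" or "cloud"


class MetadataCatalog:
    """Hypothetical central catalog mapping logical paths to locations."""

    def __init__(self):
        self._records = {}

    def register(self, path, record):
        self._records[path] = record

    def resolve(self, path, now):
        """Return the file's current location and record the access time."""
        record = self._records[path]
        record.last_accessed = now
        return record.location

    def archive_inactive(self, now, max_age=timedelta(days=90), target="cloud"):
        """Relocate files not accessed within max_age; return the moved paths."""
        moved = []
        for path, record in self._records.items():
            if record.location != target and now - record.last_accessed > max_age:
                record.location = target  # only the catalog entry changes
                moved.append(path)
        return moved
```

Because clients always ask the catalog where a file lives, the path they use never changes and nothing is left behind in the original directory for a user to delete or a scanner to open.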
To learn how a metadata management approach enables hybrid cloud storage for archiving and metadata acceleration, register to watch our on-demand webinar “The Hybrid Cloud Data Gravity Problem and How to Fix It.” All registrants get an exclusive copy of Storage Switzerland’s latest eBook “Metadata is Breaking Hybrid Cloud Storage.”