So many people create files they use once and never use again. They create a spreadsheet for this afternoon’s meeting to support their belief that the department needs more widgets. Once they get approval for the widgets, they don’t need the spreadsheet anymore – but they don’t delete it. Five years later, that spreadsheet is still sitting on the file server taking up space on disk and in the backup system. To say this type of behavior could be optimized is putting it mildly.
Of course, the problem is not limited to Word, Excel and PowerPoint files. Windows file servers also store log files, CAD files, video surveillance footage and other types of unstructured data. But just like office productivity files, this data often follows the same pattern: a short create/modify cycle followed by a long dormant period before it needs to be accessed again, if at all.
Identify
The idea is relatively simple: identify unused files and put them somewhere else while making it look like you didn’t move them at all. The implementation of this idea is somewhat more complex. This is an ongoing problem, so IT can’t solve it in a single project. It needs to be handled automatically in order to continually identify new candidates for optimization and make the whole process seamless to the end user.
A file goes through a lifecycle of importance, starting with its initial creation. During that period the file is used and looked at frequently, such as when someone reviews security camera footage of last night’s break-in. A file server is the perfect place for unstructured data in this phase of the lifecycle. But typically within just a few days, the importance of such files declines and users will probably never access them again. At that point, a more appropriate place for these files would be some type of secondary storage system, such as object storage or cloud storage. These systems are more cost-effective and better suited for long-term data retention.
The most common way for users to identify files that are no longer in use is to look at access time. If a file hasn’t been accessed in six months, it’s probably a good candidate to be moved. Different organizations may have a more aggressive timeframe, such as migrating files that haven’t been looked at for over 30 days. Others may be more conservative and prefer to leave files on primary storage even if they haven’t been looked at for a year. Pick a time that is appropriate for your business.
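As a rough illustration of the access-time rule, the following Python sketch walks a directory tree and reports files that have not been accessed within a chosen number of days. The function name `find_stale_files` and its parameters are hypothetical, not part of any product. It also assumes the file system actually records last-access times; some systems disable or defer those updates, in which case access time is not a reliable signal.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_idle_days, now=None):
    """Yield (path, idle_days) for files under root whose last access
    is older than max_idle_days. Hypothetical helper for illustration."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                atime = path.stat().st_atime  # last access time, in seconds
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if atime < cutoff:
                yield path, (now - atime) / SECONDS_PER_DAY
```

Calling `find_stale_files("/path/to/share", 180)` would list candidates under a six-month policy; changing the second argument to 30 or 365 expresses the more aggressive or more conservative policies described above.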
Since files are created every day, each day also adds new files to the list of files that can be migrated to less expensive storage. This is why the process must be automated. Manually running searches for infrequently accessed files is a waste of a storage administrator’s time. Such ad hoc searches also use far more computing and I/O resources than an automated system that keeps track of all the files for you.
Decide Where to Move
Once you’ve identified the candidates that can be moved, you’re going to need a place to move them. The most sensible place to move such files these days is some type of object storage system. There are multiple reasons for this, including cost, data integrity, and searchability. Object storage systems are typically built on scale-out designs that are less expensive than traditional storage. Each object is given a cryptographic hash that can be used to easily locate the object during a search operation, and it can also be used to verify the integrity of the object both during retrieval and proactively. An automated process can periodically check each object, recompute its cryptographic hash, and compare it against the original. Any change in the hash indicates corruption in the object, and the object can be replaced with another copy. This makes object storage a great target for long-term storage of reference data.
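The hash-and-compare check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each replica is modeled as an in-memory dictionary keyed by the SHA-256 digest of the object, and the names `put` and `scrub` are hypothetical. Real object stores implement this internally and may use different hash functions and repair strategies.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest used as the object's key."""
    return hashlib.sha256(data).hexdigest()


def put(replicas, data: bytes) -> str:
    """Store data in every replica under its hash and return the key."""
    key = digest(data)
    for replica in replicas:
        replica[key] = data
    return key


def scrub(replicas):
    """Re-hash every object; if a copy no longer matches its key,
    replace it from a replica whose copy still verifies.
    Returns the list of keys that were repaired."""
    repaired = []
    for replica in replicas:
        for key, data in list(replica.items()):
            if digest(data) == key:
                continue  # object is intact
            good = next(
                (r[key] for r in replicas
                 if r is not replica and key in r and digest(r[key]) == key),
                None,
            )
            if good is not None:
                replica[key] = good
                repaired.append(key)
    return repaired
```

Because the key is derived from the content, any corruption makes the stored bytes disagree with their own key, which is what lets the check run proactively without consulting the original file.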
Seamless is Critical
Once you’ve identified the candidates and where you’re going to send them, it’s important to make this move seamless to the end user. You’re not going to tell end users that if they can’t find a file, it’s probably been moved to the object storage system and they should search for it there. Instead, you can use some type of pointer or reference system so that when a user accesses a migrated file, it is automatically retrieved from secondary storage and placed back on primary storage. These pointer or reference systems work in various ways, but the important thing is their seamless nature. Retrieval of a migrated file should be invisible to the end user. Also make sure that whatever system you use will keep working long-term, even if you make changes to your object storage system.
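To make the pointer idea concrete, here is a minimal Python sketch in which a local directory stands in for the object store and a small JSON "stub" file stands in for the pointer. Everything here is an assumption for illustration: the `.stub` suffix, the `migrate` and `open_file` names, and the explicit recall call are hypothetical. Real products typically integrate with the file system so that the recall happens without the application calling anything special.

```python
import hashlib
import json
from pathlib import Path

STUB_SUFFIX = ".stub"  # hypothetical marker for a migrated file


def migrate(path: Path, archive: Path) -> Path:
    """Copy the file into the archive under its content hash,
    then replace it on primary storage with a small stub."""
    data = path.read_bytes()
    key = hashlib.sha256(data).hexdigest()
    archive.mkdir(parents=True, exist_ok=True)
    (archive / key).write_bytes(data)
    stub = path.with_name(path.name + STUB_SUFFIX)
    stub.write_text(json.dumps({"key": key, "size": len(data)}))
    path.unlink()  # free the space on primary storage
    return stub


def open_file(path: Path, archive: Path) -> bytes:
    """Read a file, recalling it from the archive if only a stub remains."""
    if path.exists():
        return path.read_bytes()
    stub = path.with_name(path.name + STUB_SUFFIX)
    ref = json.loads(stub.read_text())
    data = (archive / ref["key"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != ref["key"]:
        raise IOError("archived copy failed its integrity check")
    path.write_bytes(data)  # place the file back on primary storage
    stub.unlink()
    return data
```

The stub records only the object's key, not where the archive lives. That is one way to keep references valid if the archive is later relocated, since only the `archive` argument has to change.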
Can You Do It Yourself?
The problem is that performing all these tasks manually is nearly impossible given the number of files most organizations store. A software solution that integrates both with the operating system of the source file server and with the target object storage system makes this process far less painful. It will automatically identify the files that need to be moved, move them, create some type of reference to them so their movement will never be noticed, bring them back to primary storage if they are ever accessed, and maintain the references if the archive copy is ever moved. This is simply too many steps for the average person to automate.
Some might argue they accomplish this by occasionally looking for infrequently accessed files, backing them up, and then deleting them. Besides the fact that this makes them incredibly difficult to find when an end user is looking for them, it’s a misuse of what a backup system is for. Backup systems are designed to restore the system the way it looked yesterday, not to hold on to unused files. That is what an archive system is for.
Don’t Forget Backups
Please do not use your backup system to save unused files. Use an automated system that identifies such files, moves them to secondary storage, and brings them back transparently if a user accesses them. This method doesn’t solve the never-ending problem of users refusing to delete their data, but at least it minimizes the impact on the cost of the production system and of the backup system that protects it.
The growth of unstructured data on Windows file servers continues unchecked in most data centers. You have three choices: force people to start deleting things with quotas, manually move things to archive storage and manage that process yourself, or use some type of automated system to manage that process. The first two choices often cost more than they save, but the last choice seems to make sense. Automatically move unused files out to storage that’s so cheap and self-managing you can afford to ignore it. It may not be a total solution, but it’s the best we have at this moment.
Sponsored by Caringo