What’s The Best Way To Archive Data to the Cloud?

Posted on December 5, 2017 by George Crump

Cloud Storage is an ideal target for the dormant data that is clogging up primary storage systems. For organizations looking to archive their dormant data to the cloud, there are plenty of vendors offering a solution. But these solutions are not created equal, and selecting the right one can mean the difference between fully taking advantage of the cloud or just using the cloud as a digital dumping ground.

Why Archive to the Cloud?

The concept of archiving has been around for as long as there have been data centers. Archiving generally serves two purposes. The first is to secure and preserve data critical to the organization to meet legal or regulatory retention requirements. The second is to slow the growth of data on primary storage. Most of that growth is in the form of unstructured data like images, audio, and video, as well as general-purpose files. These data types are ideally managed by some sort of archive process.

The historical challenge with archiving, and why most organizations don’t do it, is where to put all the data. It is a generally accepted belief that somewhere between 80 and 95% of an organization’s data has not been accessed in the last 90 days, and thus is a candidate for archiving.

The problem is that, at least on day one, this data is already stored on storage that has been bought and paid for. It doesn’t make a lot of sense for the organization to buy an archive storage system that has capacity almost the same size as the primary data center’s capacity, plus buy another system to store the DR copy.

A better practice is for the organization to buy only the amount of archive capacity that it needs to avert the next storage system expansion or upgrade. Unfortunately, most archive storage solutions are sold with a very large initial storage capacity. A 100TB initial purchase is not uncommon.

Enter cloud storage. Because cloud storage can be bought incrementally, a terabyte at a time, the organization can archive to the cloud as needed to prevent having to buy a storage system upgrade.

How to Archive to the Cloud

There are several methods to archive to the cloud. The most common is some sort of NAS gateway, which allows a Windows or Linux server to “see” the cloud as an SMB or NFS mount point. Another is a file-system approach that overlays the storage infrastructure, creating new pools of storage that may have a mix of on-premises NAS and cloud object storage. Both of these solution types typically include some form of data management capability that automatically moves data based on policies IT creates. An example policy might be, “Move all data that hasn’t been accessed in the last 90 days to cloud storage.”
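To make the example policy concrete, here is a minimal Python sketch of the selection step only: it walks a directory tree and lists the files whose last access time is older than 90 days. The function name and parameters are made up for illustration and do not come from any particular product. It also assumes the file system is recording access times; on volumes where access-time updates are disabled or relaxed, a real policy engine would need another signal, such as modification time.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_archive_candidates(root, max_idle_days=90, now=None):
    """Return (path, size) for files under root not accessed in max_idle_days.

    Hypothetical helper for illustration: it only selects candidates and
    does not move any data.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if info.st_atime < cutoff:
                candidates.append((path, info.st_size))
    return candidates
```

Summing the sizes in the returned list gives a rough idea of how much primary capacity the policy would free before any data is actually moved.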

The Problem with Today’s Cloud Archive Solutions

A key problem with both archive solution types is they create a separate storage area that IT needs to maintain and monitor. Data has to be moved from the primary storage systems to these storage systems before the policy can take effect. Or IT needs to decide to move specific shares to the secondary storage system and have them managed there. All data moving to the archive, in this case cloud storage, has to pass through this gateway, which creates a bottleneck.

Another problem is that, by definition, these archive solutions present an SMB or NFS mount. That means they come with all the limitations of a network file system vs. a local file system, like NTFS or one of the Linux file systems. A network mountable file system typically provides the lowest common denominator of capabilities to increase compatibility with the various devices that will access it. That means that, as part of the conversion, many of the attributes of the file may be lost.

Transparent recall of archived data presents another problem. Most solutions use a stub file technology when data is moved from on-premises storage to a local archive, but few support stubbing to a cloud archive. These stub files automatically recall the file when it is accessed by a user or application. The stub files are fragile but very important. If they are deleted or corrupted in any way, the ability to retrieve data is severely compromised. Protection of these stub files is as important as protection of the actual data.
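The stub idea, and why a lost stub is so damaging, can be illustrated with a small Python sketch. This is only a user-space illustration under stated assumptions: the stub layout, the marker string and the function names are invented here, and a plain dictionary stands in for cloud object storage. Real archive products perform the recall transparently inside the file system rather than through explicit function calls.

```python
import hashlib
import json
from pathlib import Path

STUB_MAGIC = "stub-v1"  # hypothetical marker, not any product's format


def archive_file(path, store):
    """Move a file's content into `store` and leave a small JSON stub behind."""
    path = Path(path)
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data  # stand-in for an object-storage upload
    stub = {"magic": STUB_MAGIC, "key": digest, "size": len(data)}
    path.write_text(json.dumps(stub))
    return stub


def is_stub(path):
    """Return True if the file at `path` looks like one of our stubs."""
    try:
        return json.loads(Path(path).read_text()).get("magic") == STUB_MAGIC
    except (ValueError, OSError, AttributeError):
        return False


def recall_file(path, store):
    """Replace a stub with the archived content it points to."""
    path = Path(path)
    stub = json.loads(path.read_text())
    data = store[stub["key"]]
    if hashlib.sha256(data).hexdigest() != stub["key"]:
        raise ValueError("archived object does not match stub checksum")
    path.write_bytes(data)
    return len(data)
```

In this sketch the stub is the only place that records which archived object belongs to which file. Delete or corrupt it, and the content is still in the store but nothing on the server says how to find it.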

A final problem relates to the source of much of the growth in unstructured data: devices, most of which don’t natively store data to SMB or NFS. Many cameras, media imaging devices and Internet of Things devices write data in an NTFS file format, and in most cases this data is first loaded onto a server running that same file system. That means that, to be archived, the data first needs to be copied over to the archive’s storage area so it can be subsequently managed.

What IT Needs

IT needs a simple solution that can archive data in place, extending the current file system instead of replacing it. By extending the file system, the solution does not require a separate mount point or a manual movement of data. Installation is much more seamless and assures compatibility with the existing infrastructure. Most modern file systems can accept these kinds of extensions. Microsoft, for example, has a filter layer that enables an archiving tool to run in between the operating system and the file system in a fully supported manner.

If the data management intelligence is put into the filter layer, each server in the data center that stores data has direct access to the cloud. This direct-to-cloud access also means performance now scales in lockstep with file server growth, essentially eliminating the potential for a bottleneck.

The file system extension approach also lowers the cost of the solution, enabling the organization to realize a quicker return on its investment. Not only does the organization save the cost of the gateway appliance, it does not have to worry about multiple copies of data. The file system becomes the sole authority.

The extension of the file system also means the data written on cloud storage is in the same format as the data written locally. In the case of Windows, it is NTFS throughout. In the case of Linux, it is the server’s local file system throughout. Extending the file system means unique file system attributes and security settings are preserved.

The file system extension approach also increases compatibility with other processes that might run on the server. For example, a virus scan run on a non-NTFS file system might accidentally recall all the files in the archive as each stub file is examined. With an NTFS extension, the two software solutions should work together.

IT also needs a solution that is not myopic. Some organizations will not want to archive data to the cloud or they may want to use a combination of archive targets, creating a hybrid archive. The solution should support a variety of archive storage options including direct attached storage (DAS), network attached storage (NAS), block storage (SAN), and even tape media in a tape library.

A final need to be met is the need to aggressively protect stub files. They are the weak link in most approaches. While some archive solutions make a redundant copy of the stub data, that copy is almost always made locally, and it is up to the IT professional to make sure that the metadata is moved off-site. Ideally, the archive solution should make sure a copy of the stub data follows the data being archived. In this case, that means a copy should be made in the cloud.

A cloud copy of metadata protects the organization from accidental deletion of stub file data and it also provides a powerful disaster recovery capability. In the event of a disaster, the process of recovering a server after the operating system has been installed is to simply copy down the stub files from the cloud. The new file server is ready within minutes. Then, data is retrieved as users begin to access it or IT can programmatically bring back the most active data.
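As a rough sketch of that recovery flow, the Python below copies stub files into the same store that holds the archived data and later writes them back onto a fresh server. Everything here is an assumption made for illustration: the "stubs/" key prefix, the function names, and the use of a plain dictionary as a stand-in for cloud object storage. Because only the small stub files are copied down, the restore step does not depend on how much data sits in the archive.

```python
from pathlib import Path

STUB_PREFIX = "stubs/"  # hypothetical key prefix in the object store


def backup_stubs(root, stub_paths, store):
    """Copy each stub into `store`, keyed by its path relative to root."""
    root = Path(root)
    for stub in stub_paths:
        rel = Path(stub).relative_to(root).as_posix()
        store[STUB_PREFIX + rel] = Path(stub).read_bytes()


def restore_stubs(store, new_root):
    """Recreate every backed-up stub under new_root on a rebuilt server."""
    new_root = Path(new_root)
    restored = []
    for key in sorted(store):
        if not key.startswith(STUB_PREFIX):
            continue  # archived data objects stay in the cloud for now
        target = new_root / key[len(STUB_PREFIX):]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(store[key])
        restored.append(target)
    return restored
```

After the restore, the directory tree looks complete to users, and the archived content itself can come back on first access or in a scheduled bulk recall of the most active files.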

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
