Can a Deep Archive in the Cloud be Useful?

A lot of people worry that a deep archive in the cloud will be about as useful as the boxes in the deepest part of your attic. The data will be stored in an economical way and the cloud vendor will ensure its preservation – but whether or not the data actually serves any future purpose is highly questionable. The deeper and cheaper the archive, the less likely someone will use or recall it.

This concern that data will go unused is one of the reasons people tend not to archive in the first place. They worry that if they put data anywhere other than online, immediately accessible storage, they will never use it again. Projects that might be able to use the data to bring value to the company will be unable to find it and therefore either re-create the data or go without it.

IT must overcome this fear in order to accomplish the very important task of archiving data to less-expensive storage, and unfortunately the fear is greatest when the archiving destination is the cloud. Simply moving the data to the cloud with no gateway, management interface, or searchability justifies the fear. But what if a company could make cloud storage appear the same as local, onsite storage? That could assuage these fears altogether.

The fear of the cloud is twofold: object storage is still an unknown to many people and applications, and the performance of data in the cloud varies widely with the throughput and latency of the connection to the cloud provider. This is why cloud gateways are becoming so popular; they solve both problems.

Cloud gateways typically talk NFS and SMB, allowing regular users and applications to write data to a POSIX-compliant file system they already understand. But a gateway is just the first step; it merely makes the connection. What is really needed is Cloud NAS. In this implementation, recently written and recently read data is stored in a local cache, and all data is asynchronously copied to the cloud provider of your choice. The Cloud NAS seamlessly translates between the S3 API (or other cloud protocols) and NFS/SMB. It then creates a global file system that provides transparent access to data whether it is stored on a local NAS or archived in the cloud.
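
To make that translation concrete, here is a minimal sketch of the core idea, assuming a local cache directory that is mirrored to an S3 bucket with boto3 (the paths and bucket name are hypothetical, and this is not any vendor's actual implementation): applications write through an ordinary POSIX path, and a background task asynchronously copies new files to object storage.

```python
# Minimal sketch of the gateway idea: files written into a local POSIX cache
# directory are asynchronously copied to an S3 bucket, keyed by their relative
# path so the namespace is preserved. Paths and bucket name are hypothetical;
# a real Cloud NAS also handles metadata, locking, and cache eviction.
import os

import boto3

CACHE_DIR = "/mnt/cloudnas-cache"   # where NFS/SMB clients actually write
BUCKET = "example-deep-archive"     # hypothetical destination bucket

s3 = boto3.client("s3")


def sync_cache_to_cloud():
    """Walk the local cache and upload every file to the object store."""
    for root, _dirs, files in os.walk(CACHE_DIR):
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.relpath(local_path, CACHE_DIR)
            s3.upload_file(local_path, BUCKET, key)


if __name__ == "__main__":
    sync_cache_to_cloud()
```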

A Cloud NAS system allows customers high-performance access to recent data while ensuring that all data is stored in the seemingly infinite capacity of the cloud. Again, data in the cloud is accessible through the Cloud NAS via the same mount point and pathname that were used to store it in the first place, regardless of which cloud provider holds the data. If a particular set of data was archived to the cloud and then went unused for a long time, it would eventually move out of the cache to make room for more recent data. However, as soon as users start accessing the older data via the same mount point where they placed it long ago, the data is immediately copied back to the local cache for easy and quick access.
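
The recall-and-evict behavior described above is essentially a read-through cache. A simplified sketch follows, again with hypothetical paths and bucket name, and using file access times as a stand-in for whatever recency tracking a real product would use:

```python
# Simplified sketch of the read-through cache behavior described above.
# Paths and bucket name are hypothetical; a real Cloud NAS handles locking,
# metadata, partial reads, and much more.
import os

import boto3

CACHE_DIR = "/mnt/cloudnas-cache"
BUCKET = "example-deep-archive"

s3 = boto3.client("s3")


def open_archived_file(relative_path):
    """Serve a read from the local cache, recalling the object from the
    cloud first if it has been evicted."""
    local_path = os.path.join(CACHE_DIR, relative_path)
    if not os.path.exists(local_path):                        # cache miss
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, relative_path, local_path)   # recall to cache
    os.utime(local_path)              # mark as recently used for the evictor
    return open(local_path, "rb")


def evict_cold_data(cache_budget_bytes):
    """Remove the least recently accessed cached copies until the cache fits
    its budget. The cloud copy is authoritative, so this only frees space."""
    entries = []
    for root, _dirs, files in os.walk(CACHE_DIR):
        for name in files:
            path = os.path.join(root, name)
            entries.append((os.path.getatime(path), os.path.getsize(path), path))
    entries.sort()                                            # oldest access first
    used = sum(size for _atime, size, _path in entries)
    for _atime, size, path in entries:
        if used <= cache_budget_bytes:
            break
        os.remove(path)
        used -= size
```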

Most cloud providers now deliver multiple classes of storage; Amazon, for example, offers S3 and Glacier. The Cloud NAS system should also support tiering data within the cloud. Data could be written first to the high-performance tier, which may have less-expensive access fees, and then eventually pushed to the cold tier, which has very low storage costs.
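
On AWS specifically, this kind of intra-cloud tiering can be expressed as a bucket lifecycle rule. The example below is illustrative only; the bucket name and the 180-day threshold are assumptions, and a Cloud NAS product may manage such policies on your behalf.

```python
# Illustrative only: an S3 lifecycle rule that moves objects to the Glacier
# storage class after 180 days. Bucket name and threshold are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-deep-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-data-to-glacier",
                "Filter": {"Prefix": ""},        # apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 180, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```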

StorageSwiss Take

Cloud NAS makes a lot of sense because it virtualizes away much of the complexity of using cloud providers, giving customers a simple mount point to write to via the POSIX-compliant interfaces they already know: NFS and SMB. Customers do need to understand that the local cache must be sized appropriately to hold all concurrent workloads, but that is relatively easy compared with managing all of the storage they would need if they were not using the cloud as a deep archive.
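
As a rough illustration of that sizing exercise, here is a back-of-the-envelope calculation; the workload figures are invented and are not guidance:

```python
# Back-of-the-envelope cache sizing. All figures are hypothetical examples,
# not recommendations: the cache should hold every concurrently active
# working set plus some headroom for data recalled from the archive.
active_working_sets_tb = [4.0, 2.5, 1.5]   # e.g. three concurrent projects
recall_headroom = 0.25                     # 25% extra space for recalls

cache_size_tb = sum(active_working_sets_tb) * (1 + recall_headroom)
print(f"Suggested local cache size: {cache_size_tb:.1f} TB")   # 10.0 TB
```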

W. Curtis Preston (aka Mr. Backup) is an expert in backup & recovery systems; a space he has been working in since 1993. He has written three books on the subject, Backup & Recovery, Using SANs and NAS, and Unix Backup & Recovery. Mr. Preston is a writer and has spoken at hundreds of seminars and conferences around the world. Preston’s mission is to arm today’s IT managers with truly unbiased information about today’s storage industry and its products.

5 comments on “Can a Deep Archive in the Cloud be Useful?”
  1. Sudhindra says:

    A great article. Just like a data archive, a deep archive in the cloud is needed and provides a lot of benefits. 1. It improves the overall performance of storage and web application requests. 2. It helps deploy compliance rules, since there are no standard compliance rules for social media sites as such. 3. For example, data that has gone unused for a reasonable period, say 18 months, preferably unstructured or semi-structured data, could be archived in the cloud, freeing a huge amount of space on existing, expensive storage. Some users neither access their data nor remain active on their social media accounts for a long time, so a deep archive is the need of the hour.

  2. Tim Wessels says:

    Well, keeping a “deep” archive in public cloud storage is a mistake. You will pay to keep this data there forever, and if you need it, there will be additional charges for touching it that could come as a surprise. It makes more sense to store “deep” archive data on your private object-based storage cloud. You can also compress your “deep” archive data to save on storage space. This “deep” archive data should also be erasure coded and not replicated to save on storage space. And finally, you need to be able to search your “deep” archive data, so your private object-based storage cloud should support a search engine like Elasticsearch.

    • wcurtispreston says:

      So if I understand you correctly, it’s a mistake because it’s cheaper to have a private cloud, especially when you consider that you’re going to store the data indefinitely. Private cloud offers more control and it MAY be less expensive, but I’m not sure I can agree with the blanket statement that the public cloud will always be more expensive.

      Public cloud offers no major capital outlay. Private cloud requires the purchase of a lot of hardware up front. By design you have to buy hardware before you need it. This leads to the big issue w/private cloud: overprovisioning. The average utilization of storage in most datacenters is at BEST 25%. That means that private cloud would have to be 4X cheaper to actually be cheaper.
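
      A quick worked version of that utilization argument, where the 25% figure comes from the paragraph above and the per-TB prices are invented purely for illustration:

      ```python
      # Worked version of the utilization argument. The 25% utilization figure
      # is from the comment above; the per-TB prices are invented for illustration.
      utilization = 0.25                    # fraction of purchased private capacity actually used
      public_cost_per_usable_tb = 100.0     # hypothetical $/TB/month, paid only for data stored
      private_cost_per_raw_tb = 30.0        # hypothetical $/TB/month of purchased capacity

      # At 25% utilization you buy 4 TB of private capacity per usable TB, so the
      # effective private cost is the raw price divided by utilization.
      private_cost_per_usable_tb = private_cost_per_raw_tb / utilization
      print(private_cost_per_usable_tb)     # 120.0 -- the raw price must be ~4x lower to break even
      ```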

      Also remember that public cloud comes with multiple copies and private cloud may or may not. You’d need three copies (or a geo-redundant erasure coding system) to mimic what Amazon/Google/Azure will do for you.

      Add to this the cost of the floor space, power and cooling, and the FTEs required to manage it.

      It MAY or may not be cheaper. It may indeed be a better model. I’m just not convinced that it will always be cheaper.

      As to the retrieval costs… given how rarely most people touch their deep archive, those costs are trivial to non-existent.

      • Tim Wessels says:

        Well, for trivial amounts of “deep” archive data you might have a case. But since the amount of “deep” archive data will be on the proverbial “hockey stick” growth curve, then it is a mi$take to keep it in a public storage cloud. The advantage of using a private object-based storage cloud is you can size it for the amount of data you need to protect today and grow it incrementally whenever you need more. Just add the new storage node(s) to the cluster. Data will be rebalanced automatically. There is no need to over-provision object-based storage that you won’t consume for years. Cloudian can start with a 3-node cluster. Other object-based storage software vendors require more storage hardware upfront.

        Most object-based storage software vendors support both replication and erasure coding; Amplidata (HGST) and Cleversafe (IBM) only support erasure coding. You need to be careful when using erasure coding in a multi-site object-based storage cluster deployment. Amplidata and Cleversafe may have viable methods for doing this since they only do erasure coding, and it may involve the use of hierarchical erasure codes. Cloudian prefers to erasure code data within a site and then replicate the erasure-coded data to another site.

        A single storage administrator can manage upwards of 10PB of object-based storage. In legacy SAN and NAS storage environments that number is more like 400TB per storage administrator.

        Object-based storage software vendor Caringo has a “Darkive” feature to lower power requirements in their cluster. Storiant has a similar feature for their “deep” archive solution.

        Retrievals from “deep” archive public cloud storage, assuming you use it, can take several hours to “make ready” when using AWS Glacier. Google, by comparison, said their Coldline “deep” archive storage starts transferring data in a minute or two. But that’s when all the charges start adding up for both of them, depending on how much “deep” archive data you touch.

        You really do need to run the numbers before parking PBs of data in a public storage cloud and then compare it with running your own private storage cloud or using a managed service like Igneous, which puts their object storage appliances behind your firewall and bills you for the amount of storage you are using. It sounds suspiciously similar to the ill-fated Nirvanix, but let’s hope Igneous has a better-considered business model than running a race to the bottom against the big dogs.

  3. wcurtispreston says:

    I’m fine with private cloud vendors. I’m just not fine with blanket statements that say they will ALWAYS be cheaper and it would be a “mistake” to put it in the public cloud. Putting it in the public cloud will also have advantages that might be more valuable to a particular customer than simple cost comparisons.

    I understand most private cloud vendors are built on a scale-out model and you can scale as you grow. But you will always buy more than you need at any given time due to the length of the purchasing cycle. You might buy enough for only a year or something, but you will always buy more. With public cloud you never pay for a single byte you don’t need.

    FWIW, I don’t factor in retrieval time for deep archive. If you haven’t looked at it in five years, waiting another couple of hours isn’t going to kill you. If it will, I’d question whether this was really a “deep” archive.

    As to whether a given solution would indeed be cheaper… I’d have to see actual pricing to make that determination.

    I can’t agree more with you on running the numbers.
