Organizations of all sizes are trying to develop a cloud strategy, both in terms of applications and storage. Most of these organizations are finding that the promised cost savings of the cloud are not as great as they had hoped, especially when it comes to storage. The reality is that the cloud is best suited to temporal data, not to data that needs permanency. In other words, the cloud is cost effective for computing on today’s data, but those costs become less compelling when an organization needs to store data for an extended period of time.
The Cloud Cost Model Challenge
Cloud providers charge for two resources: compute and storage. For compute, they charge for the amount of CPU resources an organization’s applications use over a period of time. Essentially, the organization is renting as much CPU time as it needs, when it needs it – but ONLY when it needs it. This is an ideal use of the cloud. The organization can scale up processing when the need arises and scale back down when it passes. This is by definition a time-based or “temporal” relationship.
The other use of the cloud is to store data. This data can be the information those applications need to process, but it can also be backup or archive data. Cloud providers usually charge by the amount of capacity the organization is consuming, and sometimes for the amount of data transferred out. This relationship with the provider is much more permanent and costly. While providers do an excellent job of delivering this storage at a very low upfront cost, the recurring charges add up and can become expensive.
Many factors beyond hard costs go into the total cost of storage; power, cooling, floor space and IT operational costs should all be factored in. Storage Switzerland has determined that most organizations with more than a petabyte of actual capacity should consider an on-premises storage infrastructure over a cloud-based one.
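To make that threshold concrete, here is a rough cost-model sketch comparing the monthly fully burdened cost of roughly a petabyte in cloud object storage against an amortized on-premises array. Every rate in it (the per-GB cloud price, the array purchase price, the power, cooling, floor space and staffing figures) is an assumption chosen purely for illustration, not a quoted price; substitute your own contract numbers.

# Illustrative-only storage TCO comparison. All rates below are assumptions
# made for the sake of the example, not quoted prices from any provider or vendor.

CAPACITY_TB = 1024                  # roughly 1 PB of actual capacity

# Assumed public cloud rates
CLOUD_STORAGE_PER_GB_MONTH = 0.021  # assumed object storage rate, $/GB-month
CLOUD_EGRESS_PER_GB = 0.09          # assumed egress rate, $/GB
EGRESS_TB_PER_MONTH = 20            # assumed data read back out each month

# Assumed on-premises costs, amortized monthly
ARRAY_COST = 300_000                # assumed purchase price of a ~1 PB array
AMORTIZATION_MONTHS = 48            # four-year depreciation
POWER_COOLING_PER_MONTH = 1_500     # assumed power + cooling
FLOOR_SPACE_PER_MONTH = 800         # assumed rack / floor space
ADMIN_FTE_FRACTION = 0.25           # assumed fraction of one admin's time
ADMIN_LOADED_MONTHLY = 10_000       # assumed fully loaded monthly admin cost

def cloud_monthly(capacity_tb: float) -> float:
    gb = capacity_tb * 1024
    storage = gb * CLOUD_STORAGE_PER_GB_MONTH
    egress = EGRESS_TB_PER_MONTH * 1024 * CLOUD_EGRESS_PER_GB
    return storage + egress

def on_prem_monthly() -> float:
    hardware = ARRAY_COST / AMORTIZATION_MONTHS
    facilities = POWER_COOLING_PER_MONTH + FLOOR_SPACE_PER_MONTH
    staff = ADMIN_FTE_FRACTION * ADMIN_LOADED_MONTHLY
    return hardware + facilities + staff

print(f"Cloud, monthly:   ${cloud_monthly(CAPACITY_TB):,.0f}")
print(f"On-prem, monthly: ${on_prem_monthly():,.0f}")

The specific totals matter less than the shape of the two lines: the cloud charge grows linearly with capacity every month, while the on-premises cost is dominated by fixed, amortized expenses, which is why the comparison tends to tip somewhere around the petabyte mark.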
Is Your Cloud Strategy Backwards?
The problem with most cloud strategies is that they almost always start with using the cloud for storage, typically backup and then archive. Some strategies then evolve to putting all storage, even active data, in the cloud and leveraging caching so the very active data can stay on-premises for applications and users. This strategy’s end game is all of the applications shifting to the cloud and those applications’ data being stored there as well. The organization’s data center eventually becomes an operations center.
In general there is nothing wrong with this strategy until the capacity in the cloud begins to reach and surpass 1PB. At that point it may no longer be economically attractive compared to the cost of keeping data internally. But the cost calculation must cover more than just the cost to store and manage data; the cost to run the organization’s applications on-premises must also be factored in. And as stated above, the cloud is very well suited to application processing.
A New Cloud Strategy
A new strategy is needed that leverages the temporal processing advantages of the cloud and the cost effectiveness of on-site storage. In short, the data center should remain the data center and the cloud should become the processing center. For this strategy shift to occur, active data needs to be cached in the cloud instead of the data center, essentially reversing the way cloud caching appliances are used today. If cloud appliance/caching vendors were to virtualize their current hardware appliance platforms (so they could be instantiated in the cloud), then this change could occur.
Larger organizations could leverage the cloud for its compute muscle and let it host applications. Those applications could scale up processing almost limitlessly thanks to the cloud, but only during the peak times when they actually need that processing. During off-peak times they could scale their processing purchase to well below what they normally would have kept in their own data centers. For larger organizations, all of their primary, more permanent data would remain on-premises, only being accelerated to the cloud via the cloud-based caching appliance when that data is active. For those cloud-hosted applications, storage performance would still be excellent, since the data would be cached on the same server as the application and cloud providers often use flash or DRAM as storage for hosted applications.
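As a rough illustration of what that reversed cache could look like, the sketch below keeps the primary copy on-premises and runs a simple read-through, least-recently-used cache next to the cloud-hosted application. The OnPremStore class is a hypothetical stand-in for whatever protocol actually reaches back to the data center; no vendor’s appliance works exactly this way.

# Minimal sketch of a "reversed" cloud cache: the cache lives beside the
# cloud-hosted application while primary data stays in the data center.
from collections import OrderedDict

class OnPremStore:
    """Hypothetical client for the on-premises primary storage."""
    def read(self, key: str) -> bytes:
        # In practice this would be an object, NFS or block call back home.
        raise NotImplementedError

class CloudSideCache:
    def __init__(self, backend: OnPremStore, capacity_items: int = 10_000):
        self.backend = backend
        self.capacity = capacity_items
        self._lru: "OrderedDict[str, bytes]" = OrderedDict()

    def read(self, key: str) -> bytes:
        if key in self._lru:               # hot data: served from cloud-local flash/DRAM
            self._lru.move_to_end(key)
            return self._lru[key]
        data = self.backend.read(key)      # cold data: one trip back to the data center
        self._lru[key] = data
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)  # evict the least recently used item
        return data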
Adding this strategy to the list of potential cloud use cases would allow the cloud to appeal to data centers of all sizes. Smaller data centers, with less than a PB of capacity, could leverage an on-premises appliance to cache data locally and run applications in either the cloud or the data center. Large data centers would leverage the cloud just for applications, with most of the storage remaining on-premises.
The Path to a New Strategy
This new cloud strategy requires an application that can run in the cloud. Ideally this is a native cloud application, so that it can take advantage of the scale-out capabilities of cloud compute, but a cloud-native application is not required. Most providers today can run legacy applications in the cloud, allowing the data center to take advantage of most of the cloud’s capabilities without having to completely rewrite code. And many legacy application platforms are quickly embracing a cloud design so that they can fully exploit cloud capabilities like scale-out compute and availability zones.
Another step in this path is, again, the instantiation of cloud caching appliances in the cloud, essentially reversing the caching process. Instead of caching data in the on-premises data center, they will cache data to the applications running in the cloud. This model also allows these appliances to scale instantly. If a larger cache is needed for a short period of time, it can be purchased for just that period of time. The cloud caching appliance could adapt as quickly as the workload does.
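One hedged way to picture “adapting as quickly as the workload” is a right-sizing loop that grows the cloud-side cache while the hit rate suffers and shrinks it once the peak passes. The thresholds and step sizes below are assumptions for illustration only, not recommendations from any caching vendor.

# Illustrative cache right-sizing loop. All thresholds and step sizes are
# assumed values chosen only to show the idea.

def next_cache_size(current_gb: float, hit_rate: float,
                    min_gb: float = 100, max_gb: float = 10_000) -> float:
    if hit_rate < 0.80:       # misses are forcing trips back to the data center
        proposed = current_gb * 1.5
    elif hit_rate > 0.97:     # cache is probably over-provisioned for the workload
        proposed = current_gb * 0.8
    else:
        proposed = current_gb
    return max(min_gb, min(max_gb, proposed))

# Example: a burst of cold reads drops the hit rate, the cache grows,
# then shrinks again once the working set is resident.
size = 500.0
for observed_hit_rate in (0.70, 0.75, 0.92, 0.99, 0.99):
    size = next_cache_size(size, observed_hit_rate)
    print(f"hit rate {observed_hit_rate:.2f} -> cache {size:,.0f} GB")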
Conclusion
The biggest long-term challenge to cloud adoption is early success. The more comfortable an organization becomes with the cloud, and the more skilled it gets at taking advantage of the cloud’s capabilities, the more data it ends up storing there. When the size of the cloud data set surpasses 1PB, the cost to effectively repurchase that 1PB on a recurring basis can make the once economical cloud look expensive, even after factoring in storage management soft costs. Plus there is the ongoing concern over cloud security and the risk of a breach. By reversing the traditional strategy and keeping data in the data center, organizations can keep costs down while leveraging the cloud for what it is best at: using cloud compute to process temporal data.

Well, the “goal post” of how much data can “economically” be stored in the public cloud seems to be getting larger. Recently, Mr. Crump wrote that once you get a few hundred terabytes stored in the public cloud, you should think seriously about moving it to your own local storage cloud. Moving hundreds of terabytes or a petabyte of data out of a public cloud is going to be costly, unless the public cloud storage provider has a more efficient method of forklifting it from the public cloud to your own local storage cloud that doesn’t involve the Internet. Realizing after a few years that you have a cost containment problem from keeping hundreds of terabytes or a petabyte of data in the public cloud seems a little late. It would be much better to start with a local storage cloud for all warm, cold and archive data and only move the hot or transactional data into the public cloud, if the applications using it are being run there. The concept of caching only hot or transactional data in the public cloud has a great deal of merit. One thing not considered is that storage software vendors have license fees (subscription or perpetual) and typically charge annual maintenance and support fees based on how much “usable data” you have in your local storage cloud. A model is needed that compares the fully burdened cost of local cloud storage with public cloud storage. Rules of thumb are useful, but the issue needs a more granular method in order to compare the fully burdened costs of data storage.
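A starting point for that kind of model could be as simple as the sketch below, which adds the software license, maintenance and facility terms to the local side and the eventual egress “forklift” cost to the public cloud side. Every rate is an assumed placeholder, not a quote from any vendor or provider; the value is in forcing each burden onto the same per-month basis.

# Rough sketch of a fully burdened local-vs-public-cloud storage comparison.
# All rates are assumed placeholders for illustration only.

USABLE_TB = 500

# Assumed local storage-cloud costs, per usable TB per month
LOCAL_HW_PER_TB_MONTH = 6.0        # amortized hardware
LOCAL_LICENSE_PER_TB_MONTH = 4.0   # storage software subscription
LOCAL_MAINT_PER_TB_MONTH = 1.5     # annual maintenance/support, spread monthly
LOCAL_FACILITY_PER_TB_MONTH = 2.0  # power, cooling, floor space, operations

# Assumed public cloud rates
CLOUD_PER_GB_MONTH = 0.021
CLOUD_EGRESS_PER_GB = 0.09         # cost to eventually pull the data back out

local_monthly = USABLE_TB * (LOCAL_HW_PER_TB_MONTH + LOCAL_LICENSE_PER_TB_MONTH
                             + LOCAL_MAINT_PER_TB_MONTH + LOCAL_FACILITY_PER_TB_MONTH)
cloud_monthly = USABLE_TB * 1024 * CLOUD_PER_GB_MONTH
one_time_exit = USABLE_TB * 1024 * CLOUD_EGRESS_PER_GB  # the "forklift" problem

print(f"Local, fully burdened: ${local_monthly:,.0f} per month")
print(f"Public cloud storage:  ${cloud_monthly:,.0f} per month")
print(f"One-time cost to exit: ${one_time_exit:,.0f}")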
I think it depends on how well we utilize or optimize the infrastructure and services. For example (excerpts from AWS Glacier pricing, a cushion to save cost):
——————————————————————————-
TCO example 1: Let’s assume that you upload 1 PB of data into Amazon Glacier, that the average archive size is 1 GB and that you never retrieve any data. When you first upload the 1 PB, there are upload request fees of 1,048,576 GB x $0.05 / 1,000 = $52.43. Then the ongoing storage costs are 1,048,576 GB x $0.004 = $4,194.30 per month, or $50,331.65 per year.
TCO example 2: Now let’s assume the same storage as example 1 and also assume that you retrieve 3 TB (3,072 GB) a day on average using Bulk retrievals and that the average archive size was 1 GB for a total of 3,072 archives. That’s 90 TB retrieved per month or 8.8% of your data per month. The total retrieval fees per day would be 3,072 x $0.0025 + 3,072 x $0.025 / 1,000 = $7.76, which equates to $232.70 per month and $2,792.45 per year. Adding storage costs, your annual TCO is $50,331.65 + $2,792.45 = $53,124.10. In this example, retrieval fees make up just 5.3% of your total Glacier fees. Your total monthly cost per GB stored including retrieval fees is $0.004222/GB.
TCO example 3: Now let’s assume the same storage as example 1 and also assume that you retrieve 1 TB (1,024 GB) a day on average using Standard retrievals and that occasionally you use Expedited retrievals for urgent requests, averaging 10 GB per day. Here, we assume the average archive size is 1 GB. That’s 30.3 TB per month or 3% of your data per month. The total retrieval fees per day would be (1,024 x $0.01 + 1,024 x $0.05 / 1,000) + (10 x $0.03 + 10 x $0.01) = $10.69, which equates to $320.74 per month and $3,848.83 per year. Adding storage costs, your annual TCO is $50,331.65 + $3,848.83 = $54,180.48. In this example, retrieval fees make up just 7.1% of your total Glacier fees. Your total monthly cost per GB stored including retrieval fees is $0.0043/GB.
——————————————————————————————-
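The arithmetic in those excerpts is straightforward to reproduce. The short sketch below redoes TCO example 2 (the bulk-retrieval case) using only the rates quoted in the excerpt, which may not reflect current pricing, so a different retrieval pattern can be plugged in.

# Reproduces TCO example 2 from the Glacier excerpt above, using the rates
# quoted there ($0.004/GB-month storage, $0.0025/GB bulk retrieval,
# $0.025 per 1,000 bulk requests, $0.05 per 1,000 upload requests).

GB_STORED = 1_048_576          # 1 PB
ARCHIVE_SIZE_GB = 1
DAILY_RETRIEVAL_GB = 3_072     # 3 TB per day via Bulk retrievals

upload_fee = GB_STORED / ARCHIVE_SIZE_GB * 0.05 / 1_000                 # one-time, ~$52.43
storage_per_year = GB_STORED * 0.004 * 12                               # ~$50,331.65

daily_retrieval = (DAILY_RETRIEVAL_GB * 0.0025                          # data retrieval
                   + DAILY_RETRIEVAL_GB / ARCHIVE_SIZE_GB * 0.025 / 1_000)  # request fees
retrieval_per_year = daily_retrieval * 30 * 12                          # ~$2,792.45

annual_tco = storage_per_year + retrieval_per_year                      # ~$53,124.10
per_gb_month = annual_tco / 12 / GB_STORED                              # ~$0.004222

print(f"One-time upload requests: ${upload_fee:,.2f}")
print(f"Annual TCO:               ${annual_tco:,.2f}")
print(f"Retrieval share of fees:  {retrieval_per_year / annual_tco:.1%}")
print(f"Monthly cost per GB:      ${per_gb_month:.6f}")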
We have EFS, EBS and S3 to move data back and forth as requirements dictate, and we can pair them with strong compute engines like Spark, or Spark applications for big data and Hadoop…