The cloud is a valuable tool that almost every data center should leverage. It is, however, just that: a tool. The cloud is not the be-all and end-all answer to all of IT's problems. It can solve many of them, but there are alternatives that can solve similar problems more cost effectively. IT planners need to leverage the cloud, just as they would any other resource, for what it does best, and use other resources for what they are best suited to.
What is the Cloud Good For?
Fundamentally, the cloud is good at temporary, bursty workloads. If an organization needs rapid analysis of massive amounts of data, the cloud lets it harness hundreds of processors to reach an answer quickly. In terms of storage and data protection, the cloud is good at managing production data, a use case few organizations exploit, and at storing disaster recovery data. Both of these use cases tend to have finite data sets: production data is data that was recently active, and the disaster recovery set is the most recent copy of data, which will typically be a close approximation of production data.
Is Cloud Storage Good for Long Term Data Retention?
Another area vendors claim the cloud is good for is the long term retention of data and data archives. This claim is up for debate.
Data is retained for a variety of reasons: sometimes to meet a compliance or regulatory demand, sometimes as a "just in case" measure. One of the primary goals of archiving is to move older data from primary storage to a secondary storage target, slowing the growth of production storage systems. Archiving should also include moving older backups to storage that costs less than the initial backup target. Cutting across these use cases is adherence to data privacy regulations, specifically right-to-be-forgotten legislation. While there is as yet no clear ruling, removing data from an archive is clearly easier than removing it from backup data sets.
One of the downsides to on-premises secondary storage systems, such as object storage systems and tape libraries, is that the upfront investment in capacity is significant. Cloud storage, because it is purchased incrementally, a TB at a time, enables an organization to start small with its retention and archive initiatives and then add capacity incrementally as the need for secondary storage grows.
The archive-to-cloud-first strategy becomes challenging when the three year cost of storing regulated and archived data exceeds the three year cost of owning the object storage system or tape library. The recurring cost of terabytes, if not petabytes, of cloud storage adds up over time. In most cases, keeping even a few hundred TBs of data in cloud storage for 7-10 years costs far more than a tape library, despite the library's higher upfront cost. There are also egress charges to consider any time data is retrieved or moved from the cloud back to local storage; depending on the quantity of data retrieved, these charges can be significant.
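To make the comparison concrete, the math can be sketched as below. All prices here are illustrative assumptions (a hypothetical archive-tier rate, library price, and media cost), not vendor quotes; the point is the shape of the calculation, not the specific figures.

```python
# Illustrative cost comparison: cloud archive storage vs. an owned tape
# library, over a multi-year retention period. All prices are hypothetical
# assumptions for the sake of the sketch.

def cloud_cost(tb, price_per_gb_month, years, egress_tb=0, egress_per_gb=0.09):
    """Total cloud cost: monthly storage charges plus any egress fees."""
    storage = tb * 1000 * price_per_gb_month * 12 * years
    egress = egress_tb * 1000 * egress_per_gb
    return storage + egress

def tape_cost(tb, library_and_drives=50_000, media_per_tb=8.0):
    """Total tape cost: one-time library/drive purchase plus media."""
    return library_and_drives + tb * media_per_tb

capacity_tb = 300
years = 7

cloud = cloud_cost(capacity_tb, price_per_gb_month=0.004, years=years)
tape = tape_cost(capacity_tb)

print(f"Cloud ({years} yr): ${cloud:,.0f}")  # $100,800 with these assumptions
print(f"Tape (owned):      ${tape:,.0f}")    # $52,400 with these assumptions
```

Even before egress charges, the recurring cloud fees in this sketch roughly double the cost of the owned library over seven years; plugging in real quotes is what turns this from an illustration into a planning exercise.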
When an organization reaches the point where the recurring cost of renting infrastructure exceeds the cost of buying and owning it, the organization needs to decide whether the potential operational savings are worth continuing to store regulated and protected copies of data in the cloud. Some organizations will decide that the ease of use of cloud storage outweighs the cost savings of owning an object storage system or tape library and eliminating the recurring payment. Others will eventually decide that owning the infrastructure is more cost effective.
Cloud Storage Performance vs. Tape
Another aspect for IT planners to consider is the accessibility of the data. An advantage of cloud and object storage is that, if the data is stored correctly, it is accessible to any application that understands the S3 protocol. These applications can operate directly on the data without first restoring it from a backup format. Access requirements vary: some organizations need access to all of the data, others to just a subset, and still others don't need to access the secondary copy at all.
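Direct access over the S3 protocol can be sketched as a small helper like the one below. The bucket and key names are hypothetical; in practice the client would come from `boto3.client("s3")` with credentials configured.

```python
# Sketch: an application reading an archived object directly over the S3
# API, with no restore from a backup format first. Bucket and key names
# are hypothetical examples.

def read_archived_object(s3_client, bucket: str, key: str) -> bytes:
    """Fetch an object and return its raw bytes for direct processing."""
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read()

# Usage (assuming boto3 is installed and credentials are configured):
#   import boto3
#   data = read_archived_object(boto3.client("s3"),
#                               "archive-bucket", "logs/2023/app.log")
```

Because the helper takes the client as a parameter, it works against any S3-compatible endpoint (public cloud or on-premises object storage), which is the portability benefit the protocol provides.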
How much of the archive data needs to be accessed, how frequently the organization will access it, and how quickly the data must be delivered to the requester all factor into the decision. Unless the data on secondary storage is frequently used in an analytics process, tape should be the primary home for secondary data. Most requests for secondary data, outside of the analytics use case, are not nearly as time sensitive as active recovery efforts.
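Those three factors lend themselves to a simple decision rule. The sketch below is one hypothetical way to encode it; the thresholds are illustrative assumptions, not industry standards.

```python
# Hypothetical tier-selection rule based on the factors above: how often
# the data is accessed and how quickly it must be returned. Thresholds
# are illustrative assumptions only.

def choose_tier(accesses_per_month: float, max_wait_minutes: float) -> str:
    if accesses_per_month >= 10 and max_wait_minutes < 1:
        return "disk/object"    # frequently used, e.g. an analytics working set
    if max_wait_minutes < 5:
        return "cloud archive"  # occasional but time-sensitive retrieval
    return "tape"               # infrequent, latency-tolerant requests

print(choose_tier(0.1, 240))    # a once-in-a-while compliance request -> "tape"
```

The exact cutoffs matter less than the exercise of writing them down: once access frequency and tolerable wait time are stated explicitly, the right tier for each data set usually becomes obvious.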
A Secondary Storage Strategy We Can Live With
Secondary data is a combination of recent backup data, old backup data and old production data. To store this information cost effectively, secondary storage should combine three types of hardware. First, an affordable flash tier for rapid recovery situations; the flash tier is the target for replication jobs and for servers that recover via "boot from backup" (Instant Recovery). Second, a storage device that is cost effective and can scale to meet the active requirements of the secondary data set; this second tier can be cloud storage, on-premises high capacity NAS or object storage.
The third tier for most organizations should eventually be a tape library. That said, there is a capacity threshold organizations need to cross before tape becomes cost effective. While tape media continues to offer the industry's lowest cost per TB, before an organization can realize that savings it must also purchase the tape library and some number of tape drives. The organization then needs enough media that the per-media savings versus other storage technologies, cloud storage especially, overcomes the cost of the library and drives. The actual number of TBs required to reach that break-even point will fluctuate from year to year based on the pricing of the various technologies.
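That break-even point can be solved for directly: find the capacity at which the per-TB savings of tape media over cloud storage pays back the fixed cost of the library and drives. The figures below are hypothetical assumptions used only to show the calculation.

```python
# Hypothetical break-even capacity: the point where owning a tape library
# becomes cheaper than cloud archive storage over a retention period.
# All prices are illustrative assumptions.

def break_even_tb(library_cost, media_per_tb, cloud_per_gb_month, years):
    """Capacity (TB) at which tape's fixed cost is recovered vs. cloud."""
    cloud_per_tb = 1000 * cloud_per_gb_month * 12 * years  # $/TB over the period
    return library_cost / (cloud_per_tb - media_per_tb)

tb = break_even_tb(library_cost=50_000, media_per_tb=8.0,
                   cloud_per_gb_month=0.004, years=7)
print(f"Break-even at roughly {tb:.0f} TB")  # ~152 TB with these assumptions
```

Rerunning this with each year's actual prices is exactly the "point on the graph" exercise the paragraph above describes: as cloud rates, media prices and library costs shift, so does the capacity at which tape starts to pay off.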
Once the capacity requirements justify a tape library, it should contain a copy of all data, including active backups. Then, as data ages off production storage or a higher secondary storage tier, it only needs to be removed from that tier, not copied again to tape. Having all data on tape provides an offline benefit that no other medium can offer, especially in terms of protection from ransomware and malware; any data that is online is always at risk of attack.
However, the tape tier should assume a bigger role than just storing an offline copy of data. If data can be removed from production storage and the higher tiers of secondary storage, overall costs drop, as does data center floor space consumption.
Tape is not dead. In fact, given the rate of unstructured data growth combined with the threat of cyber attacks, the value of tape is increasing. Most IT planners will need a combination of storage types to meet the demands of the organization. Each storage type has its advantages and disadvantages, and its role in the data center will change over time, but excluding any one format is short-sighted.