Storing primary data in the cloud flies in the face of logic. Data is typically stored close to where it is created, and there is one reason for that: physics. There was a time, of course, when storage and compute always lived in one physical box – then came networked storage. Shared storage introduced a few milliseconds of latency, but that was deemed an acceptable tradeoff for the advantages shared storage brought to the table.
Shared storage also introduced another problem: the noisy neighbor. When multiple servers share the same storage resource, a burst of activity from one server can significantly degrade the storage performance of the others. This is why some storage products introduced Quality of Service (QoS) features.
Now consider the public cloud. Like shared storage in the data center, moving your data into the cloud has a number of benefits: scalability, elasticity, economics and data protection. Good cloud storage providers continuously replicate data to multiple locations, allowing it to survive many kinds of disasters. In addition, cloud storage also allows companies to store their data far from any disasters that may befall their own data center.
However, when the computer creating the data is in one data center and the computer storing it is in another, on the other side of an Internet connection, there is going to be far more than a few milliseconds of latency. The greater the distance between the two, the more latency you introduce. Unfortunately, many of the cloud storage vendors that do a good job of protecting data may be a very long way from your data center. Using the cloud to store data created in a local data center is therefore a tradeoff between performance and data protection; you cannot have both. There are other tradeoffs to be made, of course, but this is one of the most difficult to deal with.
Latency, not Bandwidth
The problem with moving data into the cloud is latency, not bandwidth. And unlike bandwidth, you cannot fix latency problems with money. If you have enough money, you can lease dozens of Gigabit lines and trunk them together. Do that and you have yourself tens of gigabits of bandwidth. Bandwidth you can buy; physics you cannot. If it takes 20 ms to complete a round trip with one leased line, it will still take 20 ms after leasing dozens of them.
The problem here is that most I/O operations expect an ACK, or acknowledgement, that the file, block, or object has been written. The problem with an ACK is that it requires a round trip: you must transfer the data to the storage location, and then the acknowledgement of that transfer has to traverse the same distance back.
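To see why ACK round trips dominate, here is a rough back-of-the-envelope sketch (the function name and block size are illustrative, not from the article): if each write must wait for its ACK before the next one starts, a single synchronous writer can never exceed one operation per round trip, no matter how much bandwidth is available.

```python
# Illustrative arithmetic: with synchronous writes, each operation waits a
# full round trip before the next begins, so effective throughput is capped
# at block_size / RTT regardless of link bandwidth.

def sync_write_throughput_mb_s(block_size_kb: float, rtt_ms: float) -> float:
    """Upper bound on throughput for one synchronous writer, in MB/s."""
    ops_per_second = 1000.0 / rtt_ms          # one ACK'd write per round trip
    return ops_per_second * block_size_kb / 1024.0

# A 64 KB block over a 20 ms round trip: 50 ops/s * 64 KB = 3.125 MB/s,
# even on a multi-gigabit link.
print(sync_write_throughput_mb_s(64, 20))    # prints 3.125
```

This is exactly why leasing more lines does not help: the bound depends only on the round-trip time, which extra bandwidth does nothing to shorten.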
Assuming you’re using a cloud provider that puts your data a reasonable distance away, you’re looking at a round trip of 1,000 miles or more. It wouldn’t be too bad if that trip were all via fiber, but it won’t be. A good portion of the trip is likely to pass through electrical systems, with the data moving from optical fiber to copper for routing and back onto fiber for the long haul. A website that calculates typical round-trip latency times estimates that most 1,000-mile round trips will take 20 ms or more. And if an application is waiting for an ACK before it moves on to the next file, object, or block, 20 ms is an awfully long time.
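A quick sanity check on that 20 ms figure (the numbers here are standard physics, not from the article): light in optical fiber travels at roughly two thirds of its vacuum speed, about 124,000 miles per second, so pure propagation accounts for around 8 ms of a 1,000-mile round trip before any routing or optical-to-electrical conversion is added.

```python
# Back-of-the-envelope check: fiber propagation alone for a 1,000-mile
# round trip. The remaining delay in a real ~20 ms measurement comes from
# routing, queuing, and optical/electrical conversions.

FIBER_MILES_PER_SECOND = 124_000  # ~2/3 of the speed of light in a vacuum

def propagation_ms(round_trip_miles: float) -> float:
    """Pure fiber propagation delay for a round trip, in milliseconds."""
    return round_trip_miles / FIBER_MILES_PER_SECOND * 1000.0

print(round(propagation_ms(1000), 1))  # ~8 ms of unavoidable physics
```

In other words, less than half of a typical 20 ms round trip is physics you can never remove, and the rest is infrastructure overhead that is largely out of your hands.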
Two Extremes and a Compromise
Consider two extremes of where you can put data: all on premises and all in the cloud. There is nothing wrong with keeping data onsite. It offers the fastest performance and the best security. The one challenge it does have is data protection: you must figure out some way to back the data up and transfer it to another location to protect it from disasters, and the more data you keep onsite, the harder that job becomes.
As mentioned earlier, moving all of your data into the cloud has performance challenges. It’s simply not feasible to have your applications experience a 20 ms latency every time they write data to the storage device. One advantage of this method, however, is that the cloud provider handles data protection for you. Most cloud storage providers automatically replicate your data to multiple locations. The problem is really performance.
The typical choice at this point is, as some call it, the hybrid model. Install some storage hardware on premises and use it as the local cache for the real repository of the data, which is a cloud storage provider. This is the choice for many people as it offers a solid balance of performance and data protection. The hybrid model protects your data very quickly because the system replicates it offsite as soon as it can, but applications immediately receive the ACK they are looking for because they’re writing to the local appliance.
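The hybrid model described above is essentially a write-back cache. Below is a minimal sketch of that idea, with a hypothetical HybridStore class and in-memory dictionaries standing in for the on-premises appliance and the cloud provider; a real product would obviously involve far more machinery. The key behavior is that write() returns its ACK as soon as the local copy lands, while a background thread replicates offsite.

```python
# Minimal write-back cache sketch of the hybrid model (illustrative only):
# ACK immediately on local write, replicate to the cloud asynchronously.
import queue
import threading

class HybridStore:
    def __init__(self):
        self.local = {}                       # stands in for the on-prem appliance
        self.cloud = {}                       # stands in for the cloud provider
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._replicate, daemon=True)
        self._worker.start()

    def write(self, key, data):
        self.local[key] = data                # fast local write...
        self._pending.put(key)                # ...then queue offsite replication
        return "ACK"                          # application unblocks immediately

    def read(self, key):
        if key in self.local:                 # cache hit: local latency
            return self.local[key]
        return self.cloud[key]                # cache miss: full cloud round trip

    def _replicate(self):
        while True:
            key = self._pending.get()
            self.cloud[key] = self.local[key]  # offsite copy lags the local one
            self._pending.task_done()

    def drain(self):
        self._pending.join()                  # block until offsite copy catches up

store = HybridStore()
print(store.write("block-1", b"data"))        # prints ACK before replication completes
store.drain()
```

The read() path also makes the next section's cache-miss problem concrete: any key that has been evicted from the local dictionary would have to come back over the slow cloud path.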
This model is not perfect, however. The first challenge is getting an appliance that is the right size to hold all of your most recent data. If the cache is too small, there will be cache misses, and applications will occasionally ask for data that is only available from the cloud. When that happens, those applications will experience the same latency as in the all-in-the-cloud extreme.
The second challenge with this model is that data is asynchronously protected. The whole point of the model is to acknowledge the write right away and then replicate it offsite asynchronously. Depending on write rates and available bandwidth, the offsite copy of the data may be well behind the onsite copy. This needs careful management to make sure the two don’t get too far out of sync.
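How far behind the offsite copy falls is simple arithmetic (the rates below are assumed for illustration, not from the article): whenever sustained write throughput exceeds upload throughput, the unreplicated backlog grows linearly with the deficit, and that backlog is data which exists only onsite.

```python
# Illustrative arithmetic with assumed numbers: the replication backlog is
# the difference between write rate and upload rate, accumulated over time.

def replication_backlog_gb(write_rate_mb_s: float,
                           upload_rate_mb_s: float,
                           hours: float) -> float:
    """Unreplicated data accumulated after sustained writes, in GB."""
    deficit = max(0.0, write_rate_mb_s - upload_rate_mb_s)
    return deficit * 3600 * hours / 1024.0

# Writing 100 MB/s against a 60 MB/s uplink for 8 hours leaves roughly
# 1,125 GB (~1.1 TB) of data that exists only onsite.
print(round(replication_backlog_gb(100, 60, 8)))  # prints 1125
```

This is why "strong management" matters here: a sustained burst of writes can quietly widen the window of data that a site disaster would destroy.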
A New Option
In 1999, a company called Akamai developed the concept of a content delivery network (CDN). The idea was simple: large web companies would pay them to replicate their content to points of presence (POPs) all over the world. That way, when you access an image or movie from one of those websites, your browser pulls it from a replicated copy that is much closer to you. Today, all major websites use CDNs.
What if a company did this, but for data storage? It could create POPs in colocation facilities around the world, allowing your applications to write to cloud storage that is much, much closer than any alternative. Such a service could offer latency of only a few milliseconds, guaranteed. It would allow you to write all data offsite and have it immediately replicated to other cloud storage providers, giving you the performance benefit of local storage with the data protection benefits of having the data stored somewhere else.
Storing everything onsite has data protection challenges. Storing everything with a typical cloud storage provider has performance challenges. The hybrid storage model is not perfect either, its main challenges being sizing the appliance and managing asynchronous replication. The idea of a CDN for storage presents an interesting alternative to all of these options, giving you the data protection advantages of having data offsite and the performance advantages of having it onsite – or at least really close to onsite.
About ClearSky Data
The ClearSky global storage network combines primary storage, backup and disaster recovery with on-demand scaling and agility. It automatically caches data – moving cold data to the cloud, keeping warm data in metro-based PoPs within 120 miles of customers, and storing a small portion of hot data on-premises. Customers retain the performance and availability of a local storage array with the comprehensive security, low latency and high availability required for enterprise applications.