Software Defined Deduplication for the Cloud

Unlike cloud compute, which can scale up and down on the fly, cloud storage is more permanent. While organizations that use cloud storage pay only for the storage they consume, they pay for that storage month after month. After a few years of use, cloud storage bills can really start to add up. Everything that can be done to reduce the amount of data in the cloud should be done. Data efficiency technologies like deduplication and compression can play a key role in the effort to reduce cloud storage costs.

The Cloud Storage Challenge

Cloud storage is used for off-site backup, archive and even the storage of production data that is going to be processed by cloud compute. But for cloud storage to be viable long term, the cost of that storage has to be driven down. Cloud vendors have gone to great lengths to accomplish this and new, lower cloud storage pricing seems to be announced every few months. But there is a physical aspect to storage that will never go away. The time is now for providers to apply the same data efficiency techniques to their cloud offerings that made backup to disk and all-flash arrays economically viable, most notably deduplication and compression.

The Requirements for Cloud Data Efficiency

The cloud is going to require some new attributes from data efficiency. First and foremost, data efficiency will have to be completely abstracted from the physical hardware and even the infrastructure that it is running on. Further, it even needs to be abstracted from the cloud provider, so that the subscriber is not locked into a particular cloud service. Data efficiency also needs to be siloed per subscriber. While deduplicating across subscribers would add to the efficiency rate, the security concern associated with a single process spanning multiple subscribers' data is too great.

Since the data efficiency solution will not know what type of storage it will be applying its algorithms to, and since it needs to be applied per subscriber, it will likely need to be deployed as a virtual instance for each subscriber, with all of that subscriber's storage traffic routed through the instance.

A software defined data efficiency solution will need to install transparently within the cloud infrastructure, no matter what server and storage hardware is being used. This also means that the customer would not need to wait for the cloud provider to adopt its own data efficiency solution. In many ways the customer may want to avoid using a data efficiency solution from the cloud provider, as it would lock them into that provider's solution, whereas a software defined solution would allow for greater portability. Not only will the algorithms need to be virtualized, but so will the metadata catalog that data efficiency technologies use to manage the process. If done correctly, the data efficiency software should be able to leverage cloud compute to create a massively scalable and highly available data efficient environment.
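To make the moving parts concrete, here is a minimal sketch of a siloed, software-only deduplication store. It is an illustration under assumptions, not any vendor's implementation: the `DedupStore` class, the fixed 4 KB chunk size, SHA-256 fingerprints, zlib compression, and the subscriber names are all hypothetical choices made for the example.

```python
import hashlib
import zlib


class DedupStore:
    """Toy store: fixed-size chunks, SHA-256 fingerprints, zlib compression."""

    def __init__(self, chunk_size=4096):  # chunk size is an assumed value
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> compressed chunk, stored once
        self.catalog = {}  # object name -> ordered list of fingerprints

    def put(self, name, data):
        fingerprints = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:  # only unique chunks consume capacity
                self.chunks[fp] = zlib.compress(chunk)
            fingerprints.append(fp)
        self.catalog[name] = fingerprints

    def get(self, name):
        # Rebuild the object from the catalog's fingerprint list.
        return b"".join(
            zlib.decompress(self.chunks[fp]) for fp in self.catalog[name]
        )

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# One independent store per subscriber keeps deduplication siloed.
stores = {"subscriber-a": DedupStore(), "subscriber-b": DedupStore()}
payload = b"A" * 8192 + b"B" * 4096
stores["subscriber-a"].put("backup-monday", payload)
stores["subscriber-a"].put("backup-tuesday", payload)
```

Because the store is nothing more than two dictionaries, it has no dependency on the hardware or provider underneath it. A real product would persist the catalog durably and would likely use more sophisticated chunking.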

The ROI of Cloud Data Efficiency

The obvious impact of cloud data efficiency is a dramatic reduction in the amount of capacity consumed by the same data footprint. This will allow organizations to store more data in the cloud for a longer period of time. But the obvious cost savings of cloud data efficiency are just the start.
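The capacity argument can be worked through with simple arithmetic. The sketch below assumes a flat price per GB per month and a constant data reduction ratio. The function name and every figure in it (10,000 GB, $0.02 per GB per month, 36 months, a 4:1 ratio) are hypothetical values chosen for illustration, not quoted prices or measured results.

```python
def cumulative_storage_cost(logical_gb, price_per_gb_month, months,
                            reduction_ratio=1.0):
    """Total spend over `months` for a fixed logical data footprint.

    reduction_ratio is logical size divided by stored size (4.0 means 4:1).
    """
    stored_gb = logical_gb / reduction_ratio
    return stored_gb * price_per_gb_month * months


# Hypothetical inputs, for illustration only.
baseline = cumulative_storage_cost(10_000, 0.02, 36)
reduced = cumulative_storage_cost(10_000, 0.02, 36, reduction_ratio=4.0)
savings = baseline - reduced
```

Because the bill recurs every month, the savings grow linearly with the retention period, which is why reduction matters more the longer data stays in the cloud.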

Global Cloud Data Efficiency

Other than the actual capacity consumed, the other big cost associated with cloud storage is the cost to transfer data to and from the cloud. While the data efficiency software for hybrid cloud deployments would be software defined in the cloud, it could be hardware defined in an on-premises data center. Companies that use the same deduplication engine in an on-premises (hardware) deployment would then be able to provide tremendous value to their customers. They could instantiate a software defined version of that solution in the cloud storage provider of the customer’s choice, creating a data efficient gateway between on-premises and cloud storage.

With a similar data efficiency engine in place at both the customer's on-premises data center and the provider’s cloud, an additional payoff would be a massive reduction in the amount of data transferred back and forth. The on-premises instance could check with the cloud instance to make sure that it sends only unique data, and it could reduce that data before making the transfer.
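The check-before-send exchange described above might look like the following sketch. The `CloudInstance` class, its `missing` and `receive` methods, the `sync` function, and the 4 KB chunk size are hypothetical names and values invented for this example. The byte count it returns covers chunk payloads only and ignores the overhead of exchanging fingerprints.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size


class CloudInstance:
    """Stand-in for the deduplication engine running at the cloud provider."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> compressed chunk

    def missing(self, fingerprints):
        # Report which fingerprints the cloud has never seen.
        return {fp for fp in fingerprints if fp not in self.chunks}

    def receive(self, compressed_by_fp):
        self.chunks.update(compressed_by_fp)


def sync(data, cloud):
    """Send only chunks the cloud lacks; return payload bytes transferred."""
    local = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        local[hashlib.sha256(chunk).hexdigest()] = chunk
    needed = cloud.missing(local.keys())
    # Compress before transfer so unique data is also reduced on the wire.
    outgoing = {fp: zlib.compress(local[fp]) for fp in needed}
    cloud.receive(outgoing)
    return sum(len(blob) for blob in outgoing.values())


cloud = CloudInstance()
dataset = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
first_transfer = sync(dataset, cloud)    # every chunk is new to the cloud
second_transfer = sync(dataset, cloud)   # nothing new, so no payload is sent
```

Only the fingerprints cross the wire for data the cloud already holds, which is where the bandwidth saving comes from.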

Affordable Cloud Mirroring

A cloud-ready data efficiency solution would make cloud mirroring a reality. Ever since the demise of Nirvanix there has been a nagging concern about what to do if a cloud provider fails. Moving all that data quickly, assuming there is any notification, will be a big challenge, as it was when Nirvanix collapsed. One of the most cited solutions is cloud mirroring, where data is copied to two cloud providers at the same time. The problem, of course, is that this technique instantly doubles the cost of cloud storage and doubles the amount of bandwidth required to keep both cloud copies up to date in real time.

Software defined data efficiency running on both cloud providers' platforms as well as on-premises will again reduce the capacity consumed at both providers and also sharply reduce the amount of data that needs to be transferred between clouds.

Making Everything Cloud Ready

While some backup, archive and replication solutions have data efficiency capabilities built into them, most do not. And even fewer have any form of cloud connectivity. Those solutions that do have the capability are siloed to just the data that they process. A global, software defined data efficiency engine would enable almost any application to leverage the cloud in an efficient way. And because that single data efficiency engine would span all of an organization's business processes, the effective efficiency rate should improve substantially, which means even less data being transferred and stored.
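The difference between siloed and global deduplication can be illustrated by counting unique chunks. In this sketch each application holds some data in common with the others. The function names, the application names, the 4 KB chunk size, and the sample data are all hypothetical, and real-world overlap between applications will vary.

```python
import hashlib


def fingerprints(data, chunk_size=4096):
    """Return the set of SHA-256 fingerprints for fixed-size chunks."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


def siloed_chunk_count(datasets):
    # Each application deduplicates only against its own data.
    return sum(len(fingerprints(d)) for d in datasets.values())


def global_chunk_count(datasets):
    # One shared index spans every application.
    shared = set()
    for d in datasets.values():
        shared |= fingerprints(d)
    return len(shared)


common = b"X" * 4096  # data that every application happens to hold
datasets = {
    "backup": common + b"b" * 4096,
    "archive": common + b"c" * 4096,
    "replication": common + b"b" * 4096,
}
siloed = siloed_chunk_count(datasets)
global_ = global_chunk_count(datasets)
```

With these sample inputs the siloed approach stores six chunks and the shared index stores three, because the common chunk and the chunk that backup and replication both hold are each kept only once.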


Because of its permanent nature, cloud storage is the most expensive aspect of the cloud, both for subscribers to the service and for the providers themselves. Reducing storage capacity consumption by leveraging a software defined data reduction technique is a win for both. Not only does the customer save on a monthly cost-per-GB basis, the service provider can also save by increasing its storage efficiency and building a more cost effective business model. Software defined data efficiency also provides greater protection from a single cloud provider’s collapse and enables more applications in the on-premises enterprise to take advantage of cloud storage.

This independently developed article is sponsored by Permabit.

Permabit Albireo data efficiency solutions deliver inline deduplication, compression and thin provisioning, with industry leading performance, scalability and resource efficiency across a wide variety of platforms and environments. The Albireo family of products is uniquely capable of providing global dedupe across a single vendor's portfolio, a multi-vendor environment, or the cloud.

Eight years ago George Crump founded Storage Switzerland with one simple goal: to educate IT professionals about all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With 25 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS and SAN, Virtualization, Cloud and Enterprise Flash. Prior to founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.

One comment on “Software Defined Deduplication for the Cloud”
  1. Tim Wessels says:

    Well, it is hard to disagree with the notion that customers should use deduplication methods on data they store in public, private and hybrid storage clouds. The savings in storage space achieved could be between 50 percent and 95 percent, depending on the type of data. And within those percentages will come significantly lower storage costs, even after including the cost of the deduplication appliance (hardware and/or software). Mr. Crump has previously commented on the approach to deduplication taken by StorReduce, which is a software appliance that supports S3 and Swift object storage in private or public use cases, and can deduplicate data at breathtaking speeds based on their testing. I agree with Mr. Crump that anyone who requires long term storage of data in the cloud (public, private or hybrid) should evaluate the use of deduplication as a way to significantly lower their storage costs.
