Standard deduplication is the elimination of redundant data on a single storage system. Whether that system is used for backup or primary data, the goal is to put as much data as possible on a single storage system so that the deduplication algorithm can "see" as much data as possible. The more data the storage system holds, the better the chance of finding redundancy and the higher the effective efficiency rate. Global deduplication essentially expands that effectiveness across multiple storage systems.
A Closer Look at Global Deduplication
There is plenty of confusion over exactly what global deduplication is, and as a result many vendors claim to offer it while using the term loosely. For example, some vendors claim global deduplication because they have a scale-out storage system in which data is optimized across the entire storage cluster. While this can benefit some data centers, it is not true global deduplication; it simply takes advantage of the way a clustered file system works.
Global deduplication spans all discrete storage systems in a defined environment, providing optimization across all of them. Eventually, global deduplication will be able to span disparate storage systems, storage vendors and even geographically separated data centers. This capability is critical when the reality of the modern data center is considered. Most data centers have multiple storage systems for specific use cases, often from multiple vendors. A common global deduplication capability could deliver numerous benefits.
Global Deduplication Hierarchy
Global deduplication on primary storage will need to provide options as to how it is implemented. In theory, deduplication could be applied to all primary storage systems. If a segment of data exists on storage system A, then there would be no need to store a copy of that segment on storage system B, so the deduplication engine would just link storage system B to storage system A for that specific data set. The space savings of this type of implementation could be enormous. But while deduplication itself can be implemented without performance impact, the read latency caused by fetching segments across storage systems through a storage network could be significant enough to impact performance.
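The linking described above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the `GlobalDedupeIndex` class, its fingerprint-to-location map and the system/offset naming are all hypothetical, used only to show how a duplicate write on system B becomes a link back to the copy on system A.

```python
import hashlib

class GlobalDedupeIndex:
    """Toy global index mapping a segment fingerprint to the system
    and location holding the unique copy (hypothetical structure)."""

    def __init__(self):
        self.index = {}  # fingerprint -> (system_id, offset)

    def write_segment(self, system_id, offset, data):
        fp = hashlib.sha256(data).hexdigest()
        if fp in self.index:
            # Duplicate: record a link to the existing copy
            # instead of storing the data again.
            return ("link", self.index[fp])
        # Unique: store locally and register in the global index.
        self.index[fp] = (system_id, offset)
        return ("store", (system_id, offset))

idx = GlobalDedupeIndex()
idx.write_segment("array-A", 0, b"block-1")  # unique: stored on array A
idx.write_segment("array-B", 0, b"block-1")  # duplicate: linked back to array A
```

The space savings come from the second call: array B records only a pointer, but any read of that segment on B must now traverse the storage network to A, which is the latency concern raised above.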
The solution would need to provide the ability to control how that deduplication occurs; there would be a need for a deduplication hierarchy. For example, an all-flash array may want to leverage deduplication for its own data but not have to fetch segments across a storage network. It may be perfectly acceptable, even preferable, for a hard disk array to fetch some data segments from the all-flash array.
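One way to picture such a hierarchy is a per-tier placement policy consulted at write time. The policy table, tier names and return codes below are hypothetical, a sketch of the idea rather than a product feature: the flash tier duplicates data locally rather than accept remote-fetch latency, while the HDD tier is allowed to link across the storage network.

```python
# Hypothetical per-tier policy: may reads on this tier follow
# a cross-system link, or must a local copy be kept?
POLICY = {
    "all-flash": {"allow_remote_fetch": False},
    "hdd":       {"allow_remote_fetch": True},
}

def place_segment(tier, fingerprint, global_index, local_system):
    """Decide how to handle a segment write on `local_system`."""
    owner = global_index.get(fingerprint)
    if owner is None:
        return "store-local"      # unique data: must be stored
    if owner[0] == local_system:
        return "link-local"       # duplicate already on this system
    if POLICY[tier]["allow_remote_fetch"]:
        return "link-remote"      # e.g. HDD tier: fetch over the SAN is fine
    return "store-local"          # e.g. flash tier: keep a local copy anyway

segments = {"fp1": ("array-A", 0)}
place_segment("all-flash", "fp1", segments, "flash-B")  # stores locally
place_segment("hdd", "fp1", segments, "hdd-C")          # links to array-A
```

The trade-off is explicit: the flash tier gives back some capacity savings to preserve read latency, while the hard disk tier takes the full savings.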
The same holds true for a remote copy of data. Deduplication should be leveraged as part of the WAN optimization process, but more than likely the data will need to be intact, though optimized, at the remote site. In other words, the data would be deduplicated within each site, with no requirement to fetch data remotely.
Global Deduplication Implementation
There are a couple of methods for implementing global deduplication. First, a vendor could integrate the same deduplication code on each of its storage systems. While this would not deliver the "any vendor" benefit, it would allow for optimization across systems. The deduplication engine would need to provide a master meta-data management capability that tracks data segments across the storage systems. Considering that most vendors are just now providing primary storage deduplication on a single system, it may take at least several years for vendors to achieve this level of integration across their storage portfolios.
The second method would be to not do the deduplication work in the storage system itself, but instead to have a physical appliance that runs the deduplication code for the enterprise. This approach opens up the possibility of cross-vendor deduplication. It is also more likely that a company focused on deduplication could deliver the flexible control of the deduplication process described above. This approach has the best potential of being available sooner rather than later.
The Benefits of Global Deduplication
The first obvious benefit of global deduplication is the further optimization of storage capacity. With the controls described above, efficiencies could be driven to new levels without much performance impact. But the dollar savings are just the beginning.
The second benefit is efficiency in data movement, whether it is between storage systems or to another facility, potentially even the cloud. If a data set needs to be operated on at another data center, in the cloud, or simply moved to another storage system, only the data that is unique would actually need to be copied.
Imagine a test-dev environment that gets a daily update of data from a primary storage array. With traditional deduplication, data that already exists on that array would not have to be written, saving write I/O time and RAID calculation time, but the data still needs to be sent across the network to the test-dev storage system. For most environments, the network transfer is a much bigger time consumer than writing the data to disk. With global deduplication, the two systems could communicate first and compare metadata tables, and then only the data that is unique would be transferred, significantly reducing test-dev data refresh time.
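The metadata-first exchange above can be sketched as follows. The function name and in-memory fingerprint set are hypothetical stand-ins for a real metadata table; the point is only the two-phase flow: compare fingerprints first, then ship just the segments the target lacks.

```python
import hashlib

def refresh_testdev(source_segments, target_fingerprints):
    """Metadata-first refresh sketch: ship only segments the
    target system does not already hold (hypothetical API)."""
    to_send = []
    for data in source_segments:
        fp = hashlib.sha256(data).hexdigest()
        if fp not in target_fingerprints:
            # Only unique segments cross the network.
            to_send.append(data)
            target_fingerprints.add(fp)
    return to_send

# The target already holds seg-1, so only seg-2 is transferred.
target = {hashlib.sha256(b"seg-1").hexdigest()}
unique = refresh_testdev([b"seg-1", b"seg-2"], target)
```

Since only fingerprints (a few dozen bytes per segment) are exchanged up front, the savings grow with both the redundancy of the data set and the latency of the link.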
This savings would be true of almost any data transfer, even moving a virtual machine from one storage system to another, to a remote facility or to the cloud. Essentially the higher the network latency the greater the payoff that global deduplication can deliver.
Global primary storage deduplication is coming, and its benefits go far beyond getting more capacity per drive. The efficiencies in moving data between storage systems and data centers could open up a whole new era in data mobility. How soon can storage vendors really hope to implement these techniques across their own portfolios? The reality is that it may be years. Instead, users should look to vendors that provide data efficiency technology as an adaptable engine that can be deployed across a variety of storage vendors' solutions.
Sponsored by Permabit
Permabit Albireo data efficiency solutions deliver inline deduplication, compression and thin provisioning, with industry-leading performance, scalability and resource efficiency across a wide variety of platforms and environments. The Albireo family of products is uniquely capable of providing global dedupe across either a single vendor's portfolio or a multi-vendor environment, as discussed above.
Within a portfolio, global dedupe is accomplished by storing meta-data in a distributed grid, accessed remotely by data storage systems and storage applications.
For multi-vendor environments, our SANblox appliance can be used in front of storage-independent enterprise virtualization products to deliver data efficiency in multi-vendor, multi-tier environments.
As global storage efficiency models become requisite, Permabit has the technology and expertise derived from experience with over 10,000 installations to deliver global data reduction with massive scalability, high performance and extreme efficiency.