What is Global Inline Deduplication?

If you are deploying a private cloud that spans multiple data centers or hundreds of Edge locations, understanding the various forms of deduplication is critical to selecting the right infrastructure for these initiatives. Knowing what you will get from each vendor’s deduplication technology is challenging because vendors use similar names for different capabilities, or coin marketing names that reveal nothing about the underlying technology.

This article will examine several of the more popular deduplication techniques on the market.

Technique       When                    Performance Impact   Capacity Savings   Dev Effort
Post Process    After data is written   Measurable           Good               Easy
Inline          As data is written      Significant          Better             Heavy
Global Inline   Before data is sent     Minimal              Best               Significant

Deduplication Technique Comparisons

Post Process Deduplication

Post-process deduplication is the “afterthought” deduplication algorithm. It is the hallmark of a company that needed to factor deduplication into its initial development effort but didn’t. Instead of integrating deduplication into their storage software’s core code, they created a separate module and process.

The technique writes data to storage, sometimes to a dedicated staging area, and then later, when I/O activity is minimal, it performs the comparisons necessary to remove redundant blocks. It requires two storage states: one where data is not deduplicated and one where it is. Today it is used primarily in backup storage.
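
To make the two-state approach concrete, here is a minimal Python sketch of the idea. The block layout, SHA-256 hashing, and data structures are assumptions for illustration only, not any vendor’s implementation:

```python
import hashlib

block_store = {}    # logical block address -> raw bytes (every write lands here first)
fingerprints = {}   # sha256 digest -> address of the surviving copy
block_map = {}      # logical block address -> address actually holding the data

def write(lba, data):
    """First state: every write is stored in full, duplicate or not."""
    block_store[lba] = data
    block_map[lba] = lba

def post_process_dedupe():
    """Second state: a later, low-I/O pass removes redundant blocks."""
    for lba, addr in list(block_map.items()):
        digest = hashlib.sha256(block_store[addr]).hexdigest()
        if digest in fingerprints and fingerprints[digest] != addr:
            block_map[lba] = fingerprints[digest]   # point at the surviving copy
            del block_store[addr]                   # reclaim the duplicate space
        else:
            fingerprints[digest] = addr

write(0, b"A" * 4096)
write(1, b"A" * 4096)    # the redundant copy still hits the media first
post_process_dedupe()    # only the later pass reclaims the duplicate
```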

Vendors using this technique will claim it is superior because they do not execute a complex algorithm during peak I/O load. While it is true that the algorithm does not run until I/O activity is lower, they are storing all the data even if that data is already stored. In addition to writing the data, the storage software must manage all the other aspects of a write, such as:

  • calculating parity for drive-failure protection
  • updating snapshot trees
  • replicating the data to a remote site.

If that media happens to be flash, each redundant write puts the media one step closer to wearing out.

Inline Deduplication

Inline deduplication compares new data to existing data as the storage software receives it, before writing it to the storage media. It offers the advantage of only writing unique data, which means that some percentage of new writes never occur. The storage software doesn’t need to calculate parity for drive-failure protection, update snapshot trees, or replicate that data to a remote site. It does, however, have the overhead of executing a reasonably sophisticated algorithm in real time, so the quality of the deduplication algorithm, and of the rest of the storage software’s code, becomes relevant.
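
A comparable sketch of the inline approach, under the same assumptions (a hypothetical fingerprint index, SHA-256 as a stand-in hash), shows the key difference: the redundant write never reaches the media.

```python
import hashlib

fingerprints = {}   # sha256 digest -> id of the stored copy
block_store = {}    # block id -> raw bytes (unique blocks only)
block_map = {}      # logical block address -> block id

def write_inline(lba, data):
    """Compare against existing data before the write; only unique blocks land on media."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in fingerprints:
        block_map[lba] = fingerprints[digest]   # duplicate: just add a reference
        return                                  # no parity, snapshot, or replication work
    block_id = len(block_store)
    block_store[block_id] = data                # unique: write exactly once
    fingerprints[digest] = block_id
    block_map[lba] = block_id

write_inline(0, b"A" * 4096)
write_inline(1, b"A" * 4096)   # duplicate detected before the write; nothing new is stored
```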

Inline deduplication eliminates only some of the data redundancy in the infrastructure. Traditional three-tier architectures limit the comparison to the storage system that stores the data. If the organization has multiple storage systems, which most do, and there is redundancy between those systems, the software must write the same data multiple times. Hyperconverged infrastructure (HCI) and virtualized storage software will only deduplicate within like nodes. For example, if there is a cluster of high-performance NVMe flash nodes and a cluster of midrange SAS flash nodes, there will likely be redundant data between those two clusters.

A virtualized environment has the potential for the highest return on the deduplication algorithm, and it is hard to imagine not leveraging deduplication in those use cases. Many virtualized infrastructures require more than one storage system to meet the needs of the workloads they support. A lack of global deduplication across dissimilar storage systems means potential capacity savings are lost.

The lack of global deduplication also means the technology has limited advantages in a multi-site environment. An organization with similar workloads at multiple remote sites must send data from each site even though much of that data is identical across sites. If the remote sites replicate to a shared storage system at the primary data center, that system will store only one copy of the data, but each site still sends redundant data over the WAN. The result is that inline deduplication wastes bandwidth and increases the time it takes to protect each Edge or remote site.

Global Inline Deduplication

Global inline deduplication is similar to inline deduplication in that it looks for data redundancy before storing the data. Again, the quality of the deduplication algorithm is a critical factor. Global inline deduplication adds the ability to check for redundancy across multiple storage systems or, in a converged infrastructure design, across multiple dissimilar clusters. It also optimizes WAN transfers, ensuring that if the primary store already has a copy of the data, the remote site does not send it again. As a result, replicating dozens or even hundreds of remote locations is extremely fast.
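
Conceptually, the main difference is the scope of the fingerprint index. The sketch below assumes a shared, environment-wide index that an Edge site consults before sending a block over the WAN; the names and structure are illustrative, not the VergeOS API.

```python
import hashlib

global_index = set()    # fingerprints known anywhere in the environment (all sites and clusters)
wan_bytes_sent = 0      # running total, to show the bandwidth effect

def replicate_block(data):
    """Send a block from an Edge site only if no copy exists anywhere yet."""
    global wan_bytes_sent
    digest = hashlib.sha256(data).hexdigest()
    if digest in global_index:
        wan_bytes_sent += len(digest)   # duplicate anywhere: only a small reference travels
        return
    wan_bytes_sent += len(data)         # first copy anywhere: the full block crosses the WAN once
    global_index.add(digest)

block = b"same OS image block at every Edge site" * 100
for _ in range(200):        # 200 Edge sites replicating the same block
    replicate_block(block)
print(wan_bytes_sent)       # one full block plus 199 tiny references
```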

Global deduplication provides the maximum level of data optimization for the organization, ensuring that it only stores one copy of data (plus a protected copy and potentially a DR copy). The result is reduced overall storage capacity costs and WAN bandwidth investments.

Determining Deduplication Efficiency

There are two elements of deduplication efficiency. The first is how accurately the algorithm detects redundant data, and the second is the impact deduplication has on the overall I/O performance of the storage infrastructure. Both affect cost. If deduplication does not accurately detect redundancy, the customer will need to buy more capacity than they would with another solution. If the algorithm is inefficient, the customer will need to pay for more RAM and processing power than a storage solution with a more efficient algorithm would require.
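
As a rough, invented example of how accuracy translates into cost, assume 100 TB of logical data and compare two hypothetical reduction ratios:

```python
logical_tb = 100                  # data written by applications (hypothetical)
physical_tb_accurate = 20         # stored by an accurate algorithm (5:1 ratio)
physical_tb_weak = 33             # stored by a weaker algorithm (~3:1 ratio)

ratio_accurate = logical_tb / physical_tb_accurate   # 5.0
ratio_weak = logical_tb / physical_tb_weak           # ~3.0
extra_capacity_tb = physical_tb_weak - physical_tb_accurate
print(f"Extra capacity to purchase: {extra_capacity_tb} TB")   # 13 TB more for the same data
```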

The challenge is measuring efficiency during a proof of concept. While IT can load a data set onto the storage system, it is hard to simulate real-world reduction and the impact of that reduction on performance. Before testing, examine the history of the vendor’s deduplication feature. Did it come with the product from the initial shipment, or did it show up years later? If deduplication was an afterthought, then the chances are high that the algorithm will impact I/O performance.

Another place to look is the vendor’s knowledge base. Do they publish cautions about using deduplication? For example, if they tell you to turn it off in specific situations, there is a good chance the algorithm is inefficient. One vendor suggests a different form of drive-failure protection if the customer uses deduplication. Another recommends not using deduplication with high-capacity nodes. (Let that marinate for a moment.)

Conclusion

Most deduplication solutions ask you to pay a heavy tax to use the feature:

  • more powerful hardware
  • an all-flash-only requirement
  • a higher per-TB license cost
  • less than the full dollar value of the deduplicated capacity

VergeIO is an ultraconverged infrastructure solution. Its product, VergeOS, integrates global inline deduplication into the core code, and it has been there since day one. Integrating it into the code from the start makes it very efficient and repeals the deduplication tax. Customers can use it without impacting performance while enjoying one of the industry’s best deduplication ratios. Because it is global, the customer also benefits from maximum space efficiency, dramatically reducing capacity costs, especially for flash-based storage. The global capabilities also provide WAN-optimized transfers, so no data is sent over the WAN more than once, making it ideal for Edge locations.

VergeOS Enables Edge Computing and Private Cloud.

If you are searching for an alternative to VMware vSAN or Hyper-V, or for a better option for your next storage refresh, VergeOS is worth a look.

Join me today for a virtual CxO roundtable with VergeIO Founder and CTO Greg Campbell and CEO Yan Ness as we take questions on creating Edge Computing and Private Cloud infrastructures. Register Here.

Set a meeting with me for a virtual whiteboard technology session.

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
