Deduplication, along with compression, makes more efficient use of premium-priced flash capacity. But capacity efficiency comes with at least some performance impact, especially on all-flash arrays where data efficiency features can't hide behind hard disk drive latency. This has led some all-flash vendors, like Violin Memory, to claim that an on/off switch for deduplication should be a requirement on all-flash arrays. Is that the case?
History of the Deduplication Problem
When networked flash was initially introduced to enterprise data centers, most of the vendors came from the DRAM storage system market. Vendors like Violin Memory, Texas Memory Systems (now IBM) and Kaminario delivered storage systems that were high on performance but short on features. The concern was that each added feature would lower performance, and in their markets performance was everything, or so they believed. These vendors clung to those beliefs as they introduced their first flash systems, which were very fast but very sparse on features.
Then vendors like Pure Storage, SolidFire and Tegile introduced flash systems that were feature rich, or at least had more data management options than the ones mentioned above. The most notable additions were deduplication and compression, which promised to bridge the price gap between flash and HDD. Interestingly, these systems performed at about half the storage I/O rate of the performance-focused systems.
Clearly something was different. The focus on affordability led to systems that had more features but whose hardware was not purpose-built for the low latency and high performance of flash.
Does it Matter?
The big question, though, is does it even matter? Does the focus on purpose-built hardware, with features added only when needed, make a difference? If all you do is look at the numbers, then the answer is yes. A purpose-built flash system without features should be able to deliver 700k+ IOPS in a single unit (i.e. without scale-out). Most feature-rich flash storage systems built on more commodity hardware deliver between 250k and 400k IOPS.
So, does it matter? If you have an environment, or even an aggregate of multiple environments, that requires more than 400k IOPS, then the answer is a resounding yes. But the reality is that the overwhelming majority of data centers don't need anything close to 250k IOPS, let alone 400k IOPS. In fact, most of the data centers we work with cannot generate aggregate performance of more than 50k IOPS. The demand for performance will grow over time, but there is certainly some headroom here.
Always-on Dedupe? We Vote Yes
For most (99%) of data centers, the ability to turn deduplication on and off is a non-issue. In fact, a case could be made that, since it won't impact your performance experience, leaving deduplication and/or compression turned on is a must, because almost every environment will see some efficiency gain. Also, generating the I/O required to make an all-flash array sweat takes aggregate workloads; some of those workloads will deduplicate and others may not. It is simply easier not to separate these workloads and to let deduplication work where it can and remain background noise where it cannot.
Dedupe Will Get Better
The other mistake the anti-deduplication crowd makes is working under the assumption that deduplication as a technology won't get any better. This, again, is simply not the case. Deduplication will get dramatically better: it will get better at identifying duplicates and it will verify redundant data much more quickly.
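To make those two steps concrete, here is a minimal sketch of how inline deduplication typically identifies and verifies duplicates: each incoming block is fingerprinted with a cryptographic hash, looked up in an index, and optionally byte-compared before it is discarded as redundant. This is an illustrative simplification, not any vendor's implementation; the DedupeStore class, the fixed 4 KB block size and the in-memory index are assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096

class DedupeStore:
    def __init__(self):
        self.index = {}    # fingerprint -> physical block id (assumed in-memory for the sketch)
        self.blocks = []   # stand-in for the physical media

    def write(self, data: bytes) -> int:
        """Store one fixed-size block, returning its physical block id."""
        fingerprint = hashlib.sha256(data).digest()
        block_id = self.index.get(fingerprint)
        if block_id is not None and self.blocks[block_id] == data:
            # Duplicate identified and byte-verified: just reference the existing block.
            return block_id
        # Unique block (or an astronomically rare hash collision):
        # write it out and remember its fingerprint for future lookups.
        self.blocks.append(data)
        new_id = len(self.blocks) - 1
        if block_id is None:
            self.index[fingerprint] = new_id
        return new_id

store = DedupeStore()
a = store.write(b"A" * BLOCK_SIZE)
b = store.write(b"A" * BLOCK_SIZE)   # deduplicated, same id as the first write
c = store.write(b"B" * BLOCK_SIZE)   # unique data, new block
print(a == b, a != c, len(store.blocks))   # True True 2
```

Even in this toy form, the cost is dominated by hashing, index lookups and the optional verification compare, which is exactly why faster processors and smarter indexing translate directly into faster deduplication.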
A case in point is Permabit: its deduplication appliance can perform inline deduplication at a rate of 180k IOPS, limited only by the CPU inside the appliance. So even if Permabit does nothing to its software, its deduplication rate will get faster as Moore's law continues to work its magic on processing power.
Maybe Your Dedupe Is Bad?
This is not to imply that vendors who provide the ability to turn deduplication on and off have an inferior product; they simply took a different path to market. They typically started with a performance-rich but feature-poor solution and then added features like deduplication and compression. This gives them the ability to service the performance fringe: the 1% of data centers that actually need that performance will appreciate the capability; the other 99% won't care.
Storage Swiss Take
First, the overwhelming majority of workloads will benefit from deduplication and/or compression. We have also seen significant value in these two data efficiency technologies when they are integrated and work in tandem rather than operating as standalone processes. Second, for most data centers the answer to the dedupe on-or-off question is a resounding "doesn't matter"; their performance demands simply are not going to push any all-flash array, whether it has deduplication or not. Finally, for environments where extreme performance does matter, that purchase is likely to be an entirely separate decision, well outside the purview of core IT.