Those considering the purchase of a storage system that advertises deduplication as a feature need to know what real inline dedupe is, because it matters quite a bit. According to the online SNIA dictionary, there are two types of deduplication: inline & post-process. Inline dedupe is “data deduplication performed before writing the deduplicated data.” Post process dedupe is “data deduplication performed after the data to be deduplicated has been initially stored.”
This is a binary condition, kind of like pregnancy. One cannot be a little bit pregnant. A product is either inline or it is not. Either a product dedupes the data before it writes it to storage or it dedupes it after it writes it to storage. The problem comes from the negative brand equity attached to the term post-process, especially in the backup world. While one can make a solid case that the post-process architecture is superior in many situations, its inefficiencies, such as the requirement for a large landing zone and the extra I/O it generates, proved easy targets for those marketing inline solutions, and so a lot of people feel that inline is better.
Fast forward to primary storage dedupe, where the advantages of inline dedupe become even more pronounced. People don’t want to buy an extra shelf of high-priced flash for a landing zone, and they can’t batch data into a landing zone on a schedule the way a backup system can. They also don’t want to have to schedule the dedupe process itself.
The first vendor to offer primary dedupe, NetApp, chose to implement a post-process approach because it fit well within the Data ONTAP architecture. The landing zone was actually part of the volume, so it didn’t require extra configuration planning from that perspective.
Other vendors chose to implement and market an inline approach. However, there is at least one vendor with a post process architecture marketing what it does as inline. This blog post will not mention the name of the vendor for two reasons. The first is that it is not the point; the point is to educate everyone about the differences and why it matters. The second is that Storage Switzerland did not research the details of the dedupe architectures of all primary storage vendors, so it would be unfair to expose only one vendor if there are others doing the same thing. (Having said that, the vendor in question claimed that everyone does what they do and that is definitely not the case. There are true inline vendors in primary storage.)
In the backup space, it’s easy to tell the difference between inline and post-process products. A post-process product requires a large landing zone and requires administrators to manage the scheduling of the dedupe process. It must write each block of data to the landing zone, read it back from the landing zone, and then possibly write it to its final destination, for a total of two to three IOPs for each block needing deduplication.
By contrast, an inline product dedupes a given block before it leaves primary memory, which tends to be NVRAM so the data survives a power outage. The dedupe process runs and decides whether the block is new or a duplicate. If it is new, it is written to storage (one IOP). If it’s a duplicate, it is discarded (zero IOPs). Either way, metadata is stored, for a total of zero to one IOP for each block to be deduped.
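The per-block accounting in the last two paragraphs can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical function names, not any vendor’s implementation; real systems fingerprint data at sub-block granularity and batch their metadata updates:

```python
import hashlib

def inline_iops(blocks, seen):
    """Inline dedupe: each block is fingerprinted while still in
    memory; only blocks never seen before are ever written."""
    iops = 0
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in seen:        # new block: one write IOP
            seen.add(fp)
            iops += 1
        # duplicate: discarded, zero IOPs (metadata update ignored here)
    return iops

def post_process_iops(blocks, seen):
    """Post-process dedupe: every block is first written to a landing
    zone, read back to be deduped, and unique blocks are written again
    to their final destination."""
    iops = 0
    for block in blocks:
        iops += 2                 # landing-zone write + read-back
        fp = hashlib.sha256(block).hexdigest()
        if fp not in seen:        # unique: one more write IOP
            seen.add(fp)
            iops += 1
    return iops

blocks = [b"A", b"B", b"A", b"C", b"A", b"B"]    # 6 blocks, 3 unique
print(inline_iops(blocks, set()))        # -> 3  (one per unique block)
print(post_process_iops(blocks, set()))  # -> 15 (2 per block + 1 per unique)
```

The same six-block stream costs the post-process model five times as many IOPs here, which is the gap the next sections explore for primary storage.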
The challenge comes when looking at primary storage dedupe. If a vendor dedupes the data before it leaves primary memory, then it is clearly inline dedupe. But what if it transfers the data to flash before making the decision? That is still not inline, and the reason it matters is the number of IOPs the post-process approach generates.
If the un-deduped data is first transferred from primary memory to flash, that is an IOP. It will then require another IOP to read it back into primary memory when it is time to dedupe the data. If the data is unique, it will require another IOP to store it in its final location. That means this approach requires 200 percent more IOPs than an inline approach. The facts that the landing zone doesn’t need to be very big and that the process doesn’t need to be scheduled are irrelevant. What is relevant is that the box will have less IOP capacity than competing solutions. In fact, the vendor whose architecture inspired this blog post says it sees a 75 percent reduction in IOP capacity if you turn on dedupe. If every front-end write actually requires three or four back-end IOPs instead of one, a drop of that magnitude is exactly what you would expect.
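The arithmetic behind those percentages can be worked out directly. This is a back-of-the-envelope model, not any vendor’s actual numbers: it assumes every block is unique (the worst case for dedupe) and a fixed back-end IOP budget, so effective write throughput scales inversely with the number of back-end IOPs per front-end write:

```python
# Worst-case model: all data is unique, so every front-end write turns
# into several back-end IOPs under post-process dedupe (landing-zone
# write, read-back, final write, and possibly a fourth for metadata),
# versus one IOP for true inline dedupe.
iops_inline = 1

for iops_post in (3, 4):
    amplification = (iops_post - iops_inline) / iops_inline
    loss = 1 - iops_inline / iops_post   # fixed budget -> throughput drop
    print(f"{iops_post} back-end IOPs: {amplification:.0%} more IOPs, "
          f"{loss:.0%} throughput reduction")
# -> 3 back-end IOPs: 200% more IOPs, 67% throughput reduction
# -> 4 back-end IOPs: 300% more IOPs, 75% throughput reduction
```

Three back-end IOPs per write yields the 200 percent amplification cited above; a fourth IOP is what it takes to reach the 75 percent capacity reduction the vendor reports.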
A true inline solution plays somewhat of a push-pull game between CPU cycles and IOPs. While inline dedupe requires more processing power to decide if a block needs to be deduped before it is written, it comes with the benefit that there are a lot of blocks that don’t get written at all.
VMware uses the post-process approach in VSAN 6.2. Those creating the product knew it was not inline, but didn’t want to call it post-process. So VMware chose to call it nearline dedupe. While it would have been preferable if VMware had stuck with the official SNIA term (and not used another SNIA term that already means something else), at least VMware was honest enough not to call it inline. When asked about this, Christos Karamanolis of VMware joked that they might have spent more time arguing about the term than the company did developing the product.
This post makes a similar point to the Backup Terminology Matters post: it matters what we call things. If vendors are going to differentiate based on certain features, it’s very important that we agree on what those features are called and what a product must do in order to qualify for them. It’s also important that prospective customers understand what those features are and do their due diligence. What do you think?