The dramatic growth of data continues. To help solve the problems associated with that growth, vendors are coming up with new technologies that not only maximize storage capacity but also improve performance. These techniques include thin provisioning, writeable snapshots (clones), and most notably compression and deduplication. IT refers to them collectively as “data efficiency,” and they have now made their way into hyperconverged systems. But because of the shared nature of hyperconverged infrastructure, IT professionals need to consider data efficiency differently when it is used there.
What is Hyperconvergence?
Hyperconvergence is the consolidation of storage, compute and networking into the cluster of physical servers that run virtual machines, applications and storage software. The traditional three-tier architecture dedicated processors to each function, whereas most hyperconverged architectures share the available compute across a wide variety of functions, including data efficiency.
Data Efficiency in a Hyperconverged World
By adding data efficiency techniques like deduplication and compression, hyperconverged infrastructure vendors are making a CPU investment and expecting a return in storage I/O and capacity savings. But because of the shared-everything nature of the environment, IT professionals have to carefully consider how the data efficiency processes consume compute resources.
Data Efficiency Requirements
Storage efficiency and the management of its metadata are CPU intensive. These compute challenges are compounded in the hyperconverged environment because a spike in storage I/O typically coincides with an increase in compute consumption, and that same I/O places more demand on the data efficiency process. As a result, IT needs to consider the compute investment in hyperconverged architectures more carefully than in dedicated systems and tiers. A runaway process could have a ripple effect that impacts every aspect of the environment.
Managing the Performance Impact of Hyperconverged Data Efficiency
Data efficiency comes in several forms: thin provisioning, compression and deduplication. To be most effective, the hyperconverged system should employ all three methods because each brings its own unique benefit to the architecture. The problem is that executing all three of these techniques requires careful management of CPU resources. While those resources may once have been plentiful, in many environments they are now constrained.
Thin provisioning and compression by themselves have a lighter impact on CPU resources than deduplication does. But deduplication has the biggest potential payoff in terms of increased efficiency. Typically, deduplication will yield a 5:1 efficiency gain on active primary data, and if you apply the technique globally across active, inactive and backup data, the efficiency gain can rise to greater than 40:1. With these kinds of returns, the investment in deduplication is worth the effort. But that investment cannot come at the cost of performance, nor can it put data at risk.
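To put those ratios in concrete terms, the short sketch below works out the physical capacity needed for a hypothetical 100 TB of logical data; the figure is purely illustrative, not drawn from any particular deployment.

```python
def physical_capacity_tb(logical_tb: float, dedupe_ratio: float) -> float:
    """Physical capacity required to store logical_tb at a given deduplication ratio."""
    return logical_tb / dedupe_ratio

logical_tb = 100.0  # hypothetical amount of logical (pre-deduplication) data

print(physical_capacity_tb(logical_tb, 5))    # 20.0 TB at 5:1 on active primary data
print(physical_capacity_tb(logical_tb, 40))   # 2.5 TB at 40:1 applied globally
```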
Mitigating the Deduplication Performance Impact
IT can mitigate the deduplication performance impact in a number of ways, and each method has its pros and cons. The IT team needs to decide which method will have the least impact on the organization.
1) In-line Deduplication
The first consideration that vendors make is the deduplication type: in-line or post process. In-line deduplication means that as new data is written to storage, it is compared against existing data before it lands on disk. If the data is redundant, the process updates the metadata table, but no data is written to disk and no parity needs to be calculated and written either. The result should be a reduction in write I/O, which is especially valuable when you use flash storage.
The reduction in write I/O can be significant, but these comparisons need to happen fast enough that they do not noticeably impact overall storage performance. In a massively shared architecture like hyperconvergence, layering these comparisons on top of all the other tasks the CPU has to perform may produce a noticeable performance impact, especially during heavy bursts of write I/O.
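A minimal sketch of the in-line write path, assuming a simple in-memory hash index of fixed-size blocks; the block size, hash function and data structures here are illustrative, not any vendor's implementation.

```python
import hashlib

BLOCK_SIZE = 4096      # illustrative fixed block size
index = {}             # content hash -> (block location, reference count)

def write_block(data: bytes, store: list) -> int:
    """In-line deduplication: compare the incoming block before it reaches disk."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in index:
        # Redundant block: update metadata only; no data or parity is written.
        location, refs = index[digest]
        index[digest] = (location, refs + 1)
        return location
    # Unique block: write it and record it in the metadata table.
    store.append(data)
    location = len(store) - 1
    index[digest] = (location, 1)
    return location
```

The hash lookup sits directly in the write path, which is exactly where the CPU cost described above lands.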
2) Post Process Deduplication
To alleviate some of the performance concerns, some vendors use a post-process technique that deduplicates at times when CPU demand is low. The challenge with this approach is that it increases write I/O instead of eliminating it. Not only do all writes land on the storage infrastructure, the post-process technique must also go back later to update metadata tables and erase the redundant copies it finds. If that data is on flash storage, those erasures consume additional program/erase cycles on the media, raising additional concerns about flash durability.
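The post-process variant, by contrast, lets every write land first and reclaims duplicates in a later background pass. A hedged sketch, using an in-memory dictionary to stand in for the block store:

```python
import hashlib

def post_process_scan(store: dict) -> None:
    """Post-process deduplication: all blocks are already on disk; a background
    pass hashes them, updates the metadata table and erases redundant copies."""
    seen = {}                              # content hash -> canonical location
    for location, data in list(store.items()):
        if data is None:                   # block already reclaimed
            continue
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            store[location] = None         # erase the redundant copy after the fact
            # a real system would also remap references to seen[digest] here
        else:
            seen[digest] = location
```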
Post process also introduces confusion over how much capacity the organization truly has at any given point in time. IT either has to wait until the post-process scans are complete or take a rough guess while they run. While post process solves the CPU problem, it introduces a range of storage capacity and I/O challenges.
3) Partial Deduplication
Another technique to save on processor consumption is to deduplicate only part of each file, or to skip files over a certain size, rather than deduplicating everything. The basic assumption is that larger files are less likely to contain redundant data, so the deduplication software does not have to burn CPU resources segmenting an entire large file. There isn’t enough available data to indicate whether deduplicating only the front part of a file yields results anywhere near those of deduplicating the entire file, nor that the CPU savings are significant enough to be worth the expected loss in storage efficiency. Fine-tuning the existing algorithm, or complementing it with enough processing power to deduplicate complete files, seems to be a better return on the investment.
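A sketch of the partial approach, with a purely illustrative size threshold and scan window; real products choose their own cutoffs, if they publish them at all.

```python
LARGE_FILE_THRESHOLD = 64 * 1024 * 1024   # 64 MB -- illustrative cutoff
PARTIAL_SCAN_BYTES = 4 * 1024 * 1024      # hash only the first 4 MB of large files

def bytes_to_deduplicate(file_size: int) -> int:
    """Partial deduplication: trade storage efficiency for lower CPU cost by
    segmenting large files only in part (small files are processed in full)."""
    if file_size <= LARGE_FILE_THRESHOLD:
        return file_size
    return PARTIAL_SCAN_BYTES
```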
4) Brute Force
A final technique, and potentially the most popular, is the brute force method. Vendors ensure deduplication does not impact performance by making sure nodes are powerful enough to run all the storage management tasks plus normal virtual machine requests. The problem with this approach is that spikes in demand are typically infrequent, which means they impact performance just often enough to be annoying. As a result, the brute force approach raises the cost of the entire environment to address a problem that occurs perhaps 10 percent of the time.
5) Dedicated Resources for Data Efficiency
In the end, the only way to guarantee effective data efficiency without impacting application performance is to dedicate CPUs or cores to the task. The problem with dedicating CPU cores within the hypervisor nodes is that there is still overhead associated with their use, because requests for that CPU have to go through the hypervisor. While certainly an improvement, a better alternative is to use fully dedicated CPUs that sit outside of the hypervisor’s control. That way no performance is lost to hypervisor overhead, and a less expensive processor can be used because it only has to handle a particular task. With a dedicated CPU resource, storage performance is more consistent, and server CPUs are preserved for virtual machines, enabling a more scalable architecture.
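As an illustration of the first, core-level variant, the sketch below pins a data-efficiency worker process to a reserved set of cores on Linux so it never competes with virtual machine vCPUs. The core numbers are hypothetical, and a fully dedicated processor outside the hypervisor (such as an offload card) would not be expressed in host software like this at all.

```python
import os

RESERVED_CORES = {14, 15}   # hypothetical cores set aside for data efficiency work

def pin_to_reserved_cores() -> None:
    """Restrict this worker to the reserved cores so deduplication and
    compression never contend with the cores serving virtual machines."""
    os.sched_setaffinity(0, RESERVED_CORES)   # Linux-only; 0 = current process

if __name__ == "__main__":
    pin_to_reserved_cores()
    print("Worker pinned to cores:", sorted(os.sched_getaffinity(0)))
```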
Conclusion
Hyperconvergence can greatly simplify the IT infrastructure. It empowers IT to be more flexible and responsive to the needs of the business. But there are situations where hyperconvergence’s “shared everything” nature requires greater consideration as IT implements its features and capabilities. Data efficiency is a great example. Organizations need to carefully examine how vendors implement these features, not only to gain the maximum benefit but to do so with minimal performance impact and maximum data protection.
Sponsored by SimpliVity
