As VMware became the standard in the data center, it created a huge storage problem, often called the IO blender. Fortunately, just as the IO blender problem was about to reach its peak, hybrid and all-flash arrays came to market, reducing the IO blender’s impact. But do all-flash arrays eliminate the need for performance analysis and tuning?
IO Blender Refresh
When VMware first established itself as a production-quality solution for the enterprise, realizing its full potential required a shared storage array, which connected to servers running the VMware hypervisor via a storage area network (SAN). On each physical VMware server were dozens of virtual machines, all accessing the same storage array at the same time.
To make matters worse, it was the first time most data centers used a clustered file system, VMFS, which enabled critical capabilities like VMware’s vMotion. Now, not only were virtual machines all accessing the same storage array, they were also accessing the same volume or LUN.
The result was more IO than ever going to a single LUN on a single array and, of course, that array was hard disk based. The IO blender was created from the latency of the hard disk drives trying to respond to all of these IO requests. The old performance tricks of applying massive cache and short-stroking drives didn’t help (enough). Hybrid flash and all-flash arrays seemed like the ideal answer. They reduced the latency of hard disks and most environments saw an immediate improvement. But can we declare the death of the IO blender?
The answer is no. The IO blender is alive and well. Flash arrays are not killing the IO blender; they are just moving it. The storage architecture is more than just flash arrays. It also includes the interconnect between the array and the storage network switch, and between the storage network switch and the NICs or HBAs on the physical VMware servers. Poorly behaving apps, hot spots, oversubscribed ports, cache configuration issues, under-provisioned server CPU or memory, imbalanced and bully VMs, rogue clients, suboptimal queue depth settings, out-of-spec hardware components that are about to fail, and several other factors all contribute to performance bottlenecks.
VMware also adds another layer of complexity because storage IO troubleshooting doesn’t stop at the physical machine. IO patterns need to be traced directly to the virtual machine creating them and, most importantly, those patterns need to be correlated with the patterns of all the other virtual machines.
All-flash Can’t Beat Bad Design
A flash array is essentially three parts: the flash NAND, the on-board compute and the storage software. The NAND is basically the same between vendors. And while compute and software efficiency can significantly impact performance, most of the performance influencers are outside the purview of the array. A flash storage system ensures that it can respond to IO requests as fast as it receives them. But once those responses leave the storage system, the flash array is just as dependent on the quality of the infrastructure as a hard disk based system is. For instance, if your multi-pathing fails or cannot handle your workload, flash won’t help. At least one vendor of hybrid and all-flash storage systems reports that over 50% of reported performance problems are caused by non-storage-system factors.
In most cases, a hybrid or all-flash array will deliver some level of improvement, but getting maximum return on the flash investment requires optimizing the components between the application and the flash NAND.
Performance Tuning 101
There are components of troubleshooting that remain unchanged even with the introduction of mission critical, production VMware. The physical connectivity between the storage system and the switch, the interconnectivity between switches and the connectivity to the physical servers all need to be examined, as do configuration settings like queue depths, which are application dependent.
The physical parts of most storage infrastructures, the switches, host bus adapters and the cabling itself, are generally upgraded gradually. New components are expected to work alongside components that may be five years old or more. Even optical cable infrastructures are upgraded gradually as higher performance components demand better quality and lower light loss.
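As a rough illustration of why queue depth depends on the application, Little's Law relates the number of IOs in flight to throughput and latency. The sketch below is a minimal Python example under that assumption; the function names and sample figures are hypothetical and are not drawn from any vendor's tuning guidance.

```python
def required_outstanding_ios(target_iops: float, latency_ms: float) -> float:
    """Little's Law: average IOs in flight = throughput x time per IO."""
    return target_iops * (latency_ms / 1000.0)


def achievable_iops(queue_depth: int, latency_ms: float) -> float:
    """Upper bound on IOPS a single queue can sustain at a given average latency."""
    return queue_depth / (latency_ms / 1000.0)


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the relationship.
    print(required_outstanding_ios(target_iops=20000, latency_ms=1.0))
    print(achievable_iops(queue_depth=32, latency_ms=5.0))
```

A queue depth that is too shallow caps throughput no matter how fast the array is, while one that is too deep lets a single host crowd out its neighbors, which is why the right value varies by workload.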
Because of this constantly evolving design, verification of these connections can’t be a one-time or even a periodic event. Data centers need to continuously monitor the quality and behavior of the infrastructure to make sure it is configured correctly and gets proper upgrades or reconfiguration with the implementation of new components.
There are countless stories of second generation switches being hard set to be compatible with first generation switches, and then never being reconfigured when third generation switches are implemented. Every change or upgrade to the infrastructure comes with a strong possibility of something not working properly.
Proper monitoring will also allow IT to know if an upgrade to an all-flash array will require a new storage network, network component, or software configurations. In many cases, intelligence gained by monitoring analysis enables the current network to be tuned to eliminate the need for a complete upgrade, stretching a few more quarters or even years from the architecture, which, of course, saves the organization substantial money.
It is also not unusual for the cabling infrastructure to degrade over time, even though most data centers have moved to optical connections. The problem is that the connectors at the ends of the optical cables are very susceptible to degradation as they are plugged in, unplugged and replugged. Real-time, on-the-wire analysis of signal quality is required to make sure that these infrastructures don’t create intermittent performance problems. This issue is far more common than most people realize, with the vast majority of data centers suffering from dirty optics without knowing the cause.
Performance Tuning 102: The Abstracted Data Center
Virtualized environments like VMware add another layer of complexity when it comes to performance tuning: the abstracted nature of virtual machines. Each virtual machine on the physical host shares that host’s CPU, memory and network connectivity. It is critical for infrastructure managers to know the performance profile of each virtual machine. That knowledge will help them make sure the IO load is properly balanced across the physical servers.
Understanding a VM’s IO profile will also help them make wise decisions on where to move a virtual machine and how the virtual machine will impact other virtual machines on the target physical system.
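To make the balancing idea concrete, here is a minimal Python sketch that totals per-VM IOPS by host, flags hosts carrying well above the average load, and flags VMs that dominate their host. The record layout, host and VM names, and both thresholds are hypothetical assumptions for illustration; a real monitoring tool would supply measured values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-VM measurements; real values would come from monitoring.
SAMPLE_VMS = [
    {"name": "db01", "host": "esx-a", "iops": 9000},
    {"name": "web01", "host": "esx-a", "iops": 1500},
    {"name": "web02", "host": "esx-b", "iops": 1400},
    {"name": "app01", "host": "esx-b", "iops": 1600},
    {"name": "app02", "host": "esx-c", "iops": 2500},
    {"name": "app03", "host": "esx-c", "iops": 2000},
]


def host_iops(vms):
    """Sum the IOPS of every VM on each physical host."""
    totals = defaultdict(float)
    for vm in vms:
        totals[vm["host"]] += vm["iops"]
    return dict(totals)


def overloaded_hosts(vms, tolerance=1.25):
    """Hosts whose total IOPS exceeds the cross-host average by the tolerance factor."""
    totals = host_iops(vms)
    threshold = mean(totals.values()) * tolerance
    return sorted(host for host, total in totals.items() if total > threshold)


def bully_vms(vms, share=0.6):
    """VMs responsible for more than the given share of their host's IOPS."""
    totals = host_iops(vms)
    return sorted(
        vm["name"]
        for vm in vms
        if totals[vm["host"]] > 0 and vm["iops"] / totals[vm["host"]] > share
    )


if __name__ == "__main__":
    print(overloaded_hosts(SAMPLE_VMS))
    print(bully_vms(SAMPLE_VMS))
```

Running the same check against a proposed placement, before the move is made, shows whether relocating a VM would simply shift the hot spot to the target host.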
Another advantage of understanding VM IO profiles is that it becomes much easier to determine which VMs will actually benefit from flash storage. Most organizations buying an all-flash array don’t throw their hard drive arrays in the trash. If a VM can’t benefit from being on an all-flash array, then there is little reason to move it there. With a thorough understanding of VM performance requirements, the organization should be able to reduce the initial investment in all-flash arrays and curtail the system’s growth. Utilities like VMware’s Storage vMotion make moving VMs between an all-flash array and a hard disk array relatively straightforward.
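One simple way to picture this selection is sketched below in Python. It assumes each VM has a measured IOPS figure, an observed average latency and a target latency, and it treats a VM as a flash candidate only if it is busy enough to matter and is missing its target. The `VmProfile` fields, the `min_iops` cutoff and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VmProfile:
    name: str
    iops: float
    avg_latency_ms: float      # observed on the current (disk-based) array
    target_latency_ms: float   # what the application owner needs


def flash_candidates(profiles, min_iops=500.0):
    """Names of busy VMs that miss their latency target, worst offender first."""
    picks = [
        p for p in profiles
        if p.iops >= min_iops and p.avg_latency_ms > p.target_latency_ms
    ]
    picks.sort(key=lambda p: p.avg_latency_ms / p.target_latency_ms, reverse=True)
    return [p.name for p in picks]


if __name__ == "__main__":
    # Hypothetical profiles for illustration only.
    sample = [
        VmProfile("db01", 9000, 18.0, 5.0),
        VmProfile("vdi07", 2500, 12.0, 10.0),
        VmProfile("file01", 300, 25.0, 20.0),   # too idle to justify flash
        VmProfile("web01", 1500, 4.0, 10.0),    # already meets its target
    ]
    print(flash_candidates(sample))
```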
The problem is that understanding the IO profile without a simultaneous understanding of how the infrastructure is responding to those profiles only tells half the story. The real-time knowledge of VM IO profiles needs coordination and correlation with a real-time line level understanding of the infrastructure itself.
The modern data center is a constantly evolving entity. New virtual machines are brought online daily and infrastructure changes happen on a regular basis. In addition, existing application profiles can change as they either become more trusted in production or begin to lose popularity. All of the changes impact the storage IO workload profile of the data center. The ability to monitor that workload profile so that problems caused by changes can be intercepted before they impact performance will eliminate the fire drills that interrupt the administrator’s day.
Performance Tuning 103: Know What The End of Your Rainbow is
A final aspect of performance tuning is performance prediction. What would it be worth if IT could test the effect of increasing workloads ahead of time and know, today, when performance will begin to violate service level agreements (SLAs)? Monitoring and analysis give details on current workloads. But they also prepare the IT manager to define how far into the future the infrastructure will take the organization and what parts of the infrastructure will need to be replaced first.
This process is called workload modeling. It allows an organization to load an appliance with its IO workload profile and then test new storage systems against its current workload. It also allows the organization to “turn up the dial,” so to speak, and look at how far the current or tested system could take the organization into the future. Understanding whether the storage infrastructure can support 2X, 5X or even 10X the current workload is essential to cost-effective planning.
Solutions from companies like Virtual Instruments connect these two pieces of the puzzle. The monitoring tools can feed IO profiles directly into the workload modeling and load generation tools so the organization can closely replicate its IO environment without having to spend money on building an identical lab.
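The "turn up the dial" idea can be sketched with a toy calculation. The Python example below does not replay a real workload the way a workload modeling appliance would. Instead it assumes, purely for illustration, that the storage path behaves like a single M/M/1 queue, where latency grows as service time divided by one minus utilization. The function names, service time, current IOPS and SLA figure are all hypothetical.

```python
def projected_latency_ms(iops: float, service_time_ms: float) -> float:
    """M/M/1 approximation: latency = service time / (1 - utilization)."""
    utilization = iops * service_time_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")  # demand exceeds what the modeled path can serve
    return service_time_ms / (1.0 - utilization)


def headroom(current_iops, service_time_ms, sla_ms, multipliers=(1, 2, 5, 10)):
    """For each growth multiplier, report whether projected latency stays within the SLA."""
    return {
        m: projected_latency_ms(current_iops * m, service_time_ms) <= sla_ms
        for m in multipliers
    }


if __name__ == "__main__":
    # Hypothetical inputs: 1,500 IOPS today, 0.1 ms service time, 0.3 ms SLA.
    print(headroom(current_iops=1500, service_time_ms=0.1, sla_ms=0.3))
```

Even this crude model shows the shape of the problem: latency climbs slowly at first and then sharply as utilization approaches saturation, so the point at which the SLA breaks can arrive sooner than a linear extrapolation suggests.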
The modern VMware data center is constantly evolving and all-flash arrays are only part of the solution. With the introduction of next generation servers, VM density will increase, which creates an even greater degree of IO randomness. It will also require higher performance network interface cards and switches, but those will need to coexist with previous generation cards and switches.
Without a monitoring capability, IT managers are left to a “best guess” approach to performance troubleshooting, with almost no way of forecasting future performance needs. The combination typically leads either to overbuying hardware or to constantly dealing with unexpected performance shortfalls. It’s the difference between being proactive about the infrastructure and constantly reacting to problems.
The solution is real-time and continuous monitoring of the environment, which allows storage architects and engineers to design an architecture that is not over-provisioned yet is verifiably safe to run the organization’s most critical application workloads. Add the ability to predict, quarters in advance, when the current architecture won’t support organizational growth and demands, and IT can safely buy for today and have a plan for tomorrow.
Sponsored by Virtual Instruments