Is QoS Enough to Ensure Virtualized Application Performance?

Maintaining adequate storage performance becomes increasingly critical as IT organizations accelerate their server virtualization deployments. With hundreds, or in some cases thousands, of virtual machines simultaneously accessing a limited pool of shared storage resources, increased storage I/O latency is all but inevitable. To combat this issue, some storage array manufacturers are adding storage Quality of Service (QoS) capabilities to their products. However, will array-based software QoS be enough?

There are several storage infrastructure components that impact virtualized application QoS:

Speed/bandwidth of the Storage Network

Speed of the Storage Controller

Speed of the Storage Media

Network-Controlled QoS

There are ways for storage infrastructure managers to tweak each of these layers to prioritize storage I/O requests for certain hosts. For example, at the storage network layer, the bandwidth available to host network cards can be capped to prevent too much storage I/O traffic from emanating from any given server. Likewise, network switches can be configured to grant higher buffer credits to specific ports, ensuring that certain servers have larger “swim lanes” to the storage they’re connecting to.
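To make the idea of a per-host bandwidth cap concrete, here is a minimal sketch of a token-bucket rate limiter in Python. The class, the host names and the limits are illustrative assumptions only; they are not the configuration interface of any particular NIC or switch.

    import time

    class TokenBucket:
        """Caps the storage bandwidth a single host may consume (illustrative only)."""

        def __init__(self, rate_mb_per_s, burst_mb):
            self.rate = rate_mb_per_s          # sustained MB/s allowed for this host
            self.capacity = burst_mb           # short bursts above the rate are tolerated
            self.tokens = burst_mb
            self.last = time.monotonic()

        def allow(self, io_size_mb):
            """Return True if this I/O fits within the host's current bandwidth allocation."""
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if io_size_mb <= self.tokens:
                self.tokens -= io_size_mb
                return True
            return False                       # otherwise the I/O waits for tokens to refill

    # Hypothetical per-host limits: each server gets its own "swim lane".
    host_limits = {"esx-host-01": TokenBucket(400, 100), "esx-host-02": TokenBucket(200, 50)}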

Storage Controller QoS

At the storage controller layer, some vendors provide administrators with the ability to assign a certain percentage of controller CPU to a volume or LUN. This is designed to ensure that the storage I/O of those volumes or LUNs is processed more rapidly than that of other applications sharing the storage array.
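As a rough illustration of how a controller might honor per-volume CPU shares, the sketch below services queued I/O in proportion to each LUN’s assigned percentage. The LUN names and share values are assumptions for illustration; actual array implementations vary by vendor.

    import random
    from collections import deque

    # Hypothetical per-LUN controller CPU shares (percentages of controller time).
    shares = {"sql-lun": 50, "exchange-lun": 30, "fileshare-lun": 20}
    queues = {lun: deque() for lun in shares}      # pending I/O requests per LUN

    def next_io():
        """Pick the next queued I/O to service, weighted by each LUN's CPU share."""
        backlogged = [lun for lun, q in queues.items() if q]
        if not backlogged:
            return None
        weights = [shares[lun] for lun in backlogged]
        lun = random.choices(backlogged, weights=weights, k=1)[0]
        return queues[lun].popleft()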

Storage Media Accelerated QoS

From a storage media perspective, storage architects can now deploy All-Flash storage arrays to ensure that all application I/O will be serviced rapidly by back-end storage resources. Hybrid arrays, on the other hand, offer a more economical alternative to All-Flash systems by mixing flash capacity with conventional disk drives. Some hybrid arrays include storage tiering software that automatically moves active data from slower disk resources to faster media like flash or SSD. These different types of storage media provide a semblance of QoS by placing a volume or a subset of data on faster media.
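The tiering decision itself can be pictured as a simple heat count per data extent: extents touched frequently are promoted to flash, while cold extents stay on (or fall back to) spinning disk. The threshold and extent granularity in this sketch are invented for illustration; commercial tiering engines use far more sophisticated heuristics.

    from collections import Counter

    access_heat = Counter()        # I/O count per data extent over the current window
    PROMOTE_THRESHOLD = 100        # assumed cutoff between "hot" and "cold" extents

    def record_io(extent_id):
        """Count every read/write against its extent."""
        access_heat[extent_id] += 1

    def rebalance(extent_location):
        """Move hot extents to flash and cold extents back to HDD (illustrative)."""
        for extent_id, heat in access_heat.items():
            if heat >= PROMOTE_THRESHOLD and extent_location.get(extent_id) == "hdd":
                extent_location[extent_id] = "flash"   # promote active data
            elif heat < PROMOTE_THRESHOLD and extent_location.get(extent_id) == "flash":
                extent_location[extent_id] = "hdd"     # demote data that has gone cold
        access_heat.clear()        # start a fresh measurement window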

QoS Complexity

Now, let’s examine some of the pitfalls of these approaches. First, setting bandwidth priorities at the host network card or network switch port layers can present operational management challenges, as all of these settings have to be manually managed, tracked and re-adjusted as virtual workloads migrate and application priorities change. Moreover, since these approaches only work on a per-host basis, they cannot be fine-tuned for individual VM workloads, nor does that tuning follow a VM as it migrates to other hosts. Consequently, this does not provide the granularity IT organizations need to accommodate the dynamic nature of growing virtualized environments.

Disk Controller I/O Congestion

Secondly, the fundamental issue with storage array-based QoS, whether in All-Flash, hybrid or traditional storage arrays, is that it still relies on a fixed controller architecture to manage and prioritize inbound read/write I/O requests from a multitude of virtual machines (VMs). Ultimately, the storage controller itself becomes the I/O bottleneck, regardless of how much QoS intelligence it has.

It is possible to work around the limitations of dual-controller systems by placing limits on the number of hosts or VMs that can attach to these resources; however, this generally requires deploying multiple storage systems to accommodate virtual environments as they grow. The problem with this approach is that it increases both capital and operational costs and limits the ability of the environment to fluidly scale and stay agile.

Lastly, from a storage media perspective, unless an organization is prepared to make an All-Flash investment, there is always the risk of a “cache miss” in traditional and hybrid storage array environments, even when automated storage tiering is in place. During a cache miss, application I/O response times drop from flash speeds to the higher latency of conventional disk. Even if these events are relatively rare, they can be extremely disruptive when they impact a mission-critical application. Moreover, even when the budget exists to make an All-Flash array investment, these architectures are still prone to the controller I/O limitations discussed above.

Hypervisor-Based QoS

Fundamentally, the inherent flaw with all of the above performance design measures is that they take a “bottom up” approach to establishing QoS. In other words, they prioritize storage I/O only AFTER the requests have been received from the requesting host or VM.

What is needed instead is an end-to-end methodology for mapping the storage I/O performance requirements at the host or VM level to the corresponding storage resources on the back-end. In this manner, application QoS can be granularly tuned BEFORE the I/O leaves the VM.

In fact, hypervisor vendors are increasingly building storage software services, like snapshotting, thin provisioning and data replication, directly into the hypervisor operating system itself. This allows for more fine-grained control of storage resources, as they can now be managed for each individual guest OS rather than at the disk array LUN/volume level.

Likewise, storage QoS might be better managed if it were integrated at the host or hypervisor layer rather than at the disk controller layer, creating a top-down approach to storage management. In this architecture, the host would provide the storage controller functionality and serve up access to the appropriate disk storage resources based on the profile of the underlying VM. In this manner, storage performance could be tuned at the VM level BEFORE storage I/O requests leave the requesting application.

Virtualized Resource Control

For example, the host could have access to a mix of high-performance storage resources like PCIe Flash and drive form-factor SSD, along with conventional hard disk drives, and allocate VM access to this media based on pre-defined QoS policies. A mission-critical SQL database might be assigned a Platinum QoS service level and be granted access to PCIe Flash capacity, while a VM providing access to user home directories might carry a Silver QoS level and be given access to hard disk drive storage.
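A minimal sketch of what such a host-side policy map might look like follows. The tier names, media classes and VM assignments are assumptions made purely for illustration; they are not the syntax of any specific product.

    # Hypothetical QoS tiers, each mapped to a class of local storage media.
    qos_tiers = {
        "platinum": {"media": "pcie-flash", "target_latency_ms": 1},
        "gold":     {"media": "ssd",        "target_latency_ms": 5},
        "silver":   {"media": "hdd",        "target_latency_ms": 20},
    }

    # Illustrative per-VM policy assignments made on the host.
    vm_policies = {"sql-prod-01": "platinum", "home-dirs-01": "silver"}

    def place_io(vm_name):
        """Route a VM's I/O to the media class dictated by its QoS tier."""
        tier = qos_tiers[vm_policies.get(vm_name, "silver")]   # unassigned VMs default to silver
        return tier["media"]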

A mix of storage QoS policies could exist within the same host, and profiles could easily be adjusted to allow applications to gain access to faster storage resources as workloads change. As importantly, by moving the storage controller functionality out of the array and into the server, this architecture lets storage planners scale out storage performance as hosts are added to the environment. This is a cost-effective, “just-in-time” approach to achieving “n+1” cloud compute and storage performance scaling in the data center.

Grid-Based QoS

Since storage capacity would not be locked into the servers, it could scale independently of the server hosts. Furthermore, the environment would also be highly resilient, as each independent server node would have an awareness of its peers in the environment, forming a grid-based storage infrastructure.

Data could be striped across nodes to ensure resiliency and redundancy. This would also facilitate VM migration, as workloads could easily be ported across server nodes to accommodate applications as their storage I/O performance requirements change.
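As a simple illustration of grid-style placement, the sketch below stripes a virtual disk’s blocks across the available server nodes in round-robin fashion so that no single node holds all of a VM’s data. The node names and stripe width are assumptions for illustration only.

    # Hypothetical grid of server nodes contributing storage to the shared pool.
    nodes = ["node-a", "node-b", "node-c", "node-d"]

    def stripe_placement(block_index, stripe_width=len(nodes)):
        """Return the node that stores a given block of a virtual disk (round-robin)."""
        return nodes[block_index % stripe_width]

    # A VM's virtual disk ends up spread evenly across the grid, so it remains
    # reachable (and rebuildable) even as the workload moves between hosts.
    layout = [stripe_placement(i) for i in range(8)]
    # ['node-a', 'node-b', 'node-c', 'node-d', 'node-a', 'node-b', 'node-c', 'node-d']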

This also enables organizations to leverage the x86 resources in the data center to achieve even higher levels of compute and storage density.

Conclusion

Traditional storage systems have some inherent architectural limitations when they are integrated with virtualized server infrastructure. Their dual-controller designs were not intended to manage the simultaneous storage I/O requests of hundreds of virtualized applications. Consequently, providing disk controller layer QoS may prove to be “too little, too late” in growing virtualized environments where the disk controller itself IS the storage I/O bottleneck.

Host-based storage software solutions, like those from Gridstore, on the other hand, provide storage planners with the flexibility to leverage existing x86 server resources to deploy virtual storage controllers directly into their physical servers or Hyper-V hosts. By doing so, storage administrators can set storage QoS based on the underlying application or VM to ensure that the appropriate storage I/O resources are always available for that individual business system.

This gives infrastructure planners the flexibility to start small while retaining the ability to scale out large storage grid environments alongside business application growth, all while utilizing existing x86 servers and cost-effective commodity storage. Critically, it provides the fine-grained, per-VM QoS that allows administrators to deliver the storage resources needed for their most important VMs.

Gridstore is a client of Storage Switzerland

As a 22-year IT veteran, Colm has worked in a variety of capacities, ranging from technical support of critical OLTP environments to consultative sales and marketing for system integrators and manufacturers. His focus in the enterprise storage, backup and disaster recovery solutions space spans mainframe and distributed computing environments across a wide range of industries.

One comment on “Is QoS Enough to Ensure Virtualized Application Performance?”
  1. SteveO says:

    Good article, Colm. As we consider storage service optimizations in our organization, multitenancy has emerged as a critical component of our next-gen storage architecture. We’ve been looking at various QoS methodologies for 2+ years, and we came to the same conclusions that you have. One emerging vendor has promise in this area – SolidFire. They employ dedupe, which should help manage the costs. They also employ QoS at a volume/LUN level. This seems advantageous, because we can distinguish between tier 0 and tier 1 disk in the same (scale out) box, at the same $/GB.
