The term “noisy neighbor” in the storage context refers to a rogue virtual machine (VM) that periodically monopolizes storage I/O resources to the performance detriment of all the other VM “tenants” in the environment. This phenomenon can become more pervasive as VM density per host increases in data center environments. Unless the proper safeguards are implemented, predictable application performance becomes very difficult to attain, resulting in end user dissatisfaction and a loss of confidence in IT’s ability to provide a reliable service.
A FIFO Free-For-All
In addition to sharing physical server systems, virtualized environments also share a common pool of storage resources. This improves efficiency by increasing utilization of the available server and storage assets in the data center. A shared storage infrastructure is also required to enable capabilities like live server migration and automated resource management. Typically, disk resources in a storage array are configured into multiple LUNs (logical unit numbers), and each LUN is then assigned to multiple virtual machines (VMs). For simplicity and further optimization, most virtual environments place dozens of VMs on a single LUN rather than dedicating a LUN to each VM, since the latter would lead to the creation and subsequent management of hundreds of LUNs.
Traditional storage systems utilize a dual controller architecture. The controllers manage and queue all the read/write I/O requests coming from the VMs, carry out each I/O operation against the storage devices on the back-end, and return the results to the requesting server on the front-end. These I/O requests are typically handled on a first-come, first-served basis, otherwise known as first-in, first-out (FIFO).
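To make the FIFO behavior concrete, here is a minimal, hypothetical simulation. The 1 ms service time, request counts and VM names are illustrative assumptions, not measurements from any real controller:

```python
from collections import deque

# Minimal FIFO model: assume each request costs the controller 1 ms.
SERVICE_MS = 1

queue = deque()
# A rogue VM floods the queue ahead of a well-behaved VM's single request.
for i in range(500):
    queue.append(("rogue-vm", i))
queue.append(("quiet-vm", 0))

clock_ms = 0
completion = {}
while queue:
    vm, req = queue.popleft()          # strict first-in, first-out
    clock_ms += SERVICE_MS
    completion[(vm, req)] = clock_ms

# The quiet VM's lone request waits behind all 500 rogue requests.
print(completion[("quiet-vm", 0)])     # → 501
```

The quiet VM's 1 ms of actual work takes 501 ms to complete, purely because of queuing order; this is the noisy neighbor effect in miniature.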
Prior to the introduction of server virtualization, this process generally worked well. Storage administrators may have had a dozen or so physical servers at most accessing the same storage system, and each often had its own dedicated LUN. Since a relatively small number of servers used a single storage array, performance issues could be readily identified, isolated and remedied relatively quickly.
Now, thanks to server virtualization, there can literally be hundreds of virtualized applications across dozens of physical hosts using that same storage system. The end result is a “storage free-for-all” in which a much larger number of VMs compete for access to storage resources, and many of those VMs often share the same physical LUN. The issue is typically not a lack of storage capacity at the array level but rather a storage I/O choke point at the disk controller and disk media layers.
The Greasy Spoon and the Rogue VM
A good analogy would be the cashier line at a fast food hamburger joint. Think of the cashiers as the disk controllers and the guys flipping burgers and making french fries as the disk devices.
There could be 10 guys working the grill and the fryer, but if there are only two cashiers waiting on 50 customers, it will take more time to process the orders at the register than it will to actually make them in the kitchen.
If this problem wasn’t bad enough, now imagine what would happen if one customer suddenly ordered $10,000 worth of hamburgers. The entire operation would grind to a halt as all the available cashier and kitchen resources plowed through that massive order before the next customer could be served. Imagine how irate you would be if you were the next person in line. No doubt, at least half the customers waiting in line would simply leave the restaurant and go somewhere else, resulting in lost business and a major hit to the reputation of that local franchise.
Like the hamburger stand example above, all it takes is one rogue VM to bring the shared storage environment down like a house of cards. For this reason, many organizations have avoided virtualizing their Tier-1 business applications. From their perspective, the risk of subjecting critical business systems to noisy neighbor conditions is simply not worth it.
Work-Arounds That Don’t Work
How have infrastructure planners responded to noisy neighbor issues? Some storage planners have resorted to overprovisioning storage resources to ensure that critical business systems won’t be impacted by the emergence of a rogue VM. This includes short-stroking disk drives (configuring physical disks to use only the outermost portion of the platter to speed up I/O), limiting the number of VMs assigned to a given storage LUN or storage platform and, of course, installing high-speed flash and SSD either into or alongside an existing array.
While these measures can help to enhance storage performance, they are tactical in nature and are costly to implement. What’s more, they fly in the face of infrastructure consolidation and resource virtualization initiatives. Lastly, they can’t scale to meet the growing data demands of today’s IT environments.
So what action can be taken to mitigate or eliminate the risk of the noisy neighbor syndrome? Some storage manufacturers have attempted to address this issue by implementing storage quality of service (QoS) intelligence at the disk controller level. This allows administrators to assign a fixed number of IOPS on a per-VM basis.
While this may help to some degree, it places all the intelligence within the controller, with no awareness of what’s taking place at the VM layer. Moreover, it still doesn’t address the storage controller bottleneck: all storage I/O still has to pass through the dual-controller front-end. Controller-based QoS may hold noisy neighbors at bay, but it does not eliminate performance issues related to resource contention, nor does it satisfy the I/O requirements of the application generating all those requests.
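Conceptually, a per-VM IOPS cap is just a rate limiter sitting in the controller’s admission path. A minimal token-bucket sketch, assuming illustrative class and parameter names (this is not any vendor’s actual API):

```python
import time

class IopsLimiter:
    """Token bucket capping one VM at a fixed IOPS rate (illustrative sketch)."""
    def __init__(self, iops_limit):
        self.rate = iops_limit           # tokens (I/Os) refilled per second
        self.tokens = float(iops_limit)  # start with a full bucket
        self.last = time.monotonic()

    def try_io(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                  # admit the I/O
        return False                     # defer: VM has exceeded its cap

limiter = IopsLimiter(iops_limit=100)
# A rogue VM fires 1,000 requests in a burst; only ~100 are admitted.
admitted = sum(1 for _ in range(1000) if limiter.try_io())
print(admitted)   # ~100: the initial bucket plus a tiny amount of refill
```

Note the limitation described above: this logic throttles the rogue VM’s admitted rate but does nothing about the shared front-end all of those admitted I/Os still funnel through.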
Pinning Down The Problem
I/O-hungry applications that are prone to morphing into noisy neighbors can be effectively isolated in several ways. First, if it is known that a particular application will generate a lot of storage I/O, its workloads can be segregated by caching its most active data on a local flash or SSD device.
For example, by installing high-speed PCIe flash directly into the host server running the hypervisor, administrators can use caching software to “pin” the VM’s data into a local cache. This allows the VM to access “hot” data directly over the PCIe bus of the host without traversing a network to reach shared storage. This frees the shared storage system from processing these workloads so it can service the less I/O-intensive VMs in the environment, potentially extending the life of that asset.
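The essence of pinning is that one VM’s blocks are exempt from the cache’s normal eviction policy. A simplified sketch, assuming an LRU cache and hypothetical VM names (real caching software is far more sophisticated):

```python
from collections import OrderedDict

class PinnedCache:
    """LRU cache where blocks of a 'pinned' VM are exempt from eviction —
    a simplified model of server-side flash pinning."""
    def __init__(self, capacity, pinned_vm):
        self.capacity = capacity
        self.pinned_vm = pinned_vm
        self.store = OrderedDict()       # (vm, block) -> data, LRU-ordered

    def access(self, vm, block, data):
        """Return True on a cache hit, False on a miss (which inserts)."""
        key = (vm, block)
        if key in self.store:
            self.store.move_to_end(key)  # hit refreshes recency
            return True
        self.store[key] = data
        if len(self.store) > self.capacity:
            # Evict the least recently used *unpinned* block.
            for victim in self.store:
                if victim[0] != self.pinned_vm:
                    del self.store[victim]
                    break
        return False

cache = PinnedCache(capacity=4, pinned_vm="sql-vm")
cache.access("sql-vm", 1, "hot")
cache.access("sql-vm", 2, "hot")
for b in range(10):                      # other VMs churn the cache
    cache.access("web-vm", b, "cold")
print(cache.access("sql-vm", 1, "hot"))  # → True: pinned data survived the churn
```

Without the pinning check, the `web-vm` churn would have evicted `sql-vm`’s blocks, and its “hot” reads would fall back to shared storage.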
Another technique for silencing noisy neighbors is to use storage solutions with built-in awareness of the virtualized server environment they serve. Some vendors have designed array-based storage systems that can non-disruptively peer into the real-time I/O workload of every VM in the environment and dynamically allocate resources to meet the demands of the most I/O-hungry applications, without impacting the other VMs. In addition to silencing noisy neighbors, these offerings relieve administrators from constantly analyzing and re-analyzing performance; performance tuning, in effect, becomes fully automated.
While the above solutions can help, they can still be bottlenecked by a dedicated controller architecture.
Hypervisor-Based Controllers
To remedy this problem, a new approach has emerged for mitigating the risk of noisy neighbors: moving disk controller functionality out of the storage array and into the hypervisor itself. Some vendors have developed storage offerings that are completely software-based and work with any internal server storage resource, such as PCIe flash, SSD or SAS/SATA disk. This enables infrastructure planners to use commodity disk and existing x86 resources to deploy virtualized storage controllers. In other words, each hypervisor hosts a VM-based virtual appliance that acts as its own local storage controller. These appliances are then stitched together across any number of hosts for enhanced resiliency, performance, capacity and scalability.
These scale-out, server-node-based architectures enable administrators to set VM QoS through service-level policies defined at the virtual controller. For example, a Platinum policy may provision PCIe flash resources to high-priority VMs, while a Silver policy doles out server-side SAS disk access. This type of design may be ideal for large computational server farms where there is excess compute capacity.
Solving the noisy neighbor problem in virtualized environments is a key requirement for enabling increased VM density and ensuring the reliable performance of Tier-1 business applications. In fact, the number one reason typically given when explaining why a data center has not reached 100% virtualization is the concern about the performance of critical applications.
Where you are in your storage technology lifecycle will largely determine which approach you take to solve this issue. The good news is that you have options, and the best choice will depend on a number of variables. Storage Switzerland has the information you need to help you find the best solution for your specific situation.