NFS started life as a protocol for unstructured data. Then, in the early 2000s, NetApp and Veritas started to position NFS as an ideal way to host Oracle databases. NFS was easier to work with and performed admirably when compared to more conventional block storage systems. Later, VMware and other hypervisor vendors added support for NFS as a way to host virtual machines, which was logical since VMs are essentially files. NFS became a protocol for mission-critical workloads, but something was missing when compared to mission-critical Fibre Channel: the ability to monitor and diagnose NFS-specific storage problems.
NFS is, obviously, an IP-based protocol, and there are plenty of solutions that will monitor and diagnose IP networking issues. But NFS is more than just an IP protocol. It is an IP storage protocol, and the NFS system is managed by the storage team, not the network team. That means complaints about performance also go to the storage team. Additionally, most network monitoring tools do not focus on NFS storage issues.
Mission Critical NFS: What Could Go Wrong?
NFS storage systems can experience many of the same storage performance problems as their Fibre Channel counterparts, but they also have some problems that are unique to NFS. IT administrators need to be able to detect and remedy common issues like rogue clients and noisy neighbors, latency between a server or virtual machine and the NFS storage system, and, of course, any write performance issues.
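As a rough illustration of spotting a noisy neighbor, the sketch below flags any client that accounts for more than a chosen share of all NFS operations in a sample window. The log format, the client names, and the 50% threshold are all assumptions made for this example; a real monitoring tool would collect this data from the storage system or the wire.

```python
from collections import Counter


def flag_noisy_clients(op_log, share_threshold=0.5):
    """Return {client: share} for clients above share_threshold.

    op_log is a hypothetical iterable of (client, operation) pairs,
    for example parsed from a capture or an export of per-client stats.
    """
    counts = Counter(client for client, _ in op_log)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        client: n / total
        for client, n in counts.items()
        if n / total > share_threshold
    }


# Made-up sample window: vm-a issues three of the four operations.
sample_log = [
    ("vm-a", "READ"),
    ("vm-a", "WRITE"),
    ("vm-a", "GETATTR"),
    ("vm-b", "READ"),
]

print(flag_noisy_clients(sample_log))
```

Counting operations is only one possible signal. Bytes transferred or latency per client could be tracked the same way.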
Unique to NFS are concerns around metadata and, in a scale-out NAS environment, bottlenecks between nodes in the cluster. NFS, like any filesystem, uses metadata to track information about each file it stores. This information can be as basic as creation and modification dates or as sophisticated as the serial number of the device that created the data. Any access to any file has to be handled via metadata, and in NFS metadata can be as much as 80% of the IO stream. Detecting issues with metadata early is critical to scaling NAS to its full capabilities.
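To make the metadata share concrete, here is a minimal sketch that computes what fraction of an operation mix is metadata. It assumes per-operation counts are already available (for example from client or server statistics) and treats only READ and WRITE as data operations, which is a simplification chosen for this example. The sample counts are invented to match the 80% figure above and are not measurements.

```python
# Assumption for this sketch: only READ and WRITE move file data;
# every other operation (GETATTR, LOOKUP, ACCESS, ...) counts as metadata.
DATA_OPS = {"READ", "WRITE"}


def metadata_fraction(op_counts):
    """Return the share of operations that are metadata, from 0.0 to 1.0."""
    total = sum(op_counts.values())
    if total == 0:
        return 0.0
    metadata = sum(
        n for op, n in op_counts.items() if op.upper() not in DATA_OPS
    )
    return metadata / total


# Illustrative, made-up operation mix.
sample_counts = {
    "GETATTR": 50,
    "LOOKUP": 20,
    "ACCESS": 10,
    "READ": 15,
    "WRITE": 5,
}

print(f"metadata share: {metadata_fraction(sample_counts):.0%}")
```

Tracking this ratio over time, rather than as a single snapshot, is what would show a metadata problem developing.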
As NFS became widely used for mission-critical workloads, problems appeared with traditional scale-up NAS storage systems and led to organizations ending up with dozens of NAS systems. To overcome these problems, vendors created scale-out NAS systems. While these solved the capacity scaling problem, they created an even greater metadata performance problem. They also added a new problem: cluster node bottlenecks. As a scale-out NAS system scales, the nodes that make up the cluster must communicate with each other at a rate that grows much faster than the node count. A larger number of nodes requires much more communication. Any issue with the network between these nodes could easily lead to poor system response time.
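A quick way to see why inter-node traffic becomes a concern is to count how many node pairs exist as a cluster grows. The sketch below assumes a full mesh, where every node may need to talk to every other node. That is an assumption for illustration, since real scale-out designs differ. Under it, the number of pairs is n(n-1)/2, which grows quadratically with the node count.

```python
def full_mesh_links(nodes):
    """Number of distinct node pairs in a cluster of the given size."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


# Doubling the node count roughly quadruples the number of pairs.
for n in (4, 8, 16, 32):
    print(f"{n:>2} nodes -> {full_mesh_links(n):>3} node pairs")
```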
Solving Mission Critical NAS Storage Problems
Any mission-critical storage system will have issues that arise as the use of and demands on the system increase. Overcoming these problems requires proactive monitoring of the environment that provides both real-time and historical feedback. In our on-demand webinar, “The Five Problems Facing Business-Critical NFS Deployments,” we discuss the challenges that mission-critical NAS faces and how to overcome them.

