Overcoming the NVMe-oF Blame Game

Non-Volatile Memory Express over Fabrics (NVMe-oF) enables shared storage systems to achieve very low latency, to the point that they can now rival direct-attached storage solutions. Organizations no longer have to suffer through the inefficiencies of direct-attached storage, standard in many modern application deployments. The problem, though, is that if applications don’t see a dramatic performance improvement, the finger of blame will almost certainly point at storage and the storage infrastructure.

IT professionals need to be prepared to troubleshoot performance problems quickly and to prove the network is operating at its full potential. NVMe-oF networks, because of their low latency, make it more difficult to capture the data IT needs to validate network performance. If the network monitoring solution attempts to capture data in real time, it might impact network performance. If the network monitoring solution uses a less impactful polling method, it might miss the information required to confirm network configurations are correct.

NVMe-oF on Fibre Channel (NVMe/FC) is one of the early leaders in the move to NVMe-oF infrastructures. One of the reasons for Fibre Channel’s (FC) early lead in NVMe-oF deployments is its ability to co-exist with the legacy SCSI protocol. FC can support both NVMe and SCSI protocols at the same time, which enables organizations to transition to NVMe gradually. NVMe/FC is also in a good position to help IT professionals confirm the network is performing up to its full potential.

Early on, only a few applications in the environment could take advantage of the low latency and high bandwidth of NVMe/FC infrastructure. Over time, though, as traditional databases continue to scale up and as virtualization and containerized workloads continue to increase in density, they too will realize the benefit from NVMe-oF. The gradual transition to NVMe-oF and FC’s ability to support that transition make it a favorite of many data centers.

The challenge that IT professionals face as they move to NVMe-oF is making sure the infrastructure is operating at full capability. Confirming optimal operation is going to be difficult since, initially, only a few workloads will tax the new infrastructure. When application owners don’t realize the expected performance gains, they will blame the storage infrastructure when, in reality, it is the application itself that needs to be better optimized.

There are several ways to diagnose SCSI networks today. Network management software can capture data either by polling or through a real-time “tap” into the infrastructure with a probe. A polling solution requires no change to the network, but it only captures data at the set polling intervals, so it may miss essential telemetry that occurs between two polls. The potential for missing a significant chunk of telemetry data increases with lower-latency, higher-bandwidth NVMe/FC. The real-time tap method is preferred, but organizations are often reluctant to adopt it because a network tap is intrusive and because of concern over the performance impact that a real-time capture might create.
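To make the polling gap concrete, here is a minimal sketch in Python. It is a toy simulation, not any vendor's monitoring method: the trace is synthetic, and the names (`Sample`, `synthetic_trace`, `poll`, `max_latency`) and all numbers (one sample per microsecond, a 150-microsecond spike, a 1-millisecond polling interval) are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t_us: int          # timestamp in microseconds (hypothetical resolution)
    latency_us: float  # observed I/O latency at that instant


def synthetic_trace(duration_us, base_latency_us,
                    spike_start_us, spike_len_us, spike_latency_us):
    """Build a made-up latency trace with one short spike in it."""
    spike_end_us = spike_start_us + spike_len_us
    return [
        Sample(t, spike_latency_us
               if spike_start_us <= t < spike_end_us
               else base_latency_us)
        for t in range(duration_us)
    ]


def poll(trace, interval_us):
    """Keep only the samples a poller would see at fixed intervals."""
    return [s for s in trace if s.t_us % interval_us == 0]


def max_latency(samples):
    """Worst latency visible in a set of samples."""
    return max(s.latency_us for s in samples)


if __name__ == "__main__":
    # 10 ms of traffic at 20 us latency, with a 150 us spike to 900 us.
    trace = synthetic_trace(10_000, 20.0, 2_300, 150, 900.0)

    print("worst latency, every sample:", max_latency(trace))
    print("worst latency, 1 ms polling:", max_latency(poll(trace, 1_000)))
```

In this toy trace the spike starts and ends between two polls, so the polled view reports only the baseline latency while the full capture shows the spike. Shortening the polling interval narrows the gap but increases the monitoring load, which is the trade-off described above.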

Last year, Cisco released a new feature in its storage networking switches that provides a unique alternative. The new switches include a custom ASIC that captures information in real time without impacting network performance or requiring the insertion of an intrusive network tap. Also compelling is that the ASIC does not require the customer to use only Cisco software to visualize the telemetry data. Third parties like Virtual Instruments can access the data, or the data can be sent to the customer’s data lake so they can use their own tools.

Recently, Cisco also added the capability for this ASIC to capture telemetry data on NVMe/FC simultaneously with SCSI. Now, not only do the switches support NVMe/FC and SCSI at the same time, but they can also capture and deliver telemetry data from both protocols at the same time.

StorageSwiss Take

Faster, lower-latency network infrastructure puts more pressure on analytics tools to capture data quickly. Real-time data capture is a requirement. The problem is that real-time capture has, in the past, meant either intrusive network modifications or settling for lower-fidelity captures. However, Cisco’s ASIC changes the game. It provides IT professionals with the data they need to confirm that the network is performing optimally, so development teams can focus on making sure their code makes the best use of the new low-latency infrastructure.


George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.

Posted in Briefing Note
