Traditionally, systems create data, process it, and then store it for a time before accessing it again in response to a user request. Essentially, data is processed in batches. In the modern data center, the move is toward real-time analysis to drive machine learning and artificial intelligence. Data is now streamed, and applications operate on it as it is created or very soon thereafter. The problem for storage networking is that if the data flow is interrupted, real-time becomes downtime.
In modern environments, network performance remains important. The network has to keep up with all-flash arrays and scale-out compute clusters that analyze data rapidly enough to simulate human brain response times. As organizations move into these environments, IT needs to ensure the storage network keeps pace by making sure all connections, including inter-switch links (ISLs) and host connections, run at full bandwidth.
A potentially bigger challenge is identifying problems or hotspots in the storage network. Most problems are identified only after the fact, and the “alert” is users complaining about performance, which means the real-time, streaming environment has already screeched to a halt.
Another problem is that most network monitoring solutions base their monitoring functions on polling intervals: every so often they poll the switch and gather information about events that happened minutes ago. Capturing data this way is like reading every other paragraph of a book; you might get a general idea of what the book is about, but you’ll miss many key points.
Modern storage networks need to provide built-in, real-time telemetry data that is analyzed by software from either the switch vendor or a third party. Because the analysis software runs on the switch and analyzes data in real time, it misses nothing. In an environment like AI or machine learning, network utilization can change dramatically within seconds, so not missing any aspect of its performance is critical.
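To make the difference between interval polling and continuous telemetry concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than real switch data or a vendor API: the utilization trace is made up (a link idling at 20% with one 30-second burst to 95%), and `build_trace`, `poll`, and `stream` are hypothetical helper names.

```python
# Hypothetical per-second link utilization (percent) over ten minutes.
# Assumption: the link idles at 20% except for one 30-second burst at 95%.
def build_trace(duration_s=600, burst_start=200, burst_len=30,
                baseline=20, burst=95):
    return [burst if burst_start <= t < burst_start + burst_len else baseline
            for t in range(duration_s)]


def poll(trace, interval_s):
    """Take one snapshot every interval_s seconds, like a polling monitor."""
    return trace[::interval_s]


def stream(trace):
    """See every sample, like telemetry captured continuously on the switch."""
    return list(trace)


trace = build_trace()
polled_peak = max(poll(trace, interval_s=300))  # snapshots at t=0 and t=300
streamed_peak = max(stream(trace))              # every one-second sample

print(f"peak seen by 5-minute polling: {polled_peak}%")       # 20%
print(f"peak seen by streaming telemetry: {streamed_peak}%")  # 95%
```

In this sketch the poller takes snapshots at 0 and 300 seconds, so the burst at 200 seconds never appears in its view. A poller that reads cumulative counters instead of snapshots would average the burst into the whole interval rather than skip it, but the spike would still not be visible.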
The challenge for switch manufacturers is how to provide real-time analytics without impacting another key requirement: performance. In the past, intercepting real-time network information without impacting performance required external taps, which are challenging to install after the network infrastructure is implemented. Now vendors are implementing custom ASICs in their switching hardware that capture this information without any impact on switch performance.
The advantage for IT is that once it upgrades to switches that provide telemetry capabilities, it can monitor the network in real time and detect infrastructure problems long before they impact production performance. Vendors of software that analyzes telemetry data are building machine learning into their products too. Once enabled, these products will better alert IT professionals to potential problems and make recommendations for remediation. Eventually the network will have AI and take corrective actions on its own, based on the telemetry data and on how it has observed the user correcting a similar problem in the past.
StorageSwiss Take
The evolution in network troubleshooting is critical. Organizations can’t hire enough IT staff to respond to every infrastructure anomaly, so the network will have to take care of itself. The first step, though, is collecting this data, and infrastructure with built-in telemetry via ASICs seems to be the logical way forward.
To learn more about the next generation of networking, watch Storage Switzerland and Cisco discuss the topic in our on-demand webinar, “Faster, Smarter, Simpler – The New Requirements in Storage Networking.”