Artificial Intelligence (AI) is a broad term that can apply to a range of computing tasks, including machine learning, deep learning, and big data analytics. Many AI projects are still at the proof-of-concept stage, but CIOs and IT managers need to understand that, in the future, almost every business outcome and workflow will use and depend on some form of AI processing. Now is the time to prepare the infrastructure for that eventuality. As AI environments move into production and grow in size and importance, organizations need a strategy to address the challenges that AI at scale will create for both the compute and storage architectures.
Like the Cloud Wave but Bigger
For the last decade, developing a cloud strategy was at the top of every CIO's to-do list. Developing an AI strategy will quickly take its place, and the AI strategy is more critical. The cloud strategy impacts where the organization stores and processes data, while the AI strategy will, quite literally, impact what business the organization does and how it does it. It is also reasonable to expect that most organizations will use each of the AI workload types to drive their businesses.
Addressing the Misconceptions about AI
Before organizations can establish an AI strategy, they need to clear up some AI misconceptions. A common misconception is that AI is a single type of workload. In reality, each sub-type of AI processing (deep learning, machine learning, and big data analytics) is a distinct workload with unique storage and IO characteristics, and each requires different capabilities from the storage architecture. For example, image recognition today is dominated by deep learning techniques, and the workload that deep learning applies to the storage infrastructure is very different from the workload imposed by the other AI technologies. Organizations need a strategy to select a single storage infrastructure that meets the needs of all of these AI workload types.
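To make the contrast concrete, here is a minimal, illustrative Python sketch of two of the access patterns mentioned above. The file paths, batch size, and chunk size are hypothetical assumptions, not measurements from any specific system: deep learning training typically issues many small, random reads across a large set of files, while big data analytics tends to stream a few large files sequentially.

```python
# Illustrative only: hypothetical files and sizes, not a benchmark of any system.
import random

def deep_learning_style_reads(file_list, batch_size=32):
    """Random-access pattern: each training batch touches many small files."""
    batch = random.sample(file_list, min(batch_size, len(file_list)))
    data = []
    for path in batch:
        with open(path, "rb") as f:      # many small, scattered reads
            data.append(f.read())
    return data

def analytics_style_scan(path, chunk_mb=64):
    """Sequential pattern: stream one large file in big contiguous chunks."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_mb * 1024 * 1024):
            yield chunk
```

Both patterns hit the same storage system, but the first stresses metadata handling and small-block random IO, while the second stresses sustained sequential throughput, which is why a single architecture has to serve both well.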
The Journey to an AI Infrastructure
For most organizations, the journey to an AI infrastructure starts with designs that look very similar to High-Performance Computing (HPC) infrastructures. Several similarities make HPC a logical starting point: both HPC and AI environments combine many compute nodes, a parallel file system, and scale-out storage nodes to meet capacity and performance needs. HPC infrastructures represent the evolution of computing from a single monolithic supercomputer to a scale-out compute cluster leveraging dozens, hundreds, or even thousands of commodity nodes.
Having so many nodes creates a problem: how can they all access a single file system at the same time? A shared file system or Network Attached Storage (NAS) system delivers universal access but also creates a performance problem, because the file system head or NAS controller becomes the bottleneck. In HPC, the single-node NAS head gave way to parallel file systems, in which all nodes have direct access to the file system without funneling through a file system controller. The parallel file system enabled HPC storage infrastructures to keep all the nodes in the HPC compute tier busy.
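As a rough illustration of why that shift matters, the following Python sketch models aggregate read bandwidth under the two designs described above. All of the bandwidth figures are made-up assumptions chosen only to show the shape of the scaling, not numbers for any particular NAS or parallel file system.

```python
# Toy model only: every bandwidth figure below is an assumption, not a measurement.
NAS_HEAD_GBPS = 10       # single controller sits in every data path
STORAGE_NODE_GBPS = 5    # per storage node, read directly by clients
CLIENT_NIC_GBPS = 2      # per compute-node network link

def nas_aggregate_bandwidth(num_clients):
    # Every client funnels through the one NAS head, so the head is the cap.
    return min(num_clients * CLIENT_NIC_GBPS, NAS_HEAD_GBPS)

def parallel_fs_aggregate_bandwidth(num_clients, num_storage_nodes):
    # Clients stripe reads directly across storage nodes; the cap is whichever
    # side (client links or storage nodes) runs out of bandwidth first.
    return min(num_clients * CLIENT_NIC_GBPS,
               num_storage_nodes * STORAGE_NODE_GBPS)

if __name__ == "__main__":
    for clients in (4, 32, 256):
        print(f"{clients:>4} clients: "
              f"NAS {nas_aggregate_bandwidth(clients)} GB/s vs "
              f"parallel FS {parallel_fs_aggregate_bandwidth(clients, 32)} GB/s")
```

The point of the toy model is simply that the NAS curve flattens at the controller's limit no matter how many compute nodes are added, while the parallel file system's aggregate bandwidth keeps growing as storage nodes are added.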
AI infrastructures are on a similar path. NVIDIA's DGX-1 and DGX-2 systems, each densely packed with Graphics Processing Units (GPUs), are the equivalent of the AI supercomputer. Today, in the proof-of-concept phase, GPU cost means only a few nodes within an AI cluster contain GPUs. When AI scales, every node may utilize GPUs.
GPUs are more than just the brains behind AI's neural networks; they are the core enabler of this intelligence. Keeping GPUs busy is job one for the rest of the infrastructure, so much of the attention is turning to storage, which has to deliver data to those GPUs as fast as they can process it.
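One common way to keep GPUs fed is to overlap storage reads with compute. The sketch below is a generic, framework-agnostic Python example of that idea; the queue depth, batch loader, and compute callables are illustrative assumptions and do not represent any vendor's data-loading API.

```python
# Minimal sketch: a background thread keeps reading batches from storage into a
# bounded queue so the GPU-side consumer rarely waits on IO. Plain Python only;
# load_batch and run_on_gpu are stand-ins for real storage and compute code.
import queue
import threading

def prefetching_pipeline(load_batch, run_on_gpu, num_batches, depth=4):
    """Overlap storage reads (producer) with compute (consumer)."""
    batches = queue.Queue(maxsize=depth)   # bounded: back-pressure on the reader

    def reader():
        for i in range(num_batches):
            batches.put(load_batch(i))     # blocks if compute falls far behind
        batches.put(None)                  # sentinel: no more data

    threading.Thread(target=reader, daemon=True).start()

    while (batch := batches.get()) is not None:
        run_on_gpu(batch)                  # compute overlaps the next read

if __name__ == "__main__":
    # Stand-in callables so the sketch runs without real storage or a GPU.
    prefetching_pipeline(lambda i: f"batch-{i}", print, num_batches=3)
```

Prefetching like this only hides storage latency up to a point: if the storage system cannot sustain the aggregate read rate the GPUs demand, the queue drains and the GPUs sit idle, which is exactly the problem the storage architecture has to solve.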
Because of the small scale of the initial proof of concept, organizations can, to some extent, ignore the cost of the storage infrastructure and use extremely costly, high-performance all-flash solutions. Scaling these environments for production AI use cases, however, requires a more intelligent approach that leverages commercial parallel file systems. The storage nodes, while still built from commodity hardware, must have specific capabilities and quality control to meet the performance and reliability demands of the various AI workloads. Otherwise, the organization may be forced to buy a separate storage system for each use case.
Conclusion
The various sub-types of AI processing each have unique IO profiles, but all share a common requirement: keeping GPUs busy. In our next blog, we will look at the unique IO profile of each AI workload type and the storage challenges those workloads create at scale.
Sponsored By Panasas