Most Artificial Intelligence (AI) projects start as skunkworks efforts that, when ready for production, are tossed over the wall for IT to handle. In many cases, IT tries to store the AI workload on an existing network attached storage (NAS) system. In our webinar, “Three Reasons Why NAS is No Good for AI and Machine Learning,” we discussed the problems with using a traditional NAS infrastructure for AI. Another option that IT often considers is a storage system designed for High-Performance Computing (HPC) workloads.
How AI Differs from HPC
The problem with using HPC systems is that the HPC workload is fundamentally different from AI. First, AI has much broader appeal than the traditional HPC workload. HPC is typically limited to academia and a few particular industries, such as healthcare and energy. AI, on the other hand, appeals to almost every organization, in use cases ranging from autonomous vehicles to cyber-security.
The second difference between HPC and AI is the data. Data in AI workloads often consists of millions if not billions of small files, which the system collects and analyzes in real-time. Storage system performance matters with AI workloads because AI needs to appear as responsive as the human brain to the user interacting with it.
These small files create a significant challenge for the file system that stores them. Each file carries its own metadata, and most file systems suffer a significant performance drop once the amount of metadata under management crosses a certain threshold. A scalable metadata management engine is therefore critical for a file system that stores AI workload data.
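To make the metadata pressure concrete, here is a back-of-the-envelope sketch. The per-file metadata size (512 bytes) is an assumed, typical order of magnitude, not a figure from any specific file system; the point is simply that metadata grows linearly with file count, so a billion small files means hundreds of gigabytes of metadata the file system must index and serve quickly.

```python
# Back-of-the-envelope sketch (hypothetical numbers): estimate the metadata
# footprint of an AI dataset made up of many small files.

def metadata_footprint_gb(num_files, bytes_per_inode=512):
    """Rough metadata size if each file carries ~512 bytes of
    inode/attribute state (an assumed order of magnitude)."""
    return num_files * bytes_per_inode / 1e9

for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} files -> ~{metadata_footprint_gb(n):,.1f} GB of metadata")
# 1 billion files -> ~512 GB of metadata alone, before any file content
```

A file system whose metadata lives on a single controller or server has to fit this entire working set behind one chokepoint, which is why metadata management that scales out alongside capacity matters for AI workloads.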
The real-time nature of AI is another critical difference from HPC workloads. Most HPC workloads process data in batches. While quickly generating an answer is vital in HPC, the analyzed data is typically already present. AI often analyzes data fed in from a sensor or device that stored its data only moments ago. The AI file system needs to provide both excellent ingest performance as well as excellent egress performance.
Bandwidth is essential to both HPC and AI workloads, but AI deals with very small files in real-time. Streaming very small files puts further pressure on the file system’s metadata performance. Many HPC systems bottleneck when trying to stream millions or billions of small files at high speed.
Can HPC Adapt to AI?
Some HPC storage vendors have tried to address the AI market by adding a cache, often called a “burst buffer,” to address the metadata bottleneck problem. A burst buffer is typically a separate storage volume made up of flash media. The challenge with a burst buffer is that it not only adds expense, but it also adds complexity.
Management of a burst buffer usually occurs outside of the file system, meaning that IT has two components to manage instead of one. There is also the risk that if the workload exceeds the burst buffer’s capacity, processing of the AI stream slows to the speed of the underlying file system.
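The overflow behavior described above can be sketched with a toy model. All of the numbers here (flash bandwidth, disk bandwidth, buffer capacity) are illustrative assumptions, not measurements of any real product; the sketch only shows the cliff: ingest runs at flash speed while the buffer has room, then degrades to the backing file system’s speed once it fills.

```python
# Toy model (illustrative assumptions only): a flash "burst buffer" sitting
# in front of a slower parallel file system.

FLASH_GBPS = 10.0   # assumed burst-buffer bandwidth (GB/s)
DISK_GBPS = 1.0     # assumed backing file-system bandwidth (GB/s)
BUFFER_GB = 100.0   # assumed burst-buffer capacity (GB)

def ingest_time_seconds(total_gb):
    """Time to ingest a dataset: the buffer absorbs what it can at flash
    speed; any overflow lands on the file system at its slower speed."""
    fast_gb = min(total_gb, BUFFER_GB)   # portion absorbed by the buffer
    slow_gb = total_gb - fast_gb         # portion that spills over
    return fast_gb / FLASH_GBPS + slow_gb / DISK_GBPS

print(ingest_time_seconds(50))    # fits in the buffer: 5.0 s
print(ingest_time_seconds(500))   # overflows: 10 + 400 = 410.0 s
```

A 10x larger dataset takes 82x longer in this sketch, because everything past the buffer’s capacity runs at the backing store’s speed. That is the failure mode a continuously streaming AI workload exposes: a burst buffer helps with bursts, not with sustained real-time ingest.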
To address the needs of AI, HPC vendors will likely need to rewrite their file systems or create a new, metadata-friendly file system. Both options are time-consuming. At the speed that AI is moving into the enterprise, most organizations can’t wait for the redevelopment of an old file system. A new file system designed specifically for the AI use case is required.
We discuss the requirements for a new AI file system in our webinar, “Three Reasons Why NAS is No Good for AI and Machine Learning.” As IT tries to fulfill requests to manage AI workloads moving into production, it needs to look for new file systems designed specifically for the AI use case. Extending legacy file systems (NAS) or trying to bend HPC systems into solutions won’t allow the AI system to achieve its ultimate goal of appearing human.