Artificial Intelligence (AI) at scale raises the bar for storage infrastructure in terms of both capacity and performance. It is not uncommon for an AI or machine learning (ML) environment to expect growth to dozens, if not hundreds, of petabytes of capacity. Despite what vendors that only offer all-flash arrays might claim, these environments are simply too large to store affordably on a single all-flash tier. Because of their parallel nature, most of these environments are served almost as well by hard disk drives as they are by flash.
Requirement #1 – High-Performance Networking
It is not uncommon for AI/ML environments to create a cluster of computing servers that use internal or direct-attached storage (DAS). Even though shared storage is much more efficient at using available capacity and distributes the workload more evenly across computing nodes, organizations are willing to sacrifice these efficiencies to eliminate the latency that the network between the computing nodes and the shared storage introduces.
NVMe over Fabrics (NVMe-oF) is a next-generation storage networking protocol explicitly designed for memory-based storage devices like flash and non-volatile RAM. It delivers latencies nearly identical to direct-attached NVMe. NVMe’s deep command and queue depths also make it ideal for highly parallelized workloads, and AI/ML is potentially the most parallel workload of all. NVMe-oF may have been designed specifically for memory-based storage, but it is also tailor-made for AI/ML.
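A rough back-of-the-envelope comparison of command queuing shows why NVMe suits highly parallel workloads. The queue figures below come from the NVMe and SATA specifications; the arithmetic is illustrative only and says nothing about any particular product.

```python
# Back-of-the-envelope comparison of outstanding-command capacity.
# NVMe allows up to 65,535 I/O queues, each up to 65,536 commands deep;
# a SATA device exposes a single NCQ queue of 32 commands.
nvme_queues, nvme_queue_depth = 65_535, 65_536
sata_queues, sata_queue_depth = 1, 32

nvme_outstanding = nvme_queues * nvme_queue_depth
sata_outstanding = sata_queues * sata_queue_depth

print(f"NVMe max outstanding commands: {nvme_outstanding:,}")
print(f"SATA max outstanding commands: {sata_outstanding:,}")
print(f"Ratio: {nvme_outstanding // sata_outstanding:,}x")
```

The exact numbers matter less than the shape of the result: the protocol can keep many compute nodes' requests in flight at once instead of serializing them.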
Requirement #2 – Shared Storage
If NVMe-oF resolves the latency issue between computing and storage, it enables the second requirement: shared storage. An NVMe-oF-connected shared storage solution lets the workload benefit from all of the natural attributes of shared storage. First, all nodes have access to all data, which means the workload can distribute its computing load more evenly. It also means that nodes with Graphics Processing Units (GPUs) can access all the data. Since GPUs are significantly more expensive than CPUs, keeping them busy is a high priority, and shared storage makes that easier.
When a workload’s capacity requirements are measured in dozens, if not hundreds, of petabytes, any gain in storage efficiency provides dramatic cost savings. In a cluster with dedicated drives in each computing node, IT cannot easily reassign available storage capacity to other nodes in the cluster. The lack of resource pooling in the DAS model also means the organization can’t effectively use the high-capacity drives coming to market from manufacturers. A dual-purpose node (computing and storage) can now hold 12 or more 16TB+ flash drives or 18TB+ hard disk drives, capacity a single node may never use effectively. If the AI/ML storage architecture instead pools those same drives, it can allocate them far more granularly. The storage architecture must not only scale out to meet capacity requirements; its storage nodes must also be directly accessible to meet the performance demands.
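A simple calculation, using the drive counts and sizes mentioned above, shows how much raw capacity sits inside each server in a DAS design and why pooling it matters. The node count and utilization percentages below are assumptions for illustration, not measurements.

```python
# Illustrative comparison of DAS vs. pooled capacity utilization.
# Drive counts/sizes follow the example in the text; the node count and
# utilization percentages are assumptions for demonstration only.
nodes = 32                    # compute nodes in the cluster (assumed)
drives_per_node = 12
drive_tb = 18                 # 18TB hard disk drives

raw_per_node_tb = drives_per_node * drive_tb
raw_total_tb = nodes * raw_per_node_tb

das_utilization = 0.40        # assumed: capacity stranded on individual nodes
pooled_utilization = 0.80     # assumed: shared pool allocated on demand

print(f"Raw capacity per node:      {raw_per_node_tb} TB")
print(f"Raw capacity cluster-wide:  {raw_total_tb} TB")
print(f"Usable (DAS, ~40% utilized):    {raw_total_tb * das_utilization:,.0f} TB")
print(f"Usable (pooled, ~80% utilized): {raw_total_tb * pooled_utilization:,.0f} TB")
```

Even modest utilization gains translate into petabytes of capacity that does not have to be purchased twice.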
Requirement #3 – Multi-Tier
Given the size of AI/ML data sets, tiering is almost a must; dozens of petabytes of flash is simply too expensive. In fairness, some AI workloads don’t follow the 80/20 rule, in which 80% of the data is inactive at any given time; these workloads can go from 100% dormant to 100% active. Still, they are highly parallel, and hundreds of lower-performance hard disk drives all feeding the workload at the same time should deliver the performance these workloads need. If nothing else, they can deliver data as quickly as current networking technologies allow.
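The claim that hundreds of hard disk drives can keep a parallel workload fed is easy to sanity-check with arithmetic. The per-drive throughput, drive count, and network figures below are typical published numbers used as assumptions, not benchmarks of any system.

```python
# Sanity check: aggregate throughput of many HDDs vs. the network ceiling.
# Per-drive rate, drive count, and link speeds are assumed typical values.
hdd_count = 600
hdd_mb_per_s = 200                       # sustained large-block read per drive
aggregate_gb_per_s = hdd_count * hdd_mb_per_s / 1000

network_links = 8
link_gbit = 100                          # 100GbE uplinks from the storage tier
network_gb_per_s = network_links * link_gbit / 8

print(f"Aggregate HDD throughput: {aggregate_gb_per_s:.0f} GB/s")
print(f"Network ceiling ({network_links} x {link_gbit}GbE): {network_gb_per_s:.0f} GB/s")
print("Drives outrun the network" if aggregate_gb_per_s > network_gb_per_s
      else "Network outruns the drives")
```

Under these assumptions the disk tier saturates the network before the network saturates the disks, which is the point: for large sequential, highly parallel reads, the spindles are not the bottleneck.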
Requirement #4 – Parallel Access
Parallel access means that each node in the storage infrastructure gives each computing node in the AI/ML cluster direct access to the data it needs; no single control node becomes a bottleneck. This high level of parallelism is critical to AI/ML because of the number of computing nodes that may want simultaneous access to the storage pool, and it is what delivers the throughput that makes hard disk drives viable as a component of AI/ML storage infrastructures. A parallel file system almost always requires a client or agent, but that agent, in addition to providing parallel access, frequently imposes less overhead than the typical NFS client.
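One way to picture the difference is the access pattern: with parallel access, each compute worker reads its own byte ranges directly from the storage nodes that hold them instead of funneling every request through one gateway. The sketch below is a toy model assuming a file striped across storage nodes; the node names and stripe layout are hypothetical.

```python
# Toy model of parallel access: a file striped across storage nodes,
# with each compute worker fetching its own stripes directly.
# Node names and stripe layout are hypothetical, for illustration only.
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 1 << 20                     # 1 MiB stripes
STORAGE_NODES = ["oss-0", "oss-1", "oss-2", "oss-3"]

def stripe_location(offset: int) -> str:
    """Map a byte offset to the storage node holding that stripe."""
    return STORAGE_NODES[(offset // STRIPE_SIZE) % len(STORAGE_NODES)]

def read_stripe(offset: int) -> bytes:
    """Stand-in for a direct read from the owning storage node."""
    node = stripe_location(offset)
    # A real parallel file system client would issue the read RPC to `node` here.
    return f"{node}:{offset}".encode()

# Each worker reads its stripes directly; no single server sees all the I/O.
offsets = range(0, 16 * STRIPE_SIZE, STRIPE_SIZE)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_stripe, offsets))
print(results[:4])
```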
Requirement #5 – Multiple Protocols
Despite the requirement for parallel access during processing, another requirement is multi-protocol access, which is particularly helpful for ingesting data into the storage infrastructure. Many AI and ML projects receive their data from Internet of Things (IoT) devices, and those devices need to communicate using whatever protocol they ship with. Many communicate via SMB or NFS, and a few use S3. More importantly, almost none use a native parallel file system client.
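In practice, ingest often lands in the same namespace through whichever protocol the source device speaks. Below is a minimal sketch of an S3-style ingest path using the boto3 library; the bucket name, prefix, and mount path are made up for illustration and stand in for whatever the actual environment uses.

```python
# Minimal ingest sketch: pull objects from an S3 endpoint (e.g. an IoT
# gateway's landing bucket) onto a shared file system that training jobs
# later read in parallel. Bucket, prefix, and mount path are hypothetical.
import os
import boto3

s3 = boto3.client("s3")                     # assumes credentials are configured
BUCKET = "iot-landing-zone"                 # hypothetical bucket name
DEST = "/mnt/shared/ingest"                 # hypothetical parallel FS mount

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="sensors/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = os.path.join(DEST, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
```

The same data could just as easily arrive over NFS or SMB; the point is that the storage system accepts it natively rather than requiring every source to run a parallel file system client.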
Requirement #6 – Advanced Metadata Handling
AI/ML workloads are metadata heavy, although not typically because they use rich metadata the way a media and entertainment workload might. The importance of metadata in AI/ML workloads comes from the sheer number of files involved. In most cases, the dozens to hundreds of petabytes in an AI workload are made up of billions of files. Each of those files has metadata, and just like in other workloads, the bulk of I/O transactions are to and from that metadata. The AI/ML storage infrastructure has to manage the metadata so that the system maintains its performance even as the file count grows. The metadata needs to be distributed across the storage cluster so that all the nodes participate in its management. Vendors might also store metadata on flash in each storage node to make sure the system is always responsive.
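A common way to keep metadata operations from piling up on one server is to shard the namespace across several metadata services, for example by hashing the parent directory of each file. The sketch below is a simplified illustration of that idea, not any vendor's actual scheme; the server names are hypothetical.

```python
# Simplified illustration of distributing metadata ownership across several
# metadata servers by hashing the parent directory, so no single node
# handles lookups for billions of files.
import hashlib
import posixpath

METADATA_SERVERS = ["mds-0", "mds-1", "mds-2", "mds-3"]   # hypothetical names

def metadata_owner(path: str) -> str:
    """Pick the metadata server responsible for a file's directory entry."""
    parent = posixpath.dirname(path)
    digest = hashlib.sha1(parent.encode()).digest()
    return METADATA_SERVERS[digest[0] % len(METADATA_SERVERS)]

for p in ["/train/images/0001.jpg",
          "/train/images/0002.jpg",
          "/train/labels/0001.json"]:
    print(p, "->", metadata_owner(p))
```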
Conclusion
AI/ML workloads are fundamentally different from any other workload the organization may have run in the past. Early AI/ML projects have counted on DAS for data storage. The problem is that DAS doesn’t distribute the load evenly, something that becomes critical as the number of GPUs per AI workload increases. DAS is also highly inefficient, and the waste in capacity and the time spent copying or moving data eliminate the price advantage of cheap internal drives.
Sponsored By Panasas