Artificial Intelligence (AI) at scale raises the bar for storage infrastructure in terms of both capacity and performance. It is not uncommon for an AI or machine learning (ML) environment to expect growth to dozens, if not hundreds, of petabytes of capacity. Despite what vendors that only offer all-flash arrays might claim, these environments are simply too large to store affordably on a single all-flash tier. Because of their parallel nature, most of these environments are served almost as well by hard disk drives as they are by flash.
Requirement #1 – High-Performance Networking
It is not uncommon for AI/ML environments to create a cluster of computing servers that use internal or direct-attached storage (DAS). Even though shared storage is much more efficient at using available capacity and distributes the workload more evenly across computing nodes, organizations are willing to sacrifice these efficiencies to eliminate the latency that the network between the computing nodes and the shared storage introduces.
NVMe over Fabrics (NVMe-oF) is a next-generation storage networking protocol explicitly designed for memory-based storage devices like flash and non-volatile RAM. It delivers latencies nearly identical to direct-attached NVMe. NVMe’s deep command and queue depths also make it ideal for highly parallelized workloads, and AI/ML is potentially the most parallel workload of all. NVMe-oF may have been designed specifically for memory-based storage, but it is also tailor-made for AI/ML.
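A rough back-of-the-envelope comparison of command queuing shows why NVMe suits highly parallel workloads. The queue figures below come from the NVMe and SATA specifications; the arithmetic is illustrative only and says nothing about any particular product.

```python
# Back-of-the-envelope comparison of outstanding-command capacity.
# NVMe allows up to 65,535 I/O queues, each up to 65,536 commands deep;
# a SATA device exposes a single NCQ queue of 32 commands.
nvme_queues, nvme_queue_depth = 65_535, 65_536
sata_queues, sata_queue_depth = 1, 32

nvme_outstanding = nvme_queues * nvme_queue_depth
sata_outstanding = sata_queues * sata_queue_depth

print(f"NVMe max outstanding commands: {nvme_outstanding:,}")
print(f"SATA max outstanding commands: {sata_outstanding:,}")
print(f"Ratio: {nvme_outstanding // sata_outstanding:,}x")
```

The exact numbers matter less than the shape of the result: the protocol can keep many compute nodes' requests in flight at once instead of serializing them.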
Requirement #2 – Shared Storage
If NVMe-oF resolves the latency issue between computing and storage, it enables the second requirement: shared storage. An NVMe-oF-connected shared storage solution lets the workload benefit from all of the natural attributes of shared storage. First, all nodes have access to all data, which means the workload can distribute its computing load more evenly. It also means that nodes with Graphics Processing Units (GPUs) can access all the data. Since GPUs are significantly more expensive than CPUs, keeping them busy is a high priority, and shared storage makes that easier.
When a workload’s capacity requirements are measured in dozens, if not hundreds, of petabytes, any gain in storage efficiency provides dramatic cost savings. In a cluster with dedicated drives in each computing node, IT cannot easily reassign available storage capacity to other nodes in the cluster. The lack of resource pooling in the DAS model also means the organization can’t effectively use the high-capacity drives coming to market from manufacturers. A dual-purpose node (computing and storage) can now hold 12 or more 16TB+ flash drives or 18TB+ hard disk drives, capacity a single node may never use effectively. If the AI/ML storage architecture instead pools those same drives, it can allocate them far more granularly. The storage architecture must not only scale out to meet capacity requirements; its storage nodes must also be directly accessible to meet the performance demands.
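A simple calculation, using the drive counts and sizes mentioned above, shows how much raw capacity sits inside each server in a DAS design and why pooling it matters. The node count and utilization percentages below are assumptions for illustration, not measurements.

```python
# Illustrative comparison of DAS vs. pooled capacity utilization.
# Drive counts/sizes follow the example in the text; the node count and
# utilization percentages are assumptions for demonstration only.
nodes = 32                    # compute nodes in the cluster (assumed)
drives_per_node = 12
drive_tb = 18                 # 18TB hard disk drives

raw_per_node_tb = drives_per_node * drive_tb
raw_total_tb = nodes * raw_per_node_tb

das_utilization = 0.40        # assumed: capacity stranded on individual nodes
pooled_utilization = 0.80     # assumed: shared pool allocated on demand

print(f"Raw capacity per node:      {raw_per_node_tb} TB")
print(f"Raw capacity cluster-wide:  {raw_total_tb} TB")
print(f"Usable (DAS, ~40% utilized):    {raw_total_tb * das_utilization:,.0f} TB")
print(f"Usable (pooled, ~80% utilized): {raw_total_tb * pooled_utilization:,.0f} TB")
```

Even modest utilization gains translate into petabytes of capacity that does not have to be purchased twice.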
Requirement #3 – Multi-Tier
Given the size of AI/ML data sets, tiering is almost a must; dozens of petabytes of flash is simply too expensive. In fairness, some AI workloads don’t follow the 80/20 rule, in which 80% of the data is inactive at any given time; these workloads can go from 100% dormant to 100% active. Still, they are highly parallel, and hundreds of lower-performance hard disk drives all feeding the workload at the same time should deliver the performance these workloads need. If nothing else, they can deliver data as quickly as current networking technologies allow.
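The claim that hundreds of hard disk drives can keep a parallel workload fed is easy to sanity-check with arithmetic. The per-drive throughput, drive count, and network figures below are typical published numbers used as assumptions, not benchmarks of any system.

```python
# Sanity check: aggregate throughput of many HDDs vs. the network ceiling.
# Per-drive rate, drive count, and link speeds are assumed typical values.
hdd_count = 600
hdd_mb_per_s = 200                       # sustained large-block read per drive
aggregate_gb_per_s = hdd_count * hdd_mb_per_s / 1000

network_links = 8
link_gbit = 100                          # 100GbE uplinks from the storage tier
network_gb_per_s = network_links * link_gbit / 8

print(f"Aggregate HDD throughput: {aggregate_gb_per_s:.0f} GB/s")
print(f"Network ceiling ({network_links} x {link_gbit}GbE): {network_gb_per_s:.0f} GB/s")
print("Drives outrun the network" if aggregate_gb_per_s > network_gb_per_s
      else "Network outruns the drives")
```

Under these assumptions the disk tier saturates the network before the network saturates the disks, which is the point: for large sequential, highly parallel reads, the spindles are not the bottleneck.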
Requirement #4 – Parallel Access
Parallel access means that each node in the storage infrastructure gives each computing node in the AI/ML cluster direct access to the data it needs; no single control node becomes a bottleneck. This high level of parallelism is critical to AI/ML because of the number of computing nodes that may want simultaneous access to the storage pool, and it is what delivers the throughput that makes hard disk drives viable as a component of AI/ML storage infrastructures. A parallel file system almost always requires a client or agent, but that agent, in addition to providing parallel access, frequently imposes less overhead than the typical NFS client.
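One way to picture the difference is the access pattern: with parallel access, each compute worker reads its own byte ranges directly from the storage nodes that hold them instead of funneling every request through one gateway. The sketch below is a toy model assuming a file striped across storage nodes; the node names and stripe layout are hypothetical.

```python
# Toy model of parallel access: a file striped across storage nodes,
# with each compute worker fetching its own stripes directly.
# Node names and stripe layout are hypothetical, for illustration only.
from concurrent.futures import ThreadPoolExecutor

STRIPE_SIZE = 1 << 20                     # 1 MiB stripes
STORAGE_NODES = ["oss-0", "oss-1", "oss-2", "oss-3"]

def stripe_location(offset: int) -> str:
    """Map a byte offset to the storage node holding that stripe."""
    return STORAGE_NODES[(offset // STRIPE_SIZE) % len(STORAGE_NODES)]

def read_stripe(offset: int) -> bytes:
    """Stand-in for a direct read from the owning storage node."""
    node = stripe_location(offset)
    # A real parallel file system client would issue the read RPC to `node` here.
    return f"{node}:{offset}".encode()

# Each worker reads its stripes directly; no single server sees all the I/O.
offsets = range(0, 16 * STRIPE_SIZE, STRIPE_SIZE)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_stripe, offsets))
print(results[:4])
```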
Requirement #5 – Multiple Protocols
Despite the requirement for parallel access during processing, another requirement is multi-protocol access, which is particularly helpful for ingesting data into the storage infrastructure. Many AI and ML projects receive their data from Internet of Things (IoT) devices, and those devices need to communicate using whatever protocol they ship with. Many communicate via SMB or NFS, and a few use S3. More importantly, almost none use a native parallel file system client.
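In practice, ingest often lands in the same namespace through whichever protocol the source device speaks. Below is a minimal sketch of an S3-style ingest path using the boto3 library; the bucket name, prefix, and mount path are made up for illustration and stand in for whatever the actual environment uses.

```python
# Minimal ingest sketch: pull objects from an S3 endpoint (e.g. an IoT
# gateway's landing bucket) onto a shared file system that training jobs
# later read in parallel. Bucket, prefix, and mount path are hypothetical.
import os
import boto3

s3 = boto3.client("s3")                     # assumes credentials are configured
BUCKET = "iot-landing-zone"                 # hypothetical bucket name
DEST = "/mnt/shared/ingest"                 # hypothetical parallel FS mount

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="sensors/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        local_path = os.path.join(DEST, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
```

The same data could just as easily arrive over NFS or SMB; the point is that the storage system accepts it natively rather than requiring every source to run a parallel file system client.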
Requirement #6 – Advanced Metadata Handling
AI/ML workloads are metadata heavy, although not typically because they use rich metadata the way a media and entertainment workload might. The importance of metadata in AI/ML workloads comes from the sheer number of files involved. In most cases, the dozens to hundreds of petabytes in an AI workload are made up of billions of files. Each of those files has metadata, and just like in other workloads, the bulk of I/O transactions are to and from that metadata. The AI/ML storage infrastructure has to manage the metadata so that the system maintains its performance even as the file count grows. The metadata needs to be distributed across the storage cluster so that all the nodes participate in its management. Vendors might also store metadata on flash in each storage node to make sure the system is always responsive.
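A common way to keep metadata operations from piling up on one server is to shard the namespace across several metadata services, for example by hashing the parent directory of each file. The sketch below is a simplified illustration of that idea, not any vendor's actual scheme; the server names are hypothetical.

```python
# Simplified illustration of distributing metadata ownership across several
# metadata servers by hashing the parent directory, so no single node
# handles lookups for billions of files.
import hashlib
import posixpath

METADATA_SERVERS = ["mds-0", "mds-1", "mds-2", "mds-3"]   # hypothetical names

def metadata_owner(path: str) -> str:
    """Pick the metadata server responsible for a file's directory entry."""
    parent = posixpath.dirname(path)
    digest = hashlib.sha1(parent.encode()).digest()
    return METADATA_SERVERS[digest[0] % len(METADATA_SERVERS)]

for p in ["/train/images/0001.jpg",
          "/train/images/0002.jpg",
          "/train/labels/0001.json"]:
    print(p, "->", metadata_owner(p))
```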
Conclusion
AI/ML workloads are fundamentally different from any other workload the organization may have run in the past. Early AI/ML projects have counted on DAS for data storage. The problem is that DAS doesn’t distribute the load evenly, something that becomes critical as the number of GPUs per AI workload increases. DAS is also highly inefficient, and the waste in capacity and the time spent copying or moving data eliminate the price advantage of cheap internal drives.
Sponsored By Panasas