In our last blog, we covered the challenges that AI at scale creates for storage infrastructures. To support the coming wave of AI applications, storage infrastructures need to deliver a tremendous amount of storage capacity and retain data for long periods of time, while also ensuring it remains accessible. These systems also need to deliver high performance, not only in terms of sequential small file Input/Output (IO) but especially in terms of metadata performance. This blog looks at current storage infrastructures to see if they can meet the storage challenges of AI at scale.
Direct Attached Storage
Direct attached storage (DAS) is experiencing a rebirth in popularity. Modern applications are now architected in a way that provides protection from both media and node failure. The advantage of using DAS is that it eliminates network IO when processing data. It is also, at least on the surface, less expensive because it can use commodity storage instead of the enterprise-class storage found in a typical shared storage system.
DAS, though, is plagued by its classic shortcomings. DAS systems, even in modern workloads, still end up creating islands of storage that isolate performance and capacity. These clusters are susceptible to having the wrong data on the wrong node, or to making too many copies of data so that more nodes have the data, which significantly drives up the cost of the system. The cost of isolating central processing units (CPUs) is multiplied in AI workloads where graphics processing units (GPUs) are common. Organizations simply can’t afford to have $8,000 GPUs sitting idle because they can’t get to the data.
The networking disadvantage of shared storage systems is quickly being overcome thanks to ever faster networking, more efficient protocols, and parallel file systems. Additionally, a DAS deployment still requires networking for copying data, and the need to increase the number of copies makes the situation worse.
Shared storage seems to be the better option to keep AI workloads running efficiently while making sure costs are contained. The next decision is which type of shared storage IT should select.
Shared scale-up storage is the most common and potentially most familiar of the available shared solutions. Like DAS, shared scale-up storage has classic shortcomings that AI workloads expose faster than traditional workloads would. The most obvious is how much total data the systems can store. Most scale-up systems can barely grow to one petabyte of storage per system, and since most AI at scale workloads will require dozens if not hundreds of petabytes, the organization may require a similar number of scale-up storage systems.
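The capacity arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name and the sample data set sizes are assumptions made for the example, and the one-petabyte ceiling is the rough figure cited above, not a specification for any particular product.

```python
import math


def scale_up_systems_needed(dataset_pb: float, max_pb_per_system: float = 1.0) -> int:
    """Return how many independent scale-up systems are needed to hold a data set.

    dataset_pb        -- total data set size in petabytes
    max_pb_per_system -- capacity ceiling of one scale-up system in petabytes
                         (defaults to the rough one-petabyte figure from the text)
    """
    if dataset_pb < 0 or max_pb_per_system <= 0:
        raise ValueError("sizes must be non-negative and capacity must be positive")
    # A partially filled system still has to be bought and managed, so round up.
    return math.ceil(dataset_pb / max_pb_per_system)


# Hypothetical data set sizes, chosen only to span "dozens" to "hundreds" of petabytes.
for dataset_pb in (24, 100, 300):
    count = scale_up_systems_needed(dataset_pb)
    print(f"{dataset_pb} PB -> {count} separate scale-up systems")
```

Each of those systems is its own management point and its own island of capacity, which is why the count matters as much as the raw capacity.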
Even if capacity challenges were overcome, AI at scale may create performance problems. These systems typically have a finite number of storage controllers. Two controllers are the most common. A typical AI workload is highly parallel, so it can easily overwhelm a small controller set.
Scale-out storage systems may seem like a good alternative. If these systems are flash-based, they may deliver the performance that AI at scale requires, but they typically run into two challenges. First, there is typically a set of control nodes through which all IO routes. These control nodes, like the controllers in scale-up storage, can also become a bottleneck. Second, most scale-out all-flash systems don’t efficiently move data to less expensive media tiers like hard disk drives. The cost of dozens or hundreds of petabytes of flash is prohibitive.
In theory, an organization can combine a scale-out all-flash array with a parallel file system to meet their design goals. The problem is, again, a lack of efficient data movement to less expensive storage. There is also the added problem of supporting a file system from one vendor and hardware from another. AI at scale compounds the problem because of its performance demands. Troubleshooting performance issues when multiple vendors are involved is very difficult.
Another option is the public cloud. The challenge from an AI perspective is the cost of using the public cloud. The public cloud makes the most sense for temporary workloads and for workloads where resource utilization alternates between idle and peak. Again, the goal of an AI infrastructure is to run GPUs at 100% for as long as possible, essentially always at peak. The cost of renting GPUs 100% of the time quickly exceeds the cost of buying the GPUs outright. There is also the problem of the massive data set. Storing dozens or hundreds of petabytes of data in public cloud storage for a long period of time is very expensive. While the public cloud infrastructure may support AI at scale, the cost of the infrastructure makes it impractical.
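A rough break-even calculation illustrates the rent-versus-buy point. The sketch below uses the $8,000 GPU price mentioned earlier in this blog. The hourly rental rate is purely a hypothetical placeholder, not a quote from any cloud provider, and the calculation deliberately ignores power, cooling, hosting, and staffing costs for an owned GPU.

```python
def break_even_hours(purchase_price_usd: float, rental_usd_per_hour: float) -> float:
    """Hours of continuous rental after which renting has cost as much as buying."""
    if purchase_price_usd < 0 or rental_usd_per_hour <= 0:
        raise ValueError("price must be non-negative and rental rate must be positive")
    return purchase_price_usd / rental_usd_per_hour


GPU_PRICE_USD = 8_000.0             # GPU price cited earlier in this blog
ASSUMED_RENTAL_USD_PER_HOUR = 1.00  # hypothetical rate, for illustration only

hours = break_even_hours(GPU_PRICE_USD, ASSUMED_RENTAL_USD_PER_HOUR)
days_at_full_utilization = hours / 24  # the GPU is assumed busy 24 hours a day

print(f"Break-even after {hours:,.0f} rented hours "
      f"(about {days_at_full_utilization:.0f} days at 100% utilization)")
```

Under these assumptions, the rental bill matches the purchase price in under a year of continuous use. A higher hourly rate shortens that period and a lower one lengthens it, so substitute real pricing before drawing conclusions.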
Traditional storage architectures aren’t well suited to the requirements of AI at scale. Organizations need a different type of storage infrastructure. The solution for AI at scale is a storage system that is on-premises. It needs to be shared and, more than likely, file system based. It needs to support true parallel access so that each compute node can directly interact with each storage node without routing through control nodes. It also needs to be multi-tier so that the massive AI data sets can be stored cost-effectively. Finally, it needs to handle metadata very efficiently because of the number of files and objects typical in an AI at scale workload.
In our next blog we discuss how vendors need to architect next generation storage infrastructures to meet the demands of AI at scale.
Sponsored By Panasas