When designing a storage infrastructure for an artificial intelligence (AI) or deep learning (DL) workload, the default assumption is that an all-flash array (AFA), or something even faster, must be at the heart of the design. The problem is that as AI and DL workloads become more mainstream, the capacity requirements of these environments continue to scale. It is not uncommon for an AI or DL workload to demand dozens of petabytes. As a result, AFAs, or any memory-based storage, may be the wrong answer.
There are different types of AI and DL workloads, and the term is sometimes applied to workloads that aren’t really AI or DL at all. The workloads that tend to create problems for an AFA-only infrastructure are those built on frameworks like TensorFlow, or on data pipelines like Kafka and Hadoop, where petabytes of data are active at all times.
Why are AFAs Bad for AI/DL?
AFAs are not particularly bad for AI or DL; they may simply be impractical. While the flash that goes into the typical all-flash array continues to drop in price per GB, a full-scale AI/DL workload, as noted above, is measured in multiple petabytes. At that scale, hard drives remain a significantly less expensive option.
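To see the scale of the difference, here is a back-of-the-envelope sketch in Python; the per-GB prices are illustrative assumptions, not quotes, and media cost is only one component of total system cost:

```python
# Back-of-the-envelope media cost comparison at petabyte scale.
# Prices are illustrative assumptions only; real pricing varies widely
# and excludes controllers, enclosures, power, and data services.
FLASH_PER_GB = 0.10   # assumed $/GB for enterprise flash
HDD_PER_GB = 0.02     # assumed $/GB for high-capacity hard drives

capacity_pb = 15
capacity_gb = capacity_pb * 1_000_000  # 1 PB = 1,000,000 GB (decimal)

flash_cost = capacity_gb * FLASH_PER_GB
hdd_cost = capacity_gb * HDD_PER_GB

print(f"{capacity_pb} PB on flash: ${flash_cost:,.0f}")  # $1,500,000
print(f"{capacity_pb} PB on HDD:   ${hdd_cost:,.0f}")    # $300,000
```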
Even using an AFA as a cache for the most active part of the AI/DL workload may not be practical. The working set is typically so large that the cache is overrun long before the job completes, which means the workflow must continue reading the vast majority of its data from hard disk drives. AI/DL workloads also violate the typical 80/20 rule, under which 80% of data is inactive; their data is either very idle or very active. If it takes 15 petabytes of information to feed a neural network, the entire 15 petabytes will not be read at once, but some large subset of it will be, and the next run will read a different subset. Often there is no predicting which subset will be required when. As a result, the whole dataset has to be active and available on storage that can deliver it quickly.
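A quick sketch shows why caching breaks down at this scale. If reads are spread unpredictably across the dataset, as described above, the expected hit rate is roughly the ratio of cache size to working-set size; the sizes below are assumptions for illustration:

```python
# If accesses land unpredictably across the whole dataset, the best a
# cache can do is roughly cache_size / working_set_size.
# Both sizes below are illustrative assumptions.
working_set_tb = 15_000   # 15 PB working set
cache_tb = 200            # a (generous) 200 TB all-flash cache

expected_hit_rate = cache_tb / working_set_tb
print(f"Expected cache hit rate: {expected_hit_rate:.1%}")  # ~1.3%
# Nearly 99% of reads still land on the hard-disk tier, so the HDD
# layer, not the flash cache, determines delivered throughput.
```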
Even if the organization can afford enough flash for an AFA tier to make sense, it then takes on the added complexity of moving petabytes of information around the infrastructure to balance performance against cost. Data tiering is already hard; at the petabyte scale, it might be impossible.
Although experts sometimes describe these workloads as random IO workloads, AI/DL is not as random as a database, the workload where AFAs shine. AI/DL workloads are typically file-based, accessing billions of files within an unstructured data store. These files are small, but they are not 4K or 8K, the typical read sizes in database IO. Instead, AI/DL IOs tend to be 100K or more. While not as big as the files a production movie studio might generate, they are much larger than the typical database IO. AI/DL workloads sit somewhere between a random workload and a sequential one. Is it Randential?
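A rough service-time model makes the point about IO size. On a hard drive, a random read pays a fixed seek-and-rotate cost before any data transfers, so larger IOs amortize that cost far better; the drive characteristics below are typical assumptions, not measurements:

```python
# Rough per-IO service time on a hard drive: seek/rotate + transfer.
# Drive characteristics are typical assumptions for illustration.
SEEK_MS = 8.0           # average seek + rotational latency, ms
TRANSFER_MB_S = 200.0   # sustained media transfer rate, MB/s

def effective_throughput(io_kb: float) -> float:
    """Effective MB/s for random reads of a given IO size."""
    transfer_ms = (io_kb / 1024) / TRANSFER_MB_S * 1000
    ios_per_sec = 1000 / (SEEK_MS + transfer_ms)
    return ios_per_sec * io_kb / 1024

for size_kb in (8, 128):
    print(f"{size_kb:>4} KB random reads: {effective_throughput(size_kb):5.1f} MB/s")
# 8 KB:   ~1.0 MB/s  (seek-dominated, database-style IO)
# 128 KB: ~14.5 MB/s per drive, multiplied across hundreds of drives
```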
Another challenge is that most AFAs are block-based. This means that, to host an AI or DL workload, the organization needs to add a file system on top of the AFA, which adds to the cost and complexity of the solution. It also adds performance overhead, since it is another layer the workload must navigate.
Is Object Storage the Answer?
An alternative is to use object storage as the primary storage for the AI/DL environment. While experts often think of object storage as slow, inexpensive storage for secondary workloads like archive and backup, with the right object storage software it may also be an ideal candidate for AI and DL. Most of the frameworks that drive AI and DL workloads, like TensorFlow, are already S3-native, making them ready to support any S3-compatible object storage solution. Native support by these frameworks means the infrastructure requires no additional file system, reducing cost, complexity, and overhead.
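As a minimal sketch of what that looks like in practice, TensorFlow's tf.data pipeline can stream training data straight from an S3-compatible endpoint; the endpoint, bucket, and credentials below are hypothetical placeholders, and on recent TensorFlow versions the s3:// filesystem plugin ships in the separate tensorflow-io package:

```python
import os
import tensorflow as tf
import tensorflow_io  # noqa: F401  registers the s3:// filesystem in TF >= 2.6

# Point TensorFlow's S3 client at an on-prem, S3-compatible object store.
# Endpoint, bucket, and credentials are hypothetical placeholders.
os.environ["S3_ENDPOINT"] = "http://objectstore.example.internal:8080"
os.environ["AWS_ACCESS_KEY_ID"] = "training-user"
os.environ["AWS_SECRET_ACCESS_KEY"] = "training-secret"

# Stream TFRecord shards directly from the object store; no extra
# file system or staging copy is required.
files = tf.data.Dataset.list_files("s3://training-data/imagenet/train-*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)
)
```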
A vital requirement for the object storage solution, though, is that it support highly parallel read/write access. An object store with dozens or hundreds of hard-disk-based nodes can deliver a tremendous amount of throughput, and many AI/DL workloads count on that throughput to keep data flowing to their graphics processing units (GPUs). The object store must be able to provide parallel access to any set of nodes in the cluster and not be bottlenecked by a set of central control nodes.
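From the client side, that parallelism can be as simple as issuing many concurrent S3 reads, each of which a distributed object store can serve from a different node. A minimal sketch using boto3, with a hypothetical endpoint, bucket, and key layout:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

# Hypothetical on-prem S3-compatible endpoint and bucket.
s3 = boto3.client("s3", endpoint_url="http://objectstore.example.internal:8080")

def fetch(key: str) -> bytes:
    """Read one object; in a distributed store each request can be
    served by a different node, so reads scale out in parallel."""
    return s3.get_object(Bucket="training-data", Key=key)["Body"].read()

keys = [f"imagenet/shard-{i:05d}.tfrecord" for i in range(1024)]

# Dozens of concurrent readers keep many disks and nodes busy at once,
# which is what sustains throughput to the GPUs.
with ThreadPoolExecutor(max_workers=64) as pool:
    for payload in pool.map(fetch, keys):
        pass  # hand each shard to the training pipeline here
```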
Any system supporting AI/DL must also handle the large amount of metadata these environments create. The large AI/DL metadata set is mostly the result of the billions, potentially trillions, of files stored, not of any sophisticated information within each file. A metadata controller, or even a cluster of controllers, may eventually hit a scale limit that forces the organization to split its AI/DL workloads. Splitting workloads raises costs, increases inefficiency, and makes operations far more difficult.
Instead, the object store needs to embed the metadata within each storage node, so the processing of the metadata is seamless and infinitely scalable.
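One common way to get there, and roughly the approach OpenStack Swift (the technology behind SwiftStack) takes with its ring, is to compute an object's location from its name, so any node or client can find data without asking a central metadata service. A simplified sketch, leaving out the partitions, zones, and replica placement rules a real ring adds:

```python
import hashlib

# Hypothetical 100-node cluster; every node knows this same list.
NODES = [f"storage-node-{i:03d}" for i in range(100)]

def locate(object_name: str, replicas: int = 3) -> list:
    """Derive an object's home nodes from its name alone.
    Because placement is computed rather than looked up, there is no
    central metadata controller to overwhelm or outgrow."""
    digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replicas)]

print(locate("imagenet/shard-00042.tfrecord"))
# Every client computes the same answer independently, so metadata
# handling scales with the node count instead of a controller cluster.
```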
StorageSwiss Take
As AI/DL workloads move into the mainstream, many organizations are finding that supporting them on an AFA is too expensive and does not deliver the performance benefit they expect. Instead, it makes more sense to create an environment that provides parallel access to the AI/DL dataset and turns the environment's high capacity into an advantage. In object storage, high capacity requires a high storage node count, and even if those nodes are all hard-drive-based, together they can deliver enough parallel throughput to keep GPUs well fed.
In a recent episode of Storage Intensity, we sat down with Joe Arnold, CTO, and Founder of SwiftStack to discuss the concept of using object storage as primary storage for AI and DL. Click the link below to listen. If you want to subscribe to Storage Intensity, click one of the buttons below to subscribe to the podcast on your favorite podcast app. We release two new episodes a week, and all our podcasts are sit-down conversations with storage experts from across the industry.