Artificial Intelligence (AI) is in its infancy, and the demands it places on the storage architectures that support these workloads are not widely understood. As a result, an organization starting an AI initiative can initially get away with using an all-flash storage system. Getting these projects off to a quick start is so crucial that most organizations can justify a 100TB flash investment. A flash-only strategy may work as the organization moves through development, testing, and even early production. The challenge is dealing with the storage infrastructure requirements as these environments go mainstream. As AI workloads reach full production, their storage requirements can grow exponentially, and environments of one hundred petabytes or more, with dozens and potentially hundreds of GPUs, may become common.
There are three primary challenges that AI-at-Scale creates for the storage infrastructure. It needs to support high-capacity storage, long-term data retention, and, of course, high-performance processing.
AI at Scale Requires Capacity at Scale
Artificial intelligence simulates human intelligence by processing billions, if not trillions, of data points, fast. In some cases, each data point is a separate file that is read sequentially. These files need to be stored, but given the capacity requirements of AI environments in full production, the current approach of storing all data on an all-flash array is impractical. Some organizations may justify the expense of petabytes of flash storage, but if a competitor architects a similarly performing storage infrastructure on less expensive media, that competitor gains a distinct cost advantage.
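As a rough illustration of the economics, consider a back-of-the-envelope comparison of an all-flash design against a hybrid flash/disk design for a 100PB data set. The per-gigabyte prices and the 10% hot working set below are assumptions for illustration only, not vendor pricing:

```python
# Back-of-the-envelope media cost comparison for a 100PB data set.
# The per-GB prices below are illustrative assumptions, not quotes.

CAPACITY_PB = 100
GB_PER_PB = 1_000_000

FLASH_PRICE_PER_GB = 0.20   # assumed $/GB for enterprise flash
HDD_PRICE_PER_GB = 0.03     # assumed $/GB for capacity hard disk

capacity_gb = CAPACITY_PB * GB_PER_PB

all_flash_cost = capacity_gb * FLASH_PRICE_PER_GB

# Hybrid: keep a hot 10% working set on flash, the rest on disk.
hot_fraction = 0.10
hybrid_cost = (capacity_gb * hot_fraction * FLASH_PRICE_PER_GB +
               capacity_gb * (1 - hot_fraction) * HDD_PRICE_PER_GB)

print(f"All-flash : ${all_flash_cost:,.0f}")
print(f"Hybrid    : ${hybrid_cost:,.0f}")
print(f"Savings   : {1 - hybrid_cost / all_flash_cost:.0%}")
```

Even if the assumed prices shift, it is the ratio between flash and disk cost per gigabyte that drives the gap, and that ratio compounds at petabyte scale.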
AI at Scale Requires Retention
The second requirement of a storage system that supports a full-production AI workload is retention. The accuracy of AI increases in lockstep with the amount of training data it can access. The organization rarely deletes the data points it uses to train its AI workloads because of the initial cost of acquiring them. These data sets also do not follow the typical access model, where the chance of use decreases as data ages. The chance that the AI workload will need to reprocess old training data is almost 100%, so the entire data set needs to remain readily accessible. Once again, AI in production is an ideal workload for a mixed flash and hard disk environment. The storage system needs the intelligence to manage metadata quickly and to store the right data on the right type of storage.
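A minimal sketch of what such placement intelligence might look like, assuming a hypothetical hybrid system with a "flash" and an "hdd" tier: because AI training data does not age out, the sketch keys placement on membership in the active training manifest rather than on last-access time. The tier names and manifest structure are illustrative assumptions.

```python
# Minimal sketch of a placement policy for a hybrid flash/HDD system.
# Placement is driven by membership in the active training manifest,
# not by data age, because old training data is still likely to be read.

from dataclasses import dataclass

@dataclass
class FileRecord:
    path: str
    size_bytes: int

def choose_tier(record: FileRecord, active_manifest: set[str]) -> str:
    """Return the tier a file should live on for the next training run."""
    if record.path in active_manifest:
        return "flash"   # file will be re-read during training: keep it hot
    return "hdd"         # retained for future retraining, still online

# Example: the current run trains on two of three retained files.
manifest = {"/data/train/a.tfrecord", "/data/train/b.tfrecord"}
files = [
    FileRecord("/data/train/a.tfrecord", 128_000_000),
    FileRecord("/data/train/b.tfrecord", 256_000_000),
    FileRecord("/data/train/old_c.tfrecord", 512_000_000),
]
for f in files:
    print(f.path, "->", choose_tier(f, manifest))
```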
AI at Scale Requires Performance
The third requirement is, of course, high performance, but how that performance is delivered is not so obvious. Training an AI application is an iterative process: improving accuracy means training, tweaking the AI algorithm, and training again. The faster each iteration completes, the more accurate the developer can make the model, which increases the pressure on the storage infrastructure. The key in most AI workloads is to keep the graphics processing units (GPUs), standard in these environments, as busy as possible; GPUs are too expensive to sit idle for long. Depending on the AI workload, a scale-out storage system with many nodes and a mixture of flash and hard disk may easily meet the GPUs' IO requirements. AI workloads tend to be highly parallel, and a parallel, scale-out storage cluster may meet the challenge even with hard disk drives.
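The sketch below illustrates the parallelism argument using only the Python standard library: many per-file reads are issued concurrently, so a scale-out cluster can service them from many nodes at once while the GPU works on the previous batch. The file paths and the process_on_gpu() stub are hypothetical stand-ins:

```python
# Minimal sketch: keep GPUs busy by issuing many file reads in parallel,
# so aggregate throughput comes from concurrency across storage nodes
# rather than the speed of any single drive.
# Paths and the process_on_gpu() stub are illustrative assumptions.

from concurrent.futures import ThreadPoolExecutor

def read_file(path: str) -> bytes:
    with open(path, "rb") as f:
        return f.read()

def process_on_gpu(batch: list[bytes]) -> None:
    pass  # stand-in for the actual training step

paths = [f"/data/train/sample_{i:06d}.bin" for i in range(1024)]
BATCH = 64

with ThreadPoolExecutor(max_workers=32) as pool:
    for start in range(0, len(paths), BATCH):
        chunk = paths[start:start + BATCH]
        # Up to 32 reads are in flight at once; a scale-out cluster
        # services them from many nodes (flash or disk) in parallel.
        batch = list(pool.map(read_file, chunk))
        process_on_gpu(batch)
```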
Another key to AI performance is how well the storage infrastructure manages metadata. Given the high file/object count of AI data sets, high metadata performance is critical. The storage infrastructure needs to segregate metadata from the standard IO path and storage so that it can process metadata requests quickly.
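A toy sketch of that segregation, assuming a hypothetical in-memory index standing in for a dedicated (typically flash-resident) metadata service: namespace queries such as stat and directory listing never touch the bulk data tier, so enumerating millions of files does not compete with training IO.

```python
# Toy sketch of metadata/data path segregation. Both classes are
# illustrative assumptions, not a real storage system's API.

class MetadataIndex:
    """Fast path: answers namespace queries without bulk-store IO."""
    def __init__(self):
        self._entries: dict[str, int] = {}   # path -> size in bytes

    def record(self, path: str, size: int) -> None:
        self._entries[path] = size

    def stat(self, path: str) -> int:
        return self._entries[path]           # no data-path IO involved

    def list_prefix(self, prefix: str) -> list[str]:
        return [p for p in self._entries if p.startswith(prefix)]

class BulkStore:
    """Slow path: only actual reads and writes touch this tier."""
    def __init__(self):
        self._blobs: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._blobs[path] = data

    def read(self, path: str) -> bytes:
        return self._blobs[path]

index, store = MetadataIndex(), BulkStore()
store.write("/data/train/x.bin", b"\x00" * 1024)
index.record("/data/train/x.bin", 1024)
# Listing and stat exercise only the metadata tier, not the data path.
print(index.list_prefix("/data/train/"), index.stat("/data/train/x.bin"))
```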
High-performance metadata contributes to another aspect of performance: low latency. AI workloads need low latency more than they need high IOPS. Metadata performance contributes to low latency, as do well-designed storage software and infrastructure.
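A short worked example, applying Little's Law (throughput = concurrency / latency) with assumed figures, shows why shaving latency pays off directly: at fixed concurrency, halving per-request latency doubles delivered throughput, which translates straight into GPUs kept busy.

```python
# Worked example: Little's Law (throughput = concurrency / latency)
# applied to small-file reads feeding GPUs. All figures are assumptions.

latency_s = 0.005        # assumed 5 ms per small-file read
concurrency = 128        # assumed requests the cluster keeps in flight
file_size_mb = 1.0       # assumed average file size

reads_per_sec = concurrency / latency_s          # 25,600 files/s
throughput_mb_s = reads_per_sec * file_size_mb   # 25,600 MB/s

gpu_demand_mb_s = 3_000  # assumed ingest rate one GPU can consume
gpus_fed = throughput_mb_s / gpu_demand_mb_s

print(f"{reads_per_sec:,.0f} reads/s, {throughput_mb_s:,.0f} MB/s")
print(f"Keeps roughly {gpus_fed:.1f} GPUs busy at this demand")
```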
Conclusion – AI Needs More than Flash
IT planners cannot create AI storage infrastructures that are exclusively flash. The cost of petabytes of flash is the primary reason. Another reason is that most all-flash systems are block-based, while AI at scale needs multiple GPU-equipped systems to process all those data points, which requires shared, network-attached storage. AI environments, however, can't go to the other extreme of a do-it-yourself infrastructure, such as those popular in academic high-performance computing (HPC). A system that is turnkey but doesn't require proprietary components enables the organization to meet all of these requirements without getting bogged down in a science project.
Sponsored By Panasas