Modern unstructured workloads are setting new standards in terms of not only capacity requirements but also performance. These workloads either need to ingest and analyze hundreds of thousands of files or quickly read individual files over a terabyte in size. Often, these workloads are assisted by graphics processing units (GPUs) for processing. Keeping those GPUs busy is critical to maximizing the investment the organization is making in them. Optimizing unstructured data performance is vital to the success of these workloads, but creating high-performance unstructured data solutions is often full of compromises.
Addressing the “At Scale Problem”
Thanks to All-NVMe flash-based storage, high-performance storage is accessible to many organizations today. High-performance All-NVMe flash systems are now similarly priced to SAS SSD and performance-oriented disk-based systems when dealing with more moderate capacities. Solving the performance problem, however, becomes more challenging in multi-petabyte environments. While flash does solve the performance problem, storing multiple petabytes of data on flash creates new challenges. The price delta between a hard-disk system and a flash-based system may be acceptable when the storage system has only 24 or 48 drives, but in a system with hundreds or even thousands of drives, the price difference quickly adds up.
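The "adds up at scale" point is easy to make concrete with simple arithmetic. The sketch below compares raw media cost for a small system versus a thousand-drive system; the per-terabyte prices and drive capacity are hypothetical assumptions chosen only to illustrate how the absolute gap grows with drive count, not actual market figures.

```python
# Illustrative media-cost comparison: HDD vs. NVMe flash at different scales.
# All prices below are assumptions for illustration, not real market data.
HDD_COST_PER_TB = 20.0    # assumed $/TB for high-capacity hard disk
FLASH_COST_PER_TB = 80.0  # assumed $/TB for NVMe flash
DRIVE_TB = 16             # assumed capacity per drive

def media_cost(drive_count: int, cost_per_tb: float) -> float:
    """Raw media cost for a system with the given number of drives."""
    return drive_count * DRIVE_TB * cost_per_tb

for drives in (24, 1000):
    premium = media_cost(drives, FLASH_COST_PER_TB) - media_cost(drives, HDD_COST_PER_TB)
    print(f"{drives:>5} drives: flash premium = ${premium:,.0f}")
```

Even though the per-terabyte ratio is identical in both cases, the absolute premium at 1,000 drives is roughly 40x the premium at 24 drives, which is what drives hybrid designs at petabyte scale.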
In addition to making capacity affordable for petabyte scale workloads, these systems need to also provide a scale-out architecture that easily expands to keep pace with continuous growth in unstructured data. The ability to expand is more than a raw capacity requirement though. These systems need metadata constructs that can manage billions of files and data protection algorithms that won’t require double the storage capacity.
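The "won't require double the storage capacity" requirement typically points at erasure coding rather than replication. A minimal sketch of the overhead arithmetic, assuming a generic k+m erasure code (k data shards, m parity shards):

```python
def raw_capacity_needed(usable_tb: float, data_shards: int, parity_shards: int) -> float:
    """Raw capacity required to store `usable_tb` of data under a k+m erasure code.

    Overhead factor is (k + m) / k: each stripe of k data shards carries
    m extra parity shards for fault tolerance.
    """
    return usable_tb * (data_shards + parity_shards) / data_shards

# Triple replication behaves like a 1+2 code: 3x raw capacity for 1 PB usable.
print(raw_capacity_needed(1000, 1, 2))  # 3000.0 TB raw

# A wide 8+2 erasure code also survives two simultaneous failures,
# but at only 1.25x overhead.
print(raw_capacity_needed(1000, 8, 2))  # 1250.0 TB raw
```

At petabyte scale, the difference between 3x and 1.25x overhead is measured in whole racks of drives, which is why wide-stripe erasure coding is standard in this class of system.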
There is a category of storage solutions on the market labeled "Extreme Performance" solutions. Vendors of these solutions provide systems that ignore cost and sacrifice capacity to deliver millions of IOPS. They typically take a brute-force approach to performance, which is acceptable in the markets they serve.
Petabyte-scale environments, however, require something different than brute force performance. They need to judiciously use the right mix of random access memory (RAM) as storage, flash storage, and hard disk drive storage. These systems require more than simple caching of hard disk with flash and RAM. They need to use predictive analytics to prefetch data so the user doesn't have to wait for caches to warm up. The system also needs to be intelligent enough not to copy data into flash or RAM when the type of file being accessed, or the way it is being accessed, won't benefit from high-performance storage.
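A cache-admission policy of the kind described might look like the following minimal sketch. The thresholds, field names, and the `should_promote_to_flash` helper are all hypothetical, chosen to illustrate the idea that large sequential streams are served well enough by HDD readahead and should not churn the flash tier, while small, repeatedly accessed files should be promoted:

```python
from dataclasses import dataclass

@dataclass
class AccessInfo:
    """Per-file access statistics (hypothetical fields for illustration)."""
    size_bytes: int
    sequential: bool   # was the recent access pattern sequential?
    recent_hits: int   # accesses observed in a recent window

LARGE_STREAM_BYTES = 1 << 30  # 1 GiB; assumed threshold for "streaming" files

def should_promote_to_flash(info: AccessInfo) -> bool:
    """Admit a file to the flash tier only when caching it would pay off."""
    # Large sequential streams: HDD readahead already delivers good
    # throughput, so copying them to flash just evicts more useful data.
    if info.sequential and info.size_bytes >= LARGE_STREAM_BYTES:
        return False
    # Otherwise promote on evidence of re-access.
    return info.recent_hits >= 2

print(should_promote_to_flash(AccessInfo(4096, False, 3)))      # hot small file -> True
print(should_promote_to_flash(AccessInfo(2 << 30, True, 10)))   # big stream -> False
```

A real system would add prefetch prediction on top of admission control, but the admission decision alone captures the "don't cache what won't benefit" behavior the text describes.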
These environments, at least for today, must use hard disk drives. Hard disk capacity continues to cost significantly less than flash, and hard disk vendors continue to push the price-per-GB envelope with higher-capacity drives. To meet performance requirements, these hard drives need to be front-ended with flash and RAM, but to curtail storage expenses, hard drives continue to play a key role in petabyte-scale storage solutions. Intelligent caching, plus a generous flash cache area, goes a long way toward hiding the performance differential between flash and hard disk.
A petabyte-scale environment is so different from the normal day-to-day data center system that conventional wisdom no longer applies. These systems often deal with terabyte-plus sized files or billions of small files, and sometimes both. The capacity and metadata management requirements of these systems are as critical as making sure performance expectations are met.
Legacy Network Attached Storage (NAS) was never designed for petabyte-scale workloads, and many customers are looking for alternatives. The available choices are all-flash systems, which may address the performance challenge but are too expensive, and high-capacity object storage systems, which address the cost concern but may be too slow. Instead, petabyte-scale customers need a new file system written from the ground up for their use case. They need an intelligent balance of high performance and high capacity while not sacrificing reliability.
To learn more about addressing the performance concerns of petabyte-scale organizations, watch our Lightboard Video: “Understanding Petascale Performance Demands”.