Requirements for Unstructured Data at Petabyte Scale

Posted on October 14, 2019 by George Crump

Unstructured data is hard to manage. When an organization’s unstructured data assets cross the petabyte threshold, though, controlling the data set brings on an entirely new set of challenges.

Most legacy network-attached storage (NAS) systems struggle to manage 100-terabyte unstructured data sets. Managing multiple petabytes is beyond their capabilities. The limitations of current NAS systems lead many IT professionals to assume that the only answer is an object storage system. The problem is that while object storage may scale to meet the capacity demand, it may fall short when it comes to performance.

A further difficulty in choosing a storage infrastructure for petabyte-scale unstructured data sets is that the requirements must be evaluated as a whole, because all the elements need to work together.

All-Flash is Not Practical

There is no denying that the price of flash storage has come down significantly over the past five to six years, but hard disk storage is still substantially less expensive on a per-terabyte basis. At the same time, the performance demands on unstructured data have only increased over the past few years. Modern unstructured data storage systems need to process both metadata and the actual data very quickly. While some all-flash vendors claim the demand for performance trumps the demand for capacity, in a petabyte-scale environment, buying a petabyte or more of flash media isn’t practical.
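To make the cost argument concrete, here is a rough sketch of the arithmetic. The function name and the 4:1 price ratio are placeholders chosen for illustration, not market data; substitute real quotes before drawing any conclusion.

```python
def media_cost(capacity_tb: float, flash_fraction: float,
               flash_per_tb: float, hdd_per_tb: float) -> float:
    """Raw media cost for a pool in which flash_fraction of capacity is flash."""
    if not 0.0 <= flash_fraction <= 1.0:
        raise ValueError("flash_fraction must be between 0 and 1")
    flash_tb = capacity_tb * flash_fraction
    hdd_tb = capacity_tb - flash_tb
    return flash_tb * flash_per_tb + hdd_tb * hdd_per_tb


# ASSUMPTION: relative cost units with a hypothetical 4:1 flash-to-disk ratio.
FLASH_PER_TB = 4.0
HDD_PER_TB = 1.0

# One petabyte (1,000 TB) as all-flash versus a hybrid with 10% flash.
all_flash = media_cost(1000, 1.0, FLASH_PER_TB, HDD_PER_TB)
hybrid = media_cost(1000, 0.1, FLASH_PER_TB, HDD_PER_TB)
print(f"all-flash: {all_flash:.0f} units, hybrid: {hybrid:.0f} units")
```

Whatever the real ratio turns out to be, the gap between the two figures grows linearly with capacity, which is why it matters most at petabyte scale.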

Modern unstructured data storage systems need to intelligently use both flash and hard disk drive storage, and automatically move data between the two tiers as required. These systems can benefit from the lower cost of flash to increase flash capacity and lessen the impact of a cache miss. But they also need to leverage hard disk storage to keep costs in check. The modern unstructured data storage system also needs to leverage cloud storage for both long-term archive and workload portability.
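As a minimal sketch of what automatic movement between tiers could look like, the following places the most recently accessed files on flash until a flash budget is used up and leaves everything else on disk. The FileRecord type, the retier function, and the recency-based policy are assumptions made for illustration; real systems use more sophisticated heuristics and typically work at the block level rather than per file.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    last_access: float  # seconds since epoch
    tier: str = "hdd"   # "flash" or "hdd"


def retier(files, now, hot_window_s, flash_budget_bytes):
    """Assign recently accessed files to flash until the budget is spent.

    Hypothetical policy: a file is "hot" if it was accessed within
    hot_window_s seconds of now. Returns the flash bytes consumed.
    """
    used = 0
    # Consider the most recently accessed files first.
    for f in sorted(files, key=lambda f: f.last_access, reverse=True):
        is_hot = (now - f.last_access) <= hot_window_s
        if is_hot and used + f.size_bytes <= flash_budget_bytes:
            f.tier = "flash"
            used += f.size_bytes
        else:
            f.tier = "hdd"
    return used


files = [
    FileRecord("/render/frame1.exr", 10, last_access=100.0),
    FileRecord("/render/frame2.exr", 10, last_access=90.0),
    FileRecord("/archive/old.tar", 10, last_access=0.0),
]
flash_used = retier(files, now=100.0, hot_window_s=50.0, flash_budget_bytes=15)
```

In this example only the first file fits in the flash budget; the second is hot but is left on disk because flash is full, and the third is cold.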

Metadata Must Scale

Another requirement is that metadata must scale to keep pace with the growth in the unstructured data set. Managing metadata is especially crucial since many unstructured data workloads are now dealing with millions, if not billions, of files. Each of these files, of course, generates metadata. Most file system vendors report that as much as 80% of all IO is metadata. In many cases, legacy NAS and file systems reach scaling limits because of metadata bottlenecks. The customer is forced to buy another storage system, even though technically the current system can provide much more capacity.

The file system should also leverage flash to deal with the metadata challenges that petabyte-scale unstructured data sets create. When data is either written or modified, the file system should extract the metadata about the file and store it on a separate area of flash. Storing metadata on flash not only provides fast responses to metadata requests (again, as much as 80% of all IO is metadata), it also isolates that IO, leaving the path to the actual data less busy.
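The separation described here can be sketched as two stores with independent access paths. The class names below are hypothetical, and plain dictionaries stand in for the flash metadata area and the bulk data tier; the point is only that a metadata request never touches the data path.

```python
class MetadataIndex:
    """In-memory stand-in for a metadata area kept on flash (assumption)."""

    def __init__(self):
        self._entries = {}

    def record(self, path, size_bytes, mtime, owner):
        self._entries[path] = {"size": size_bytes, "mtime": mtime, "owner": owner}

    def stat(self, path):
        return self._entries[path]


class FileStore:
    """Toy file store that extracts metadata on every write."""

    def __init__(self):
        self.meta = MetadataIndex()
        self._blobs = {}        # stand-in for the bulk data tier
        self.data_reads = 0     # counts trips down the data path

    def write(self, path, data, owner, mtime):
        self._blobs[path] = data
        # Metadata is captured at write time and stored separately.
        self.meta.record(path, len(data), mtime, owner)

    def read(self, path):
        self.data_reads += 1
        return self._blobs[path]

    def stat(self, path):
        # Served entirely from the metadata index; no data IO.
        return self.meta.stat(path)


store = FileStore()
store.write("/proj/report.csv", b"a,b,c\n1,2,3\n", owner="alice", mtime=1571011200)
info = store.stat("/proj/report.csv")
```

After the stat call, store.data_reads is still zero, which is the isolation the paragraph above describes.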

Capacity Must Scale

Addressing the petabyte-scale metadata challenge enables the NAS or file system to provide far more capacity than the prior generation of storage solutions, which means the file system's capacity must scale as well. It accomplishes scaling by clustering commodity servers, called nodes. Each node has internal storage capacity, both flash and hard disk, and contributes that storage to a global pool of storage. When the organization needs more capacity, IT adds another node that provides its capacity to the global pool.
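A simple model of that scale-out approach, with hypothetical Node and Cluster types and made-up node sizes, might look like the following. It tracks raw capacity only and ignores the overhead of data protection, which any real cluster would subtract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    flash_tb: float
    hdd_tb: float


@dataclass
class Cluster:
    nodes: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        """Adding a node contributes its capacity to the global pool."""
        self.nodes.append(node)

    @property
    def flash_tb(self) -> float:
        return sum(n.flash_tb for n in self.nodes)

    @property
    def hdd_tb(self) -> float:
        return sum(n.hdd_tb for n in self.nodes)

    @property
    def raw_capacity_tb(self) -> float:
        return self.flash_tb + self.hdd_tb


# ASSUMPTION: illustrative node sizes, not any vendor's configuration.
cluster = Cluster()
for i in range(4):
    cluster.add_node(Node(f"node-{i}", flash_tb=8, hdd_tb=96))

before = cluster.raw_capacity_tb
cluster.add_node(Node("node-4", flash_tb=8, hdd_tb=96))
after = cluster.raw_capacity_tb
```

Each added node raises both the flash and the hard disk portions of the pool, so the two tiers grow together.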

Insights are Power

Another must-have is data insight. Given the number of files and the capacity they consume, IT needs to know as much about the data set as possible. The problem is that most file systems add insight features after the fact, so those tools have to scan the file system, file by file, to gather the information. These scans take a lot of time, especially in file systems that hold high millions, if not billions, of files.

IT needs real-time actionable data to monitor system performance and capacity utilization. These teams need to identify at a moment’s notice if a runaway process is consuming all of the file system’s available IO. Real-time analysis requires building this capability into the file system from the start, instead of adding it later. If the file system separates the metadata from the actual data and stores it on flash media, then the file system’s analytics feature can get to that data instantly and provide the organization with real-time answers.
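One way to picture analytics built in from the start is a structure that updates per-directory totals on every write, so a usage question is answered with a lookup instead of a file-by-file scan. The UsageTree class below is a hypothetical sketch of that idea, not a description of any particular product's implementation.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class UsageTree:
    """Keeps running byte and file counts for every ancestor directory."""

    def __init__(self):
        self._bytes_under = defaultdict(int)
        self._files_under = defaultdict(int)
        self._sizes = {}  # current size of each known file

    def record_write(self, path: str, size_bytes: int) -> None:
        p = PurePosixPath(path)
        key = str(p)
        old = self._sizes.get(key)
        delta = size_bytes - (old or 0)
        # Update every ancestor once, at write time.
        for parent in p.parents:
            d = str(parent)
            self._bytes_under[d] += delta
            if old is None:
                self._files_under[d] += 1
        self._sizes[key] = size_bytes

    def usage(self, directory: str):
        """Return (bytes, file_count) under a directory without scanning."""
        return (self._bytes_under.get(directory, 0),
                self._files_under.get(directory, 0))


tree = UsageTree()
tree.record_write("/proj/a/x.bin", 100)
tree.record_write("/proj/b/y.bin", 50)
tree.record_write("/proj/a/x.bin", 300)  # overwrite grows the file
```

The cost of a write rises with directory depth, but a query such as tree.usage("/proj") stays a constant-time lookup no matter how many files sit underneath.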

StorageSwiss Take

Petabyte-scale unstructured data environments are simply different from environments measured in terabytes. The use cases tend to create and need access to more files faster than legacy unstructured data workloads. Storage capacities frequently surpass one petabyte, with many organizations in the tens-of-petabytes range. Given the rise in artificial intelligence and machine learning, as well as the new demands of digital media, the demands on the file system will only increase.

Modern unstructured data storage solutions need to deal with these challenges holistically. They need to leverage flash, for multiple reasons, without forgoing the cost-savings potential of hard disk drives. At the same time, these systems need to provide insight into the data so IT can effectively manage it.

Qumulo has an architecture designed to meet the modern demands of unstructured data. To learn more about their architecture, watch our Lightboard Video in which Eric Scollard, Qumulo’s Vice President of Worldwide Sales, joins me to discuss the Qumulo architecture and how it meets these new demands.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
