Our previous blog highlighted the challenges of supporting artificial intelligence (AI), machine learning (ML) and deep learning (DL) workloads with legacy file systems. Control node bottlenecks, inferior (or missing) non-volatile memory express (NVMe) drivers, and inefficient capacity utilization are among the pain points that come with trying to process the millions or billions of small files that typically comprise AI, ML and DL workloads with legacy network-attached storage (NAS) approaches.
In this installment, we explore the hallmarks of a modern file system architecture. Notably, it should be designed to fully capitalize on the performance acceleration offered by NVMe while also delivering extreme levels of I/O performance. Furthermore, it should integrate with cloud compute and storage resources for cost efficiency. At the same time, a distributed architecture is critical to optimizing data protection.
Built-in NVMe Support
NVMe is a storage protocol designed to accelerate the transfer of data between host systems and solid-state drive (SSD) storage media over the server’s peripheral component interconnect express (PCIe) bus. By increasing command counts and queue depths, NVMe enables the enterprise to exploit far more of an SSD’s maximum performance – a key value proposition when it comes to serving AI, ML and DL workloads. To do so, however, the storage infrastructure must be architected correctly; because NVMe’s latency is so low, it exposes any other bottleneck in the storage infrastructure. Legacy NAS architectures were not designed to take advantage of NVMe. For instance, file servers continually request metadata before executing operations, adding significant communication overhead and precluding full exploitation of the potential performance acceleration.
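To make the queue-depth point concrete, the sketch below issues random 4 KiB reads with varying numbers of outstanding requests and reports the resulting throughput. It is a minimal illustration, not a benchmark: the file path, block size, read count and thread counts are assumptions, and a serious test would use a purpose-built tool such as fio with direct I/O so the page cache does not mask device behavior.

```python
# Minimal sketch (not a benchmark): approximate the effect of I/O concurrency
# ("queue depth") by issuing random 4 KiB reads with different numbers of
# worker threads. The path, block size, read count and thread counts are
# assumptions for illustration only.
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

PATH = "/mnt/nvme/testfile"   # assumption: a large file on NVMe-backed storage
BLOCK = 4096                  # 4 KiB per read
READS = 20000                 # total reads per run

def run(workers: int) -> float:
    """Issue READS random reads using `workers` concurrent threads; return MB/s."""
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    offsets = [random.randrange(0, size - BLOCK) // BLOCK * BLOCK for _ in range(READS)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # os.pread releases the GIL, so threads approximate outstanding I/Os.
        list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))
    elapsed = time.perf_counter() - start
    os.close(fd)
    return READS * BLOCK / elapsed / 1e6

for depth in (1, 4, 16, 64):
    print(f"{depth:>3} concurrent reads: {run(depth):8.1f} MB/s")
```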
Excellent Performance with Small Files
Another significant change the file system architecture must undergo in order to support AI, ML and DL is the ability to process very small files rapidly. Legacy file system architectures were designed for workloads, such as high-performance computing (HPC), that require rapid processing of large files. AI, ML and DL workloads, on the other hand, require equally fast processing of millions (or billions) of small files. This creates a situation whereby metadata access requests, which typically account for anywhere from 70% to 90% of the requests served by a NAS system, become the bottleneck. As a result, the modern file system must continue to deliver high levels of network bandwidth while also delivering extreme levels of I/O performance, to keep central processing unit (CPU) and graphics processing unit (GPU) resources fully utilized.
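A simple way to see how metadata comes to dominate a small-file workload is to time the metadata steps (stat and open) separately from the data transfer while walking a dataset of small files. The sketch below does exactly that; the dataset path is an assumption, and the numbers will vary with the file system and cache state.

```python
# Minimal sketch: walk a directory of many small files and time the metadata
# steps (stat, open) separately from the data transfer. The dataset path is
# an assumption; results vary with the file system and cache state.
import os
import time

ROOT = "/mnt/nas/small-file-dataset"   # assumption: a tree of many small files

meta_time = data_time = 0.0
files_seen = 0
bytes_read = 0

for dirpath, _, files in os.walk(ROOT):
    for name in files:
        path = os.path.join(dirpath, name)
        t0 = time.perf_counter()
        st = os.stat(path)                # metadata request
        fd = os.open(path, os.O_RDONLY)   # lookup + open: more metadata traffic
        t1 = time.perf_counter()
        data = os.read(fd, st.st_size)    # the actual data transfer
        os.close(fd)
        t2 = time.perf_counter()
        meta_time += t1 - t0
        data_time += t2 - t1
        files_seen += 1
        bytes_read += len(data)

total = meta_time + data_time
print(f"{files_seen} files, {bytes_read / 1e6:.1f} MB read")
print(f"metadata share of request time: {100 * meta_time / total:.1f}%")
```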
Cloud Integration
Storage Switzerland sees the hybrid cloud model as cost effective, and as a result popular among enterprises, for hosting AI, ML and DL workloads. Organizations may want to temporarily shift workloads to the public cloud for peak processing demands, but then later bring those workloads back on-premises under normal conditions. The problem is that legacy NAS architectures were not designed to enable seamless portability of data and workloads between on- and off-premises infrastructure resources, or to run workloads in parallel across these resources – both of which are required for a true hybrid cloud architecture.
Storage Switzerland views the ability to access and pay for compute resources on demand as one of the most effective use cases of off-premises cloud services. When it comes to AI, ML and DL, many enterprises are looking to get started quickly and with as little overhead (both upfront infrastructure investment and ongoing infrastructure management) as possible. This makes the ability to burst AI, ML and DL workloads to the cloud temporarily for processing appealing. Other enterprises have invested in some on-premises CPUs and GPUs, but also need to run workloads in parallel in the cloud during temporary spikes in compute and storage demand, and then compare those results against analytics jobs conducted on-premises. Finally, the ability to tier data across on- and off-premises storage resources can help to control costs; data should be tiered from the most expensive, fastest-performing on-premises SSD media down to lower-cost object storage services, depending on how frequently the enterprise accesses it.
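For illustration, the sketch below shows the skeleton of an age-based tiering policy of the kind described above: files that have not been accessed within a threshold are demoted from a fast on-premises tier to a cheaper one. The paths, the 30-day threshold, and the use of a local directory as a stand-in for an object-storage bucket are all assumptions; in a real deployment the file system would perform this movement transparently, or the cold copy would go to an object-storage service via its SDK.

```python
# Minimal sketch of an age-based tiering policy: demote files that have not
# been accessed within COLD_AFTER_DAYS from a fast on-premises tier to a
# cheaper tier. Paths and the threshold are assumptions; a local directory
# stands in for an object-storage bucket purely for illustration.
import os
import shutil
import time

FAST_TIER = "/mnt/nvme/datasets"       # assumption: expensive, fast SSD tier
COLD_TIER = "/mnt/cold-tier/datasets"  # assumption: stand-in for object storage
COLD_AFTER_DAYS = 30                   # assumption: demotion threshold

cutoff = time.time() - COLD_AFTER_DAYS * 86400

for dirpath, _, files in os.walk(FAST_TIER):
    for name in files:
        src = os.path.join(dirpath, name)
        if os.stat(src).st_atime < cutoff:   # last access is older than cutoff
            rel = os.path.relpath(dirpath, FAST_TIER)
            dst_dir = os.path.join(COLD_TIER, rel)
            os.makedirs(dst_dir, exist_ok=True)
            shutil.move(src, os.path.join(dst_dir, name))   # demote to cold tier
```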
Cost-effective Data Protection
Last but far from least, the need to serve millions or even billions of very small files increases the risk that storage media will fail or that data will be corrupted; these realities, coupled with the need to deliver demanding performance levels during a failed state, create new data protection and disaster recovery requirements. The scale-up architectures of traditional NAS systems mean that it takes a long time for the system to return to acceptable levels of performance after a failure, because all processes must flow through the single head node. That under-utilization quickly becomes very expensive as faster – and costlier – processors, storage media and networking are introduced.
As a result, a distributed approach that spreads rebuilds across nodes is needed. Also required is an erasure coding style of protection scheme that optimizes capacity utilization while increasing resilience. Erasure coding requires more processing, however, so the modern file system must also distribute the data protection load across the computing power of the nodes.
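To make the capacity argument concrete, the sketch below implements the simplest possible erasure scheme – a single XOR parity chunk over k data chunks – which can rebuild any one lost chunk while adding only 1/k capacity overhead, versus 100% overhead for full replication. This is an illustration only; production file systems use stronger codes (for example Reed-Solomon variants) that tolerate multiple simultaneous failures and, as argued above, distribute both the chunks and the rebuild work across nodes.

```python
# Minimal sketch of the capacity/resilience trade-off behind erasure coding:
# one XOR parity chunk over k data chunks rebuilds any single lost chunk with
# only 1/k extra capacity (full replication would cost 100% extra).
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal (padded) chunks and append one parity chunk."""
    size = -(-len(data) // k)   # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    return chunks + [reduce(xor, chunks)]

def rebuild(stripe: list) -> list:
    """Recover a single missing chunk (marked as None) from the survivors."""
    missing = stripe.index(None)
    survivors = [c for c in stripe if c is not None]
    stripe[missing] = reduce(xor, survivors)
    return stripe

original = encode(b"a small training-sample payload", k=4)
damaged = list(original)
damaged[2] = None                      # simulate a failed drive or node
assert rebuild(damaged) == original    # the lost chunk is reconstructed
```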
In our next installment, we explore the challenges of relying solely on benchmarks when evaluating next-generation file architectures.
Sponsored by WekaIO
Interesting article, I am very impressed with the points that were brought up and with how AI and NVMe will work together on a high-velocity file-system. A good article that covers three high-performance file-systems – HDFS, CloudStore and PVFS – can be found here (https://www.researchgate.net/publication/241470073_Mixing_Hadoop_and_HPC_workloads_on_parallel_filesystems)
Hadoop CloudStore – The CloudStore filesystem, formerly called KosmosFS (KFS), is a distributed file-system similar to HDFS, and tailored for Hadoop workloads.
HPC PVFS – PVFS[3], a parallel filesystem suited to high-performance computing workloads, uses a considerably different data layout. Files are split into small chunks (64 KB by default).
HADOOP (Term Frequency-Inverse Document Frequency or TFIDF) – Hadoop TFIDF workload…The classifier uses a Term Frequency-Inverse Document Frequency (TFIDF) metric[6]. For brevity, we refer to this as the TFIDF workload. The workload partitions a large set of HTTP requests among a number of Map tasks; each Map task reads a previously generated 69 MB model file, and uses it to classify its share of the dataset.
Security aspects:
As our Hadoop workload, we use a Hadoop implementation of an HTTP attack classifier, which detects malicious code in incoming HTTP requests.
Performance – That the slowdown is always less than half suggests that neither workload is using the CloudStore file-system near its full capacity. Since the IOR benchmark does nothing but write data, this suggests that CloudStore is not well-suited out of the box for checkpointing workloads. This is reinforced by the PVFS IOR N-1 baseline results we present below: PVFS is particularly well suited for this workload, far exceeding the performance of CloudStore.
From the results, it seems that PVFS outperforms the CloudStore file-system at every turn. This experiment was performed for the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This may be something you can use.
Also, there is a file-system called Lustre; some interesting statistics about it came up in the research.
Lustre can deliver over 1 TB/s across thousands of clients. For example, the next-generation Spider Lustre file system of the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) is designed to provide 32 PB of capacity to open science users at OLCF, at an aggregate transfer rate of 1 TB/s. (Interestingly enough, AWS uses Lustre as well for their HPC offering, Amazon FSx for Lustre.)
Lustre Reference – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.459&rep=rep1&type=pdf (Oak Ridge National Laboratory provided the research)
PVFS and Lustre seem to be leading the charge when it comes to HPC (ZFS, XFS and GPFS are interesting alternatives).
Todd