High-Performance Computing (HPC) is no longer limited to the halls of academia and large government labs. Commercial HPC use cases are on the rise, and many organizations now have a large pool of unstructured data that requires analysis. The challenge for most commercial organizations is that the HPC project starts as a pilot that is, at some point, “thrown over the wall” to IT to store and manage. IT initially tries to apply its knowledge of traditional enterprise storage systems but quickly learns that HPC is a different animal, with different requirements than mission-critical databases or virtual machines.
The next step for the commercial organization is to investigate the various HPC-specific storage solutions. IT often finds these solutions, built from a combination of open source file systems and commodity hardware, too time-consuming to assemble and operate.
IT needs an integrated, enterprise-ready, HPC-specific storage solution, one that meets HPC IO requirements while also satisfying the commercial organization’s desire for rapid time to value. The problem is that traditional enterprise vendors try to extend their solutions so they look more HPC-like, while traditional HPC vendors try to bundle their solutions so they look more integrated. The result is confusion among IT personnel.
The goal of this blog series is to provide IT professionals with a checklist that they can use to make sure that their HPC storage selection meets all the requirements of the commercial HPC use case.
Item 1: Direct Access
The typical answer to most HPC storage problems is a scale-out file system or scale-out NAS. A scale-out NAS consists of multiple servers (nodes), each of which contributes its internal storage to the cluster, creating a centralized pool of storage. Traditional scale-out file systems, however, “shard” or stripe data across some number of nodes in a cluster. Scale-out systems claim to eliminate both capacity and performance concerns because each additional node adds to the cluster’s potential capacity and performance. The claims of scalable capacity are, for the most part, accurate, but the claims of scalable IO are not.
The typical scale-out cluster creates an IO problem for itself. When a client or application requests a file, each node that holds a shard of that file responds with its shard. The requesting client or application, though, has no idea how to reassemble these shards into the original file. Therefore, most scale-out NAS and file systems have a control node or gateway that reassembles the file from the shards before passing it back to the requesting client or application.
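To make sharding and reassembly concrete, here is a minimal Python sketch. It assumes fixed-size shards placed round-robin across nodes, which is only one possible layout and not any particular vendor's. The names stripe and reassemble are hypothetical and exist only for illustration.

```python
def stripe(data: bytes, num_nodes: int, stripe_size: int) -> dict:
    """Split data into fixed-size shards and place them round-robin on nodes.

    Returns {node_id: [(shard_index, shard_bytes), ...]}.
    Round-robin placement is an assumption made for this illustration.
    """
    nodes = {node_id: [] for node_id in range(num_nodes)}
    for offset in range(0, len(data), stripe_size):
        index = offset // stripe_size
        nodes[index % num_nodes].append((index, data[offset:offset + stripe_size]))
    return nodes


def reassemble(nodes: dict) -> bytes:
    """Gather the shards from every node and join them in shard-index order.

    This is the work a control node or gateway performs on each read.
    """
    shards = [shard for node_shards in nodes.values() for shard in node_shards]
    return b"".join(chunk for _, chunk in sorted(shards))


original = b"unstructured data " * 100
cluster = stripe(original, num_nodes=4, stripe_size=128)
assert reassemble(cluster) == original
```

Whichever component runs the reassembly step has to touch every byte of every file read, which is why its placement matters so much for IO performance.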
In a read-heavy environment with mixed file sizes, a common attribute of Commercial HPC, these control nodes or gateways can become overwhelmed with IO requests. Most scale-out file systems have either no ability, or only a limited ability, to scale the number of control nodes. As a result, the IO bottleneck becomes worse as the number of nodes increases because the control node has to reassemble data from a larger number of data nodes.
The solution is to give clients the intelligence to reassemble the shards themselves so they do not have to go through a control node. Panasas, for example, provides DirectFlow®, which enables Linux and macOS clients to access the data nodes directly, reducing the control node’s role to metadata management and background operating system tasks. This shifts the “heavy lifting” of moving data to the client.
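The general idea of client-side reassembly can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how DirectFlow or any other product is implemented. DataNode, write_file, and direct_read are hypothetical names, the nodes are in-memory stand-ins, and the layout list plays the part of the metadata a client would obtain from the control node.

```python
from concurrent.futures import ThreadPoolExecutor


class DataNode:
    """Hypothetical in-memory stand-in for a storage node."""

    def __init__(self):
        self.shards = {}

    def put(self, index, chunk):
        self.shards[index] = chunk

    def get(self, index):
        return self.shards[index]


def write_file(data, nodes, stripe_size):
    """Stripe data round-robin and return the layout: shard index -> node id."""
    layout = []
    for offset in range(0, len(data), stripe_size):
        index = offset // stripe_size
        node_id = index % len(nodes)
        nodes[node_id].put(index, data[offset:offset + stripe_size])
        layout.append(node_id)
    return layout


def direct_read(layout, nodes):
    """Client-side read: fetch all shards in parallel, then join them in order.

    No gateway touches the data; the client only needs the layout (metadata).
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        chunks = pool.map(
            lambda item: nodes[item[1]].get(item[0]), enumerate(layout)
        )
        return b"".join(chunks)


nodes = [DataNode() for _ in range(4)]
payload = b"commercial HPC storage " * 200
layout = write_file(payload, nodes, stripe_size=256)
assert direct_read(layout, nodes) == payload
```

In this sketch the only thing the client needs from a central service is the small layout list. The data itself flows straight from the nodes to the client.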
The advantage of direct access to data from the client is two-fold. First, the client or application requesting the data experiences better IO performance because it accesses the data directly instead of going through a single gateway. Access is truly parallel: multiple clients can read data simultaneously, and the nodes can respond to each client in parallel. Second, the overall cluster is more scalable because the control nodes aren’t bogged down with IO requests. Their processing can be dedicated to metadata and cluster management functions.
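A back-of-the-envelope model shows why the difference grows with cluster size. It is a deliberately simplified sketch. Every bandwidth figure is a hypothetical placeholder, and the model ignores metadata traffic, network contention, and reassembly overhead, so it shows the gateway path plateauing rather than degrading.

```python
def gateway_throughput(num_nodes, node_gbps, gateway_gbps):
    """Aggregate read throughput when every byte passes through one gateway."""
    return min(num_nodes * node_gbps, gateway_gbps)


def direct_throughput(num_nodes, node_gbps, num_clients, client_gbps):
    """Aggregate read throughput when clients read from data nodes directly."""
    return min(num_nodes * node_gbps, num_clients * client_gbps)


# Hypothetical figures: 1 GB/s per data node, a 10 GB/s gateway,
# and 100 clients that can each pull 1 GB/s.
for n in (4, 16, 64):
    print(n, gateway_throughput(n, 1.0, 10.0), direct_throughput(n, 1.0, 100, 1.0))
# prints:
# 4 4.0 4.0
# 16 10.0 16.0
# 64 10.0 64.0
```

Under these assumptions, adding nodes behind a single gateway stops helping once the gateway saturates, while the direct path keeps scaling until the nodes or the clients themselves become the limit.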
The downside to direct access is the need to load client-side software so the client can perform the shard reassembly function. Since most HPC workloads run on Linux, or on macOS in the case of media and entertainment, the storage solution vendor can support these two platforms and have the majority of the market covered. The Commercial HPC provider, however, should also provide gateway functionality that delivers support for more traditional protocols like SMB and NFS. The gateway is particularly useful when reading in data from devices (IoT and medical devices, for example) that can’t have a client installed or aren’t running Linux or macOS.
A given for Commercial HPC storage is that the architecture will be asked to scale well beyond its initial deployment. Additionally, dozens, if not hundreds or more, of users or applications access the storage cluster simultaneously. Delivering consistent performance throughout the storage architecture’s lifecycle is critical to a successful HPC storage design. Without the ability to scale performance along with capacity, the HPC storage solution fails to meet the long-term needs of the organization, especially since those expectations increase over time. This criticality of scale is why direct access to the cluster’s data nodes is number one on the checklist.