Scaling is a capability that almost every vendor claims but few execute well. Most vendors claim to provide scalability via a scale-out storage system, but scale-out by itself is not enough to address all the scaling demands of a Commercial HPC environment. Most Commercial HPC environments involve data sets that are new to the organization, and as such they can start with relatively small capacities and modest performance expectations. As the environment moves into production, it can scale quickly, and demands for both more capacity and higher performance arrive suddenly. Adding new nodes, and even upgrading to new technology, needs to happen seamlessly without disrupting ongoing HPC operations.
Start Small
A clustered file system is a standard method for implementing a scale-out storage strategy. The problem is that most turnkey or integrated solutions start too large for the initial Commercial HPC deployment, and it may take years for the organization to utilize the initial capacity efficiently. That reality leads many commercial organizations to attempt to leverage existing storage solutions already in the data center. In an attempt to avoid overbuying compute or storage resources, IT often implements the initial Commercial HPC project on legacy storage systems or homegrown HPC deployments. However, neither of these choices is optimal for the Commercial HPC project, and both eventually require a migration to a purpose-built HPC solution.
Ideally, the Commercial HPC customer should find a solution that can start as small as they need but also scale as large as is required. Getting the right HPC solution not only enables the organization to start small, it also enables it to move into production on the same technology it used to develop and deploy. Choosing the right solution also eliminates the migration to a new system, as well as the need to retrain personnel.
Scale Large
As the HPC project moves into full-scale production, the organization faces the opposite problem: making sure the system can scale large enough to continue to meet the capacity demands of the project. Scaling out requires meeting several challenges. First, the system has to integrate new nodes into the cluster successfully, since additional nodes provide the needed capacity and performance. However, adding another node is not always as straightforward as it should be. Many systems require adding the node manually, as well as manually rebalancing data from the other nodes onto the new node.
The Commercial HPC storage customer should look for an HPC storage system that can grow with them as their needs evolve. It should start small during the initial phases of development but scale large as the environment moves into production. The system should make the process of adding nodes as simple as possible: automatically discovering available nodes, adding them to the cluster, and rebalancing cluster data without impacting storage performance.
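The article does not describe how any particular vendor implements this rebalancing; the sketch below simply illustrates one common placement technique, consistent hashing, under which adding a node relocates only a proportional share of the existing data rather than forcing a full reshuffle. The node names and file counts are invented for illustration.

```python
# Minimal consistent-hashing sketch: adding a node moves only a
# fraction of the existing data, rather than forcing a full reshuffle.
import hashlib
from bisect import bisect

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node) points
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Each node owns several points on the ring; only keys that fall
        # between a new point and its predecessor move to the new node.
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    def owner(self, key):
        h = self._hash(key)
        idx = bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]

# Compare placement before and after adding a fourth node.
files = [f"file-{i}" for i in range(10000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {f: ring.owner(f) for f in files}
ring.add_node("node-d")
moved = sum(1 for f in files if ring.owner(f) != before[f])
print(f"{moved / len(files):.0%} of files relocate to the new node")
```

Running the sketch shows roughly a quarter of the files moving to the fourth node, which is the kind of bounded, automatic rebalancing a well-designed scale-out system should perform without an administrator's involvement.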
Eventually, the nodes a customer adds to the storage cluster will change as hardware technology evolves. A properly designed, scale-out HPC storage cluster should meet the organization's HPC requirements for years, so it is vital that the system can incorporate newer technology alongside the existing nodes in the same cluster.
Take Metadata Seriously
Another scaling problem is specific to HPC: managing metadata as the system scales. HPC applications typically access massive amounts of unstructured data sequentially to perform analysis on that data. The organization continues to add more data to the HPC environment to improve accuracy. That growing data set consists of thousands, potentially millions, of individual files, which can range in size from very small to very large.
The HPC storage system not only has to handle a variety of file sizes, it also has to scale metadata processing. If the storage system cannot scale its metadata processing, it may hit a scaling limit long before it reaches the limits of the cluster itself. When the storage cluster has a metadata processing bottleneck, adding nodes becomes a case of diminishing returns, because the computing resources allocated to managing metadata do not grow as the cluster itself scales.
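A back-of-envelope model makes the diminishing-returns point concrete. All of the numbers below are invented purely for illustration; they assume a fixed pool of two metadata control nodes while the data nodes keep growing.

```python
# Illustrative model of a metadata bottleneck (all figures assumed).
DATA_NODE_GBPS = 2.0             # throughput added per data node (assumed)
METADATA_OPS_PER_CTRL = 50_000   # metadata lookups/sec per control node (assumed)
OPS_PER_GB = 1_000               # metadata lookups generated per GB/s of I/O (assumed)

def effective_throughput(data_nodes, control_nodes=2):
    raw = data_nodes * DATA_NODE_GBPS
    # Throughput can never exceed what the fixed metadata pool can service.
    metadata_ceiling = control_nodes * METADATA_OPS_PER_CTRL / OPS_PER_GB
    return min(raw, metadata_ceiling)

for n in (10, 50, 100, 200):
    print(f"{n} data nodes -> {effective_throughput(n):.0f} GB/s")
# Beyond 50 nodes the two control nodes cap throughput at 100 GB/s,
# so each additional data node adds capacity but no performance.
```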
All the nodes in a cluster should process metadata, instead of just a couple of control nodes. Spreading the workload across all the available cluster nodes allows the system to run at high rates of utilization, which improves performance and dramatically lowers costs. The cluster should also store metadata separately from the data itself, which again should improve metadata scaling and performance as well as increase metadata durability.
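The sketch below shows one way such a design could work in principle: metadata ownership is spread across every node by hashing the file path, and metadata is kept in its own per-node store, separate from the file data. It is a simplified illustration, not a description of any vendor's actual implementation, and the class and node names are hypothetical.

```python
# Sketch of spreading metadata service across every node in the cluster
# instead of a fixed pair of control nodes (illustrative design only).
import hashlib

class Cluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Metadata lives in its own per-node store, separate from file data.
        self.metadata = {node: {} for node in self.nodes}

    def _metadata_owner(self, path):
        # Hash the path so metadata load spreads evenly over all nodes.
        h = int(hashlib.sha1(path.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def record_file(self, path, size, data_nodes):
        owner = self._metadata_owner(path)
        self.metadata[owner][path] = {"size": size, "data_nodes": data_nodes}

    def lookup(self, path):
        # Any client can compute the owner directly; no central server needed.
        return self.metadata[self._metadata_owner(path)].get(path)

cluster = Cluster([f"node-{i}" for i in range(8)])
cluster.record_file("/project/run42/output.dat", 4096, ["node-2", "node-5"])
print(cluster.lookup("/project/run42/output.dat"))
# Because every node owns a share of the metadata, metadata processing
# capacity grows with the cluster rather than staying fixed.
```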
StorageSwiss Take
Many vendors claim scale-out capabilities, but most don't scale in a way that meets the demands of Commercial HPC. They often don't start small enough to get the project off the ground, or they don't scale large enough to meet the long-term needs of the Commercial HPC project. They also lack the metadata management needed to keep the storage cluster working as efficiently as possible. There are HPC systems, like those from Panasas, that can meet the scaling demands of the Commercial HPC environment. They can start small, grow big, and grow efficiently through proper metadata management.