Most storage systems provide several layers of protection. At a minimum, there is protection from media failure, typically delivered by some form of parity-based protection scheme. In enterprise systems, it is also common to have built-in redundancy in both hardware and software. Commercial HPC has the same requirements and expectations, but the way the traditional HPC storage environment delivers them causes issues for the enterprise use case.
Commercial HPC, as compared to traditional enterprise workloads, requires much more capacity to store its data. That data is usually unstructured, and file sizes can range from the very small to the very large. Commercial HPC also requires more performance, but its performance tends to be bandwidth-driven, so the ability to stream data is critical. The combination of high capacity and bandwidth-focused performance means that Commercial HPC storage systems need to implement data protection differently from traditional enterprise storage systems, or even traditional HPC storage systems.
Data protection is also more critical on Commercial HPC storage systems. These systems may hold the last known good copy of data, or data that is impossible to recreate. Additionally, some workloads take hours, days, or even weeks to complete. Commercial HPC cannot afford an interruption, nor the performance degradation that a media failure can cause.
Requirement #1 – Triple-Parity Protection
The first requirement of a Commercial HPC storage system is triple-parity data protection. Since the capacity requirements of Commercial HPC can dwarf those of the rest of the enterprise, the protection scheme used must be efficient. If the environment stores 500TB of data, forcing the organization to store 1PB or even 1.5PB to maintain a protected state may not only exceed the budget but also cause shortages in available floor space.
Many HPC storage solutions provide only replication for data protection. Replication protects against media failure within a node by creating two or three additional copies of data on other nodes in the storage cluster. The problem is that a replication-only model forces the organization to store two or three full additional copies of data. While replication does maintain performance during a failure, the level of exposure to an additional failure is enormous. Most enterprise storage systems support a single- or dual-parity protection scheme. While parity does not have the capacity waste of a replicated system, it can hurt storage performance if the system's design cannot maintain performance during a failure/rebuild process.
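The capacity trade-off between replication and parity can be sketched with simple arithmetic. The stripe widths below (8 data units per parity group) are illustrative assumptions, not figures from the article, but the 3-way replication result matches the 500TB-becomes-1.5PB scenario above:

```python
# Sketch: raw capacity required to protect a given amount of usable data
# under different protection schemes. Stripe widths are illustrative.

USABLE_TB = 500

def raw_needed(usable_tb, data_units, protection_units):
    """Raw capacity = usable * (data + protection) / data."""
    return usable_tb * (data_units + protection_units) / data_units

schemes = {
    "3-way replication":   (1, 2),  # 1 primary copy + 2 full extra copies
    "single parity (8+1)": (8, 1),
    "dual parity (8+2)":   (8, 2),
    "triple parity (8+3)": (8, 3),
}

for name, (d, p) in schemes.items():
    raw = raw_needed(USABLE_TB, d, p)
    overhead = (raw / USABLE_TB - 1) * 100
    print(f"{name:22s} raw: {raw:7.1f} TB   overhead: {overhead:5.1f}%")
```

Run against 500TB of usable data, 3-way replication demands 1,500TB of raw capacity (200% overhead), while a hypothetical 8+3 triple-parity layout needs only 687.5TB (37.5% overhead) yet still survives three concurrent drive failures per group.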
A Commercial HPC storage system needs to provide a parity-based protection scheme so that it does not waste capacity or data center floor space. Because restarting workloads is so time-consuming, it also needs multiple layers of redundancy, so that one or two drive failures don't stop an HPC process from executing.
Additionally, most HPC storage systems don't have distributed drive sparing. If a drive fails, an administrator needs to get involved and replace the failed drive. Given the number of drives common in a Commercial HPC environment, just finding the failed drive is challenging. Even after identifying the failed media, it likely takes core IT quite some time to replace it, which lengthens the window of exposure for the Commercial HPC workload.
Commercial HPC systems need distributed spares that are globally available to the system. With a distributed spare, when a drive fails, a spare is automatically assigned to replace it. Distributed spares mean that the rebuild can begin immediately upon failure identification, instead of waiting for the administrator to replace the drive and initiate the rebuild process.
Every data center, though, has to deal with media failures, but Commercial HPC may be in a failed state more often simply because of the number of drives it deploys to meet its capacity demands. During a failure, the storage system is one step closer to complete data loss, so a rapid rebuild is critical.
Many enterprise storage systems that use parity have lengthy rebuilds, especially as hard disk drive capacities continue to increase. In most cases, a rebuild requires a complete read of the drive, so the larger the drive, the longer the rebuild. The Commercial HPC system has to ensure that its software is efficient enough to read only the portion of the drive that contains data. It also needs to leverage all available processing power to perform the rebuild quickly.
Maintaining performance during the failed state is also critical. The Commercial HPC team likely has jobs running all the time; it can't afford to restart them, but it also can't have their processing slow down during a failure. Any parity-based system will impact performance during both parity calculation and the rebuild process. The Commercial HPC system has to take special precautions to make sure that the workloads counting on it don't see a loss in performance.
Finally, more can fail on a storage system than just the media. The Commercial HPC system needs to maintain availability 100% of the time. Again, long job run times mean that restarting a job from the beginning can be very time-consuming. The system needs complete redundancy in both hardware and software. While many systems have redundant hardware, they don't offer redundant software. Software redundancy allows a second copy of the file system to take over if the primary software fails.
StorageSwiss Take
Commercial HPC storage systems need predictable protection. They need to provide continuous access to data no matter the type of failure. They also need to make sure that performance in the failed state is similar to that in production. With capacities commonly measured in petabytes, Commercial HPC protection also needs to be capacity-efficient; adding 200% or 300% more capacity to maintain protection will not allow the organization to continue to scale its investment.