Recovering from a Storage System Failure

Posted on July 3, 2015 by George Crump

The Bad Day Begins

A message arrives on your smartphone indicating that the organization’s primary storage system is offline. All the applications that count on it are down, and users can’t get their work done. Further investigation determines that this is the worst-case scenario: the storage system has either suffered multiple drive failures or stopped working altogether. There are two big challenges. First, a full recovery of data needs to occur; hopefully all those backups worked. Second, even if the backups are good, there is nothing to recover to. The primary storage system needs to be fixed or replaced before the recovery process can start. This is the definition of a bad day.

What Causes Storage System Failure?

The good news is that a storage system failure as described above is rare. But while this type of failure is rare, the chances of facing it are higher than the chances of a data center being destroyed by a natural disaster. We plan for full data center destruction but not for the more common storage system failure. The reality is that the consequences of this type of failure are severe enough that IT needs a plan for when it does occur.

A complete storage system failure can occur for a variety of reasons, but it is generally caused by a hardware or software failure. From a hardware perspective, the most common event that takes a complete system down is multiple drive failures that exceed the redundancy of the RAID protection. The drives need to be replaced and the data recovered.
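To make "exceeding the redundancy of the RAID protection" concrete, here is a minimal sketch. It assumes the commonly cited worst-case drive-failure tolerance for a single RAID group at each standard level; the table and the function name `redundancy_exceeded` are illustrative and not taken from any vendor's tooling.

```python
# Worst-case number of simultaneous drive failures a single RAID group
# can survive, using the commonly cited figures for standard levels.
# (Assumption: RAID 1 here means a simple two-drive mirror.)
RAID_FAULT_TOLERANCE = {
    "RAID 0": 0,  # striping only, no redundancy
    "RAID 1": 1,  # two-drive mirror
    "RAID 5": 1,  # single parity
    "RAID 6": 2,  # double parity
}


def redundancy_exceeded(raid_level: str, failed_drives: int) -> bool:
    """Return True when more drives have failed than the level tolerates."""
    return failed_drives > RAID_FAULT_TOLERANCE[raid_level]


# A second drive failing in a RAID 5 group takes the group offline,
# while a RAID 6 group is still running (degraded) at that point.
print(redundancy_exceeded("RAID 5", 2))
print(redundancy_exceeded("RAID 6", 2))
```

Once this check comes back true for a group, rebuilding from parity is no longer possible and the data has to come from a backup or a replica.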

The second cause of failure is a bug in the software that drives the hardware. As storage hardware becomes increasingly “software defined”, the chance of that software having a bug could increase. We are expecting a lot from storage software today, including snapshots, clones, replication, automated tiering and cloud connectivity. As a result, there has been a rise in the number of software-related issues that can cause a storage system failure. These include volumes that disappear for no apparent reason and volumes that become corrupted. These failures require that something be fixed (drives, motherboard, etc.) before the restore process can begin.

If the failure is in software, the hardware is typically usable, although it will probably need to be reformatted before data is restored to it. At least there is hardware to restore to.

Protecting from Storage System Failure

Protecting from storage system failure requires more than just copying data from point A to point B. Meeting any reasonable recovery expectation requires a secondary storage system on which the application can be started. The good news is that secondary storage is very affordable today, and these systems can play a larger role than just being a standby for the primary storage system. But before implementing the secondary storage system, the IT planner needs to understand what the acceptable recovery point objective (RPO) and recovery time objective (RTO) will be for the applications counting on that system. Once the RPO and RTO are understood, IT planners will know what method they should use to get data to the secondary storage system and what type of secondary storage system they should buy.
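The way RPO and RTO narrow the choice of method can be sketched in a few lines. The example below is only an illustration: the class names, the `meets_objectives` helper, and all of the interval and restore-time figures are hypothetical values chosen for the example, not measurements of any product.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum acceptable window of lost data
    rto: timedelta  # maximum acceptable time until the application is back


@dataclass
class ProtectionMethod:
    name: str
    copy_interval: timedelta   # how often data reaches the secondary system
    time_to_resume: timedelta  # estimated time to restart the application


def meets_objectives(obj: Objectives, method: ProtectionMethod) -> bool:
    """Worst-case data loss is one copy interval; compare both limits."""
    return method.copy_interval <= obj.rpo and method.time_to_resume <= obj.rto


# Hypothetical numbers, for illustration only.
app = Objectives(rpo=timedelta(minutes=15), rto=timedelta(hours=1))
candidates = [
    ProtectionMethod("nightly backup, full restore",
                     timedelta(hours=24), timedelta(hours=8)),
    ProtectionMethod("hourly backup, recovery-in-place",
                     timedelta(hours=1), timedelta(minutes=30)),
    ProtectionMethod("replication to secondary storage",
                     timedelta(minutes=5), timedelta(minutes=20)),
]

for method in candidates:
    verdict = "meets" if meets_objectives(app, method) else "misses"
    print(f"{method.name}: {verdict} RPO/RTO")
```

With these made-up figures only the replication candidate satisfies both objectives. Relaxing the RPO to an hour would bring the recovery-in-place candidate back into consideration, which is why the objectives need to be settled before buying the secondary system.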

Storage Switzerland has covered this topic in depth recently. We’ve provided two webinars and two podcasts that discuss two methods to protect from storage system failure. The first webinar/podcast combo discusses how to use replication and the cloud to protect from a storage system failure. The second webinar/podcast combo discusses using backup and recovery-in-place to protect from a storage system failure. We urge you to listen to all four and make sure that your “bad day” is no big deal.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.

Tagged with: Accelerite, Cloud, Data Protection, ExaGrid, Recovery in Place, Replication, RPO, RTO, Veeam
Posted in Blog
