Part I – Protecting Applications From Storage System Failures
When most IT professionals think of a disaster recovery plan, they think big, really big: an earthquake, fire, or flood that leaves the data center totally inaccessible. To protect against this scenario, elaborate data replication technologies are usually deployed, with the primary focus of getting data out of the building. What is missing from most disaster recovery plans is accommodation for the smaller, less newsworthy events: a server outage, a storage failure, or an application outage. These can seem too small to plan for, or are simply forgotten, yet they can end up being the most costly disasters of all.
Major disasters have one key advantage: excusability. When a data center is out of service and the CIO can point to the local newscast televising the event, users, executives and even customers will likely be more patient and allow the recovery process to run its course. Minor disasters are not afforded that luxury; downtime caused by 'routine' failures is far less tolerated. With these smaller events there is a higher expectation of a quick return to service, and of doing so with less data loss.
Unfortunately, the system failures that cause minor outages are far more likely to occur than the 'act of God' disasters that capture all the media attention. Some studies show that an application outage caused by a server or storage failure can happen as often as once per month. While all disasters impact user productivity, these minor disasters can be more problematic in one respect: because the rest of the business keeps running, a single downed application can become a choke point in the organization's primary delivery of product or service. That can delay revenue recognition and cost the company money. Clearly, keeping business-critical applications available to users should be a critical component of any disaster recovery solution.
Potentially more severe than lost productivity is the impact of an outage on the customer experience. If systems are down, or even slowed because a background recovery process is running, then customers, especially in today's online world, are more likely to go somewhere else to get what they need. Not only is this a direct loss of that particular sale, it could also be a long-term loss of repeat business when that customer decides not to come back.
What makes minor disasters so hard to recover from is that they are often not planned for when an application project is started. Typically these plans include an elaborate data replication strategy but do not accommodate something simple like a server going down or a single storage failure. This oversight occurs because the people planning the application project are not typically the people on the storage team. While project owners may inform the storage team of capacity requirements, RAID configurations and data protection needs, they often don't communicate recovery-time or performance needs. As a result, application owners and storage managers are left scrounging for available resources to come up with a viable workaround for availability.
The problem is that most current data protection resources are not application-aware, so administrators must leverage whatever is available. In most cases this means typical storage system features like snapshots, replication and backups, or OS/application capabilities like mirroring and transaction log shipping. While all of these techniques have value for general-purpose use, like file servers, they are not adequate for the applications the business relies on to produce its work. Those include not only the applications typically deemed 'mission critical' but also the applications that are part of the daily workflow. Both types need protection beyond what storage systems alone can provide.
The largest potential problem is that none of these techniques is viable after a storage hardware failure. The data may be available at the remote site or on a local backup device (disk or tape), but it must be transferred back to another storage system before recovery can begin. That also means the recovery storage system needs to be prepared ahead of time for the inbound data; it must have the capacity and performance the application needs, and it must be accessible to the server the application runs on. If any of these steps is overlooked, the organization is penalized with a longer outage and even more lost productivity.
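The scale of that penalty is easy to underestimate. A back-of-envelope calculation shows why restoring from a backup target is so slow; the data set size and transfer rate below are hypothetical, purely for illustration:

```python
# Back-of-envelope restore-time estimate (hypothetical numbers).
# Restoring from a backup target means moving the full data set back
# onto a prepared storage system before the application can restart.

def restore_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours to copy data_tb terabytes at throughput_mb_s MB/s."""
    data_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal units)
    return data_mb / throughput_mb_s / 3600

# Example: a 10 TB data set restored over a link sustaining 200 MB/s:
print(f"{restore_hours(10, 200):.1f} hours")  # ~13.9 hours
```

Even with the target array provisioned and waiting, the application is down for the entire copy; if the array still has to be configured, that time is added on top.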
Even in less severe situations, like a single drive failure, the existing recovery techniques can be less than ideal. For example, many RAID rebuilds take double-digit hours to complete. Finishing in even that amount of time often means giving the rebuild top priority for storage resources, which significantly impacts application performance. Many RAID storage systems allow the rebuild priority to be lowered to reduce that impact, but then the data is exposed to a second drive failure, and potential total data loss, for a longer period of time, sometimes days.
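The trade-off is stark when put in numbers. The drive size and rebuild rates below are hypothetical, but they illustrate how throttling a rebuild to protect application performance stretches the exposure window:

```python
# Illustrative RAID rebuild trade-off (hypothetical numbers).
# Lowering rebuild priority takes less I/O away from applications,
# but lengthens the window in which a second drive failure is fatal.

def rebuild_hours(drive_tb: float, rebuild_mb_s: float) -> float:
    """Hours to rewrite one drive of drive_tb terabytes at rebuild_mb_s MB/s."""
    return drive_tb * 1_000_000 / rebuild_mb_s / 3600

full_speed = rebuild_hours(4, 100)  # rebuild gets top priority
throttled = rebuild_hours(4, 20)    # rebuild throttled to spare apps
print(f"full speed: {full_speed:.1f} h, throttled: {throttled:.1f} h")
# full speed: 11.1 h, throttled: 55.6 h
```

At full speed the array is degraded for half a day; throttled, it is degraded for more than two days, matching the "sometimes days" exposure described above.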
Storage services like RAID, replication and snapshots are not 'aware' of much beyond the loss of a connection to the physical servers. Most operating system-level protection strategies, like mirroring or clustering, are aware of a physical hardware failure, but they are often not aware of 'in application' issues like data corruption, an application freeze or a drop in application performance. And operating system-level products often require shared storage systems, which moves the 'weak link in the chain' back to the storage system.
To solve these issues customers need to turn to application availability solutions like those from the Neverfail Group. These software-based solutions monitor and protect the application itself, taking an application view of systems availability. They perform their own data replication and can start the applications on another physical or virtual machine in the event of a hardware failure. Because they are software based, they not only address the shortcomings of storage system-level protection, they also address those of operating system-level or cluster-level availability solutions, all while providing greater flexibility and cost effectiveness.
The Neverfail Group application-aware availability solutions monitor 'from the application out'. This allows them to make sure that the application is up and running and performing at the desired level. While they can leverage shared storage it isn't required, and the redundant copy of data is often stored on a separate physical array. This means that the application is also protected from double-drive RAID failures or complete storage system failures. Finally, they complement the major disaster recovery plan as well, by allowing remote preparation and launch of applications in secondary sites.
Adding an application-aware capability can fill a critical gap in your current disaster recovery plan. In addition to supporting the catastrophic DR scenario, it also provides coverage for the disasters you are most likely to experience this year, potentially twelve or more times.
In the next article in this series, Storage Switzerland will compare the various availability methods that data centers use with application-specific recovery products.
Neverfail Group is a client of Storage Switzerland