Other than the obvious people issues, when a disaster strikes what's most important for the data center? Given time to think about it, most IT managers will answer the question correctly: the most important thing is to restore access to mission-critical and business-important applications for the users. Returning services to users, though, especially under the duress of a disaster, requires that many carefully planned steps be taken long before the disaster ever happens. As a result it may be more accurate to say that disaster planning is what matters most when disaster strikes.
These planning steps include data movement, disaster testing, disaster detection, remote application startup, and reconnection of the users. All of these steps contribute to a successful return of service; in most cases the absence of any one of them will lead to a failed recovery effort. The problem is that most disaster recovery plans stop at data movement. Without the remaining steps, the overall goal of restoring services to users may never be met.
Data Movement Is Important
Data movement is the foundational component of any disaster recovery plan: data has to be in a remote location before the recovery effort can begin. But what matters just as much is the form in which the data is stored at the remote location, a characteristic that determines how quickly it can be accessed.
Moving data to a remote location is often handled by a backup solution, either by replicating a disk-to-disk backup remotely or by shipping duplicate tape cartridges off-site. While extremely cost-effective and appropriate for non-critical applications, these systems are typically too slow for recovery of business-important or mission-critical applications. One of the problems with using backup as a disaster recovery component is the accessibility of the data, which is almost always stored in a backup format. When a disaster strikes, the backup software needs to be reloaded, and the backup media has to be read and its contents pushed back across the network to a recovery server. The time to read all of a server's information and then write it will often push a recovery effort outside of the target Recovery Time Objective (RTO). Backup applications have become quite efficient at using deduplication and compression technologies to reduce bandwidth requirements and speed backups. However, at least as of now, the entire data set has to be recovered, and there is limited bandwidth efficiency in recovery.
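The RTO arithmetic above is easy to sketch. The figures below are illustrative assumptions, not vendor benchmarks, but they show why a full restore from backup format so often misses the target:

```python
# Back-of-the-envelope restore-time estimate for a backup-based recovery.
# Throughput and data-set size are assumed example values.

def restore_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to read and write data_gb at an effective throughput."""
    seconds = (data_gb * 1024) / throughput_mb_s  # GB -> MB, then divide by MB/s
    return seconds / 3600

# Example: 5 TB of server data restored at an effective 100 MB/s
hours = restore_hours(5 * 1024, 100)
print(f"{hours:.1f} hours")  # → 14.6 hours
```

At roughly 14.6 hours, even this modest data set blows through a typical 4-hour RTO before users see a single application come back.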
To alleviate this problem some data centers have begun to leverage a storage array's ability to do block-level replication or to replicate snapshot data. The problem is that these snapshots are dependent on the primary storage system remaining active until the replication job completes. Most snapshot replication techniques are not continuous, but happen at intervals of an hour or more. This means that in the event of a disaster, any data written since the last replicated snapshot may be lost. Array-based replication is also not host-aware, meaning it has no understanding of what operating system is running on the host or even the health of that host. Finally, while these solutions do save data in a usable format, most array-based replication products require that a similar storage system from the exact same manufacturer be in the disaster recovery location, adding significantly to the infrastructure cost.
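The exposure window of interval-based snapshots can be stated as a simple worst case. This is a minimal sketch with assumed figures, but it makes the data-loss risk concrete:

```python
# Worst-case data-loss window (effective RPO) for interval-based snapshot
# replication. A disaster striking just before the next snapshot completes
# loses everything written since the last snapshot finished replicating.

def worst_case_loss_minutes(snapshot_interval_min: float,
                            replication_lag_min: float) -> float:
    """Minutes of writes at risk under snapshot-plus-replication protection."""
    return snapshot_interval_min + replication_lag_min

# Hourly snapshots that take 15 minutes to replicate off-site:
print(worst_case_loss_minutes(60, 15))  # → 75 (over an hour of data at risk)
```

By contrast, the real-time replication approach described next shrinks this window to seconds.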
The next step up is real-time data replication. These products typically run on the server to be replicated and send information across a network or WAN segment as it changes. While the information can be slightly out of sync, it typically lags by no more than a few seconds. These solutions often place data in a usable format that can be immediately accessed, eliminating the need to restore or move data to another host. Additionally, real-time replication products allow for 'any-to-any' replication, meaning that mixed storage systems can be leveraged to help contain costs, and physical systems in one location can replicate to virtual systems in the DR site.
The problem with most real-time replication products is that they’re not application-aware and as a result may not know that an application has failed or even that a disaster has occurred. This means that they don’t have the ability to interface with the application and flush its cache so that a consistent copy is made in the DR site. They also don’t have an understanding of the application stack. Most applications are a group of servers that are working together to provide the service to the users. For example, there may be a web host, a database server, and an application server that together provide a single service to the users.
The inability to detect failure has led to the development of application-aware products like those from the Neverfail Group. Application understanding, therefore, becomes the key capability needed to complete the final steps in planning the return of service to users.
Detection Is Important
The ability to know that an application or a site has failed is critical. Application-aware replication products like those from Neverfail enable data centers to provide failover no matter how an application has become non-responsive. For example, an application can freeze or an operation can lock up while the physical server can still be pinged on the network. Even a lack of acceptable performance should be considered a form of failure: if the application responds so slowly that users stop using it, it's functionally down. Replication products that aren't application-aware won't pick up any of these soft failures.
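The gap between "the host answers pings" and "the application answers users" can be illustrated with two probes. This is a generic monitoring sketch, not Neverfail's implementation; the function names and the services being checked are hypothetical:

```python
import socket
from urllib import request

def host_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Network-level check: can we complete a TCP handshake at all?
    This is roughly what a ping-style monitor verifies."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def app_healthy(url: str, timeout: float = 3.0) -> bool:
    """Application-level check: does the service answer within a deadline?
    A hung or crawling application fails this even though the host is 'up'."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, timeouts, refused connections
        return False
```

A soft failure is precisely the case where `host_reachable(...)` still returns True while `app_healthy(...)` returns False; a monitor that only runs the first check will never trigger failover.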
Easy Testing Is Important
A DR plan that does not work when it's needed most is not really a DR plan. Continual testing of DR plans to make sure they remain viable is important: the plans have to keep up with changes in the infrastructure, and confirming that employees understand how to bring systems back online is critical to success. The problem is that many of the above data movement methods make application testing almost impossible, or at best make it a weekend-long 'herculean' event that's conducted too infrequently.
Application-aware testing, especially when combined with virtualization capabilities, is ideal for testing disaster recovery plans; it greatly simplifies the process and brings costs down. This is because application-aware DR tools not only understand the application, they also understand the surrounding component stacks of that application. This means that with the click of a button the entire application stack can be brought online in a disaster recovery site in the correct order. For example, when recovering SharePoint as an application, its supporting SQL Server must be brought online first; failure to do so will cause the process to abort.
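Ordered stack recovery is essentially a dependency sort. The sketch below shows the idea with the standard library's `graphlib`; the service names mirror the SharePoint example but are illustrative, not a map of any real deployment:

```python
from graphlib import TopologicalSorter

# Hypothetical application stack, mapping each service to the services
# it depends on (which must therefore be started first).
stack = {
    "sharepoint_web": {"app_server"},
    "app_server": {"sql_server"},
    "sql_server": set(),
}

# static_order() yields services dependency-first, i.e. a safe boot order.
boot_order = list(TopologicalSorter(stack).static_order())
print(boot_order)  # → ['sql_server', 'app_server', 'sharepoint_web']
```

An application-aware DR tool carries this dependency graph with it, which is why a single click can bring the whole stack up without an operator remembering that SQL Server comes first.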
Application Awareness Is MOST Important
Ironically, returning services is not the most important DR step. In reality it’s simply the result of proper data movement, detection and testing, all parts of the planning process. This makes application awareness the most important enabler of the disaster recovery plan because, in addition to getting data out of the building in near real-time and being able to detect an application or site failure, it also enables a critical aspect of any disaster recovery plan – testing. It also provides the ability to return a usable application service, not just the data, to the users because of its understanding of the complete application stack.
If data movement, problem detection, or testing is left out of the disaster recovery process, that process is likely to fail when needed most. Application awareness, then, is most important because it encompasses and enables all of these critical components, making sure that when the worst-case scenario occurs the business will be able to survive.
Neverfail Group is a client of Storage Switzerland