Will the foundation of your Disaster Recovery plan collapse?

Posted on September 25, 2014 by George Crump

The ability to replicate data between data centers as it changes is an essential ingredient of any enterprise class storage system. Data centers count on this capability as the foundational component in their disaster recovery (DR) plans. But this foundation is undergoing several seismic shifts that are making it unstable and that, combined, may cause the entire DR strategy to collapse. A DR plan failure can mean loss of revenue and regulatory fines, and may eventually cause the failure of the business.

The DR Foundation

The DR foundation for most data centers consists of using array based replication software on enterprise class storage systems. These systems provide a single point of replication for the enterprise’s data set. Array based replication software is connected to the wide area network (WAN) through a storage extension. At the DR site, data is received through an extension switch and passed to a similar storage system there. The pillars of this foundation are the quality and bandwidth of the WAN segment and the capabilities of the storage extension.

Thanks to the modern always-on, accessible-from-anywhere data center, today’s CIO is dealing with four stress points that are causing their DR foundation, built on these pillars, to fracture beneath them: higher expectations, more business critical applications, massive data growth and increased geographic requirements. In addition to the ramifications mentioned above, failure to address these stress points can have a personal impact on the CIO and their IT team. It is critical that IT understands these stress points and moves now to address them. Once one or more of these stress points is compromised, the DR plan is subject to fracture and its total collapse is imminent.

Stress Point 1: Disaster Recovery Expectations are Higher

Less than a decade ago, simply making sure that data was safely off site and available for recovery in the event of a disaster met most organizations’ recovery requirements. Most DR foundations were built on a tape-based process in which tapes were sent off site and then manually recovered in the event of a system or site failure. Users expected to be down for a week or more.

But now, thanks in part to the assumed high availability of cloud applications, IT departments are being held to a higher standard. Companies and users expect nearly instant recovery of business critical applications, regardless of the nature of the disaster. These expectations mean that the replicated copy of the data must be more closely sync’d with the primary copy of data, in some cases within seconds.

In an attempt to address the expectation stress point, data centers have been buying larger amounts of redundant bandwidth from multiple sources. But this creates WAN silos that are neither automatically redundant nor incrementally scalable. As a better alternative, DR planners need to look for enterprise class extension products that can aggregate multiple WAN connections from multiple WAN providers into a single connection. Doing so provides scalability and redundancy. It also provides better sharing and utilization of the aggregated bandwidth, since an application has access to all of the bandwidth at once.
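The difference between WAN silos and an aggregated connection can be illustrated with a small sketch. The provider names and link capacities below are hypothetical, and the two functions are an illustrative model, not the behavior of any particular extension product.

```python
# Hypothetical link capacities in Mbit/s; provider names are illustrative only.
links = {"provider_a": 1000, "provider_b": 1000, "provider_c": 500}


def silo_bandwidth(links, pinned_to, failed=()):
    """Bandwidth available to a replication task pinned to a single WAN silo."""
    return 0 if pinned_to in failed else links[pinned_to]


def aggregated_bandwidth(links, failed=()):
    """Bandwidth available when all surviving links form one logical connection."""
    return sum(cap for name, cap in links.items() if name not in failed)


# Normal operation: the siloed task sees one link, the aggregated task sees all.
print(silo_bandwidth(links, "provider_a"), aggregated_bandwidth(links))

# provider_a fails: the siloed task stops, the aggregated task keeps running
# on the remaining capacity.
print(silo_bandwidth(links, "provider_a", failed=("provider_a",)),
      aggregated_bandwidth(links, failed=("provider_a",)))
```

In this model, a task pinned to a silo loses all connectivity when its link fails, while the aggregated connection only loses that link's share of the total.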

Stress Point 2: There are More Business Critical Applications

The second stress point is caused by the fact that we live in an “app” driven world. The number of applications in the data center is growing at an alarming pace and the number of these “apps” that qualify as “business critical” is significantly higher than it has been in the past. Traditionally, there were only one or two mission critical applications, like financials or ERP, that needed rapid recovery from a disaster. But now there are dozens, if not hundreds, of applications that are considered business critical and need to be included in the scope of the disaster recovery plan.

The increase in applications not only means that more data needs to be transferred to the secondary data center, but also that more individual replication tasks may need to be monitored. The growth in the number of applications that need to be recovered in the event of a disaster has put more pressure on data centers to maintain consistent and reliable connectivity to the disaster recovery site.

While array based replication at the block level can support many applications within a replication session, it is important to identify which applications need a high level of availability, since there is a reciprocal expense at the DR site for each application replicated. In addition to the increase in the number of applications, the number of storage devices that can and need to be replicated off site has also grown. The problem is that each of these will have its own replication task. The sheer quantity of these devices and replication tasks makes it difficult to configure, monitor and troubleshoot distance transport support for individual storage devices.

Central management of a large quantity of replication tasks on an extension device is highly beneficial for large scale deployment. Effective WAN bandwidth sharing by independent replication tasks from independent storage devices can be achieved through an extension device that uses QoS and dynamic bandwidth sharing. The alternative, allocating fixed bandwidth to each individual replication task, leads to inefficient use of the WAN bandwidth investment.
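To make the contrast between fixed and dynamic bandwidth allocation concrete, here is a minimal sketch. It assumes a simple weighted max-min fair sharing model as a stand-in for "QoS and dynamic bandwidth sharing"; the task names, demands and weights are hypothetical, weights are assumed to be positive, and real extension devices may use a different scheme.

```python
def fixed_allocation(capacity, demands):
    """Split capacity evenly; each task is capped at its fixed share."""
    share = capacity / len(demands)
    return {task: min(demand, share) for task, demand in demands.items()}


def dynamic_allocation(capacity, demands, weights=None):
    """Weighted max-min fair sharing (an assumed model, weights must be > 0).

    Bandwidth a task does not need is redistributed to tasks that still
    have unmet demand.
    """
    weights = weights or {task: 1.0 for task in demands}
    alloc = {task: 0.0 for task in demands}
    active = {task for task, demand in demands.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[t] for t in active)
        satisfied, spent = set(), 0.0
        for t in active:
            # Offer each active task its weighted share of what is left.
            offer = remaining * weights[t] / total_weight
            grant = min(offer, demands[t] - alloc[t])
            alloc[t] += grant
            spent += grant
            if alloc[t] >= demands[t] - 1e-9:
                satisfied.add(t)
        remaining -= spent
        if not satisfied:
            break  # every task took its full offer, so capacity is used up
        active -= satisfied
    return alloc


# Hypothetical replication tasks and their current demand in Mbit/s.
demands = {"erp": 600, "email": 100, "backup": 100}
print(fixed_allocation(900, demands))
print(dynamic_allocation(900, demands))
```

With these numbers, fixed allocation caps the busiest task at a third of the link even though the other two tasks leave most of their shares idle, while the dynamic model lets it use the spare capacity.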

Stress Point 3: Massive Data Growth

The third stress point stems from the reality that not only are there more applications, but the size of the data that these applications create and manage is also growing rapidly. Simply put, more data than ever needs to be transferred to a secondary site, with an expectation that applications will return to operation almost instantly after an outage and without data loss.

Data growth may be the single biggest stress point to be concerned with since it somewhat obviously consumes the bandwidth that data centers seem to be buying in droves. The real impact of data growth, however, is the pressure it puts on the WAN ecosystem (array based replication engine, extension device and WAN segment) to be fast with high throughput, consistently available and rapidly expandable. An outage of any sort, for even a few minutes, can cause the replication process to fall hopelessly behind, forcing the execution of a complete data re-sync. And of course, while this re-sync is under way, keeping all those apps protected from a disaster and meeting user expectations of instant recovery is nearly impossible.
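A rough back-of-the-envelope calculation shows why even a short outage hurts when the link has little headroom. The sketch below assumes a steady data change rate and a steady WAN throughput, both hypothetical figures; it does not model limits inside the replication engine, which may force a full re-sync earlier than this arithmetic suggests.

```python
def catch_up_hours(change_rate_mbps, wan_mbps, outage_minutes):
    """Hours needed to drain the backlog that built up during an outage.

    Returns None when the link has no headroom over the change rate,
    meaning replication can never catch up.
    """
    headroom = wan_mbps - change_rate_mbps
    if headroom <= 0:
        return None
    backlog_megabits = change_rate_mbps * outage_minutes * 60
    return backlog_megabits / headroom / 3600


# Hypothetical figures: a 1000 Mbit/s link and a 30 minute outage.
for change_rate in (800, 950, 1000):
    print(change_rate, catch_up_hours(change_rate, 1000, 30))
```

As the change rate approaches the link capacity, the catch-up time grows sharply, and once the two are equal the backlog is never cleared.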

Stress Point 4: Increased Geographic Requirements

The fourth and final stress point is increased geographic requirements. New regulations require that the recovery site be further away from the data center than it was in years past. This is being caused not necessarily by an increase in regional disasters, but by the expectation that the organization will be able to maintain operations without interruption, even during a regional outage. In other words, it is a combination of the realities of a more interconnected world and the raised expectations of the participants in that world.

While increased geographic requirements do not necessarily create additional requirements for a storage extension product, they do amplify all of the other stress points, since more has to be done across longer distances without compromising reliability or recovery time and point objectives (RTO/RPO).

The State of DR Enablement

It is imperative that IT planners move rapidly to address these stress points so that their disaster recovery plan does not shatter at the foundation. Thanks to array based replication software, which has become standard on both enterprise class storage systems and purpose built backup appliances that leverage deduplication and compression, part of the solution to these stress points exists today.

But it is important to understand that array based replication capabilities alone won’t address and eliminate all of these stress points. The sheer quantity of replication tasks makes them difficult to manage, and fixed bandwidth allocation for individual replication tasks leads to poor bandwidth utilization. An extension device is needed to recover from network errors, to provide non-disruptive WAN access to the various replication tasks and to increase WAN utilization.

Fortunately, data centers can now apply the most appropriate level of protection to each specific application instead of blanket protection for the whole data center. Not all applications and not all data are created equal, and they don’t all need the same level of recoverability in the event of a disaster. But they do need to be recovered at some point, so getting the data to the disaster recovery site in the most cost appropriate fashion is vital.

Critical applications, whose number, again, is growing, can leverage application-specific replication tools that can properly capture the data and sense when the application has failed. Rapid DR can be applied more broadly using array based replication, allowing a single replication task to protect much of the environment. Many array based replication tools now have hooks into applications so that data can be captured in a known good state. Finally, less critical data can be moved to the DR site via disk backup appliances.
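The three protection levels described above could be expressed as a simple classification rule. The sketch below is purely illustrative: the `App` fields, the example applications and the 15 minute RPO threshold are assumptions made for the example, not figures from any standard or product.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    rpo_seconds: int               # how much data loss the business tolerates
    needs_failure_detection: bool  # must the tool sense application failure?


def protection_tier(app, rapid_rpo_seconds=900):
    """Pick a protection method; the 900 second threshold is an assumed value."""
    if app.needs_failure_detection:
        return "application-specific replication"
    if app.rpo_seconds <= rapid_rpo_seconds:
        return "array based replication"
    return "disk backup appliance"


# Hypothetical applications used only to exercise the rule.
apps = [
    App("order-entry", rpo_seconds=5, needs_failure_detection=True),
    App("file-shares", rpo_seconds=600, needs_failure_detection=False),
    App("archive", rpo_seconds=86400, needs_failure_detection=False),
]
for app in apps:
    print(app.name, "->", protection_tier(app))
```

In practice the thresholds would come from each application's agreed recovery objectives and from the cost of replicating it at the DR site.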

The technology to copy this data to the disaster recovery site clearly exists. Just as important, the raw WAN bandwidth is available from a variety of vendors. As a result, the throughput of that bandwidth is increasing while the cost of that bandwidth is decreasing. But WAN connections still remain potentially the most unreliable part of the disaster recovery ecosystem. These connections frequently generate errors that force transmission retries and, of course, can suffer outright outages. As stated above, it is critical for the WAN connection to stay constant and it is the responsibility of the extension device to provide that consistency by aggregating multiple WAN segments from multiple suppliers into a single scalable, yet highly redundant connection.

DR is a Storage Function

The vulnerability of the WAN connection highlights another problem in maintaining application disaster recovery commitments: the importance of effectively managing and monitoring these connections. The potential for error here is high, often because the WAN connection serves two masters. The WAN is often the responsibility of the network team, but making sure the right data is in the disaster recovery site at the right time is the responsibility of the storage team. To help resolve this conundrum, an extension solution should provide the storage team with a view into the health and performance of the WAN connection, so they can proactively make decisions about bandwidth provisioning and application readiness and, in the event that replication tasks are experiencing stress, determine whether the WAN or a storage device is actually at fault.

Enterprise Class Storage Extension

The raw capabilities for resolving the stress fractures that threaten the foundation of the disaster recovery plan are in place. However, with rapid growth in replicated data and the use of higher bandwidth, yet often unreliable, WAN connections, extension solutions need to scale up in throughput performance and provide higher visibility of WAN connection health while maintaining non-disruptive WAN connectivity. In short, enterprises need an enterprise class extension device. This device can’t be a faster branch office solution; it has to be an enterprise class device designed for very high bandwidth and large scale deployment.

Summary

The expectations of disaster recovery have changed. Organizations are demanding that more applications be recovered in a shorter time frame than ever before. The problem is that the foundational core of any disaster recovery plan, replication through an extension device across a WAN to a disaster recovery site, is developing stress fractures that may lead to a total DR collapse. While replication and backup solutions that can move data to DR sites are more widely available and less expensive, they still count on a somewhat fragile WAN connection to complete the job. And while the WAN connection has become more cost effective and provides more throughput, it still needs to be managed so that errors and outages don’t put the business’s disaster recovery strategy at risk.

It is imperative that IT planners focus on the missing link: the connection between the storage system and the WAN provided by an extension device. This connection needs to be managed by an enterprise-class extension device that can provide the scale-up throughput performance, non-disruptive WAN connectivity and management visibility that CIOs need in order to avert a foundational collapse of their disaster recovery plan. Enterprise class extension products that can meet these requirements and solidify the foundation that supports the above pillars are now coming to market, and IT planners should be deploying these solutions from storage infrastructure providers before the next disaster strikes.

This Article Sponsored by Brocade

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
