Falling Backward – Disaster Recovery Plans Have to Include Fail Back

When creating or testing a disaster recovery (DR) plan, most IT professionals tend to focus on getting the data out of the building and safely stored in a remote location. Yet much of the success or failure of a DR plan depends on the ability of users to access applications after the DR event occurs. Not surprisingly, much of the focus of a DR plan is on getting through the actual DR event, not on what will happen after the threat has passed. The fail back step in the DR process is potentially as critical as any other step.

What is Fail Back?

Fail back is the process of returning the primary data center to production after a disaster. This involves identifying which data was changed at the DR site while it was standing in for the production data center and then copying that data back to the primary data center.
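
To make the idea concrete, here is a minimal sketch in Python of that resynchronization step. It assumes a hypothetical change journal that records which blocks were modified at the DR site after failover; the names and structures are illustrative only and are not taken from any particular replication product.

```python
# Illustrative sketch: copy back only the blocks that changed at the DR site
# while it was acting as the production site. All names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    blocks: dict = field(default_factory=dict)  # block_id -> data


def fail_back(dr_site: Site, primary_site: Site, changed_since_failover: set) -> int:
    """Copy the blocks modified at the DR site back to the primary site."""
    copied = 0
    for block_id in changed_since_failover:
        primary_site.blocks[block_id] = dr_site.blocks[block_id]
        copied += 1
    return copied


if __name__ == "__main__":
    dr = Site("dr", {1: b"orders-v2", 2: b"users-v1", 3: b"logs-v5"})
    primary = Site("primary", {1: b"orders-v1", 2: b"users-v1"})
    # Blocks 1 and 3 were written at the DR site while it stood in for production.
    moved = fail_back(dr, primary, changed_since_failover={1, 3})
    print(f"Copied {moved} changed blocks back to the primary site")
```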

The primary reason fail back is so important is that in many cases a disaster is declared because there is the potential for harm to the primary data center. When that threat does not actually impact the data center, it is important for the business to return to normal operations. An ideal example is a data center in the path of a threatening hurricane that, as hurricanes often do, changes course after the organization has already declared a disaster.

DR Plans

The threat level and the probability that a disaster will strike the primary data center often dictate which tasks in the DR plan are activated. A low-probability threat may just mean making sure the remote site is prepared; a higher-probability threat may require that applications and a recent copy of data be available at the DR site. An imminent threat may trigger applications and organizational services to be hosted from the DR site. It is this last scenario in which fail back becomes the most difficult.
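
As a rough illustration, the tiers described above could be modeled as a simple mapping from threat level to plan actions. The level names and actions below are drawn from this paragraph; they are an example, not a prescribed standard.

```python
# Hypothetical illustration of tiering DR-plan actions by threat level.
from enum import Enum


class ThreatLevel(Enum):
    LOW = "low"            # disaster possible but unlikely
    ELEVATED = "elevated"  # disaster reasonably likely
    IMMINENT = "imminent"  # disaster expected to strike


DR_ACTIONS = {
    ThreatLevel.LOW: ["verify the remote site is prepared"],
    ThreatLevel.ELEVATED: [
        "verify the remote site is prepared",
        "stage applications and a recent copy of data at the DR site",
    ],
    ThreatLevel.IMMINENT: [
        "verify the remote site is prepared",
        "stage applications and a recent copy of data at the DR site",
        "host applications and organizational services from the DR site",
    ],
}


def actions_for(level: ThreatLevel) -> list:
    """Return the DR-plan tasks activated for a given threat level."""
    return DR_ACTIONS[level]


print(actions_for(ThreatLevel.ELEVATED))
```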

In an imminent disaster situation, data is now being added and updated at the DR site, often by remote users who are not physically located at either site. The longer this condition continues, the more out of sync the primary site becomes and the more difficult it is to return operations to it.

If the imminent threat does not materialize, the primary site may still be operational. In many cases it still has power, Internet connectivity and the ability to receive data; it simply may not have any end users on premises.

It is important to maintain connectivity to the primary site for as long as possible so that when it comes time to fail back, the amount of data that needs to be resynchronized between the DR site and the primary site is minimized. In other words, the DR site becomes the source location and what was the primary data center becomes the target. Just as replication to the DR site conserves bandwidth by sending only changed data segments, the primary site can be updated from the DR site with the same efficiency. This data flow continues until one of two scenarios occurs: either the imminent threat actually strikes and connectivity to the primary data center is lost, or nothing happens and the primary site is ready to return to operations.
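
The following is a minimal sketch of change-only replication, assuming fixed-size segments and a simple hash comparison. It is purely illustrative, not a description of how any particular product implements replication; note that reversing the direction is only a matter of which copy is treated as the source and which as the target.

```python
# Illustrative sketch of change-only replication between two copies of data.
import hashlib

SEGMENT_SIZE = 4  # tiny segments purely for demonstration


def split_segments(data: bytes) -> list:
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]


def replicate_changes(source: bytes, target: bytes) -> bytes:
    """Bring target up to date by transferring only the segments that differ."""
    src_segs, tgt_segs = split_segments(source), split_segments(target)
    updated, sent = [], 0
    for i, seg in enumerate(src_segs):
        tgt = tgt_segs[i] if i < len(tgt_segs) else b""
        if hashlib.sha256(seg).digest() != hashlib.sha256(tgt).digest():
            updated.append(seg)  # changed segment: this is what crosses the wire
            sent += 1
        else:
            updated.append(tgt)  # unchanged segment: no bandwidth used
    print(f"sent {sent} of {len(src_segs)} segments")
    return b"".join(updated)


# During the DR event the roles reverse: the DR site is the source and the
# original primary site is the target.
dr_copy = b"AAAABBBBCCCCDDDD"       # data as it now exists at the DR site
primary_copy = b"AAAAXXXXCCCCDDDD"  # stale copy still held at the primary site
primary_copy = replicate_changes(source=dr_copy, target=primary_copy)
assert primary_copy == dr_copy
```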

In the first scenario, loss of the primary data center, the loss is most often just a loss of connectivity, not a physical loss of the entire site. In this situation, keeping the primary site up to date for as long as possible minimizes the amount of data that needs to be replicated when power and connectivity are restored.

In the second scenario, the primary data center stays operational, and all that has to happen is the promotion of the primary data center back to its original role as the source, since the replication process has been continually updating it. Again, products like Vision Solutions’ Double-Take automate this process and allow a push-button fail back to the primary site. Once fail back is complete, the DR site can once again be set as the target and replication can resume. The result is minimal data movement at critical moments and a fast return of operations at the primary location.
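
A rough sketch of that role swap might look like the following. This is not Double-Take’s interface; it only illustrates the sequence described above: confirm the primary copy is in sync, promote the primary back to the source role, then resume replication with the DR site as the target.

```python
# Hypothetical orchestration sketch of the fail back role swap.
from dataclasses import dataclass


@dataclass
class ReplicationPair:
    source: str
    target: str
    in_sync: bool = False

    def final_sync(self) -> None:
        # In a real product this would flush any remaining changed segments
        # from the source to the target before the swap.
        self.in_sync = True

    def fail_back(self) -> None:
        if not self.in_sync:
            raise RuntimeError("final sync required before fail back")
        # Promote the original primary back to its source role and make the
        # DR site the replication target again.
        self.source, self.target = self.target, self.source
        self.in_sync = False  # replication restarts toward the new target


pair = ReplicationPair(source="dr-site", target="primary-site")
pair.final_sync()
pair.fail_back()
print(pair)  # ReplicationPair(source='primary-site', target='dr-site', in_sync=False)
```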

Fail back is a critical element that is often overlooked when planning for disaster. It should not be forgotten, because more DR events end up being false alarms than actual occurrences. While these false alarms have value in verifying the validity of the DR plan, you do not want the return to the primary data center delayed because of one. Products like Vision Solutions’ Double-Take provide the answer by allowing rapid failover to the DR site, continuous updating of the primary data center while it is under watch and simple fail back when the potential disaster has passed.

Twelve years ago George Crump founded Storage Switzerland with one simple goal: to educate IT professionals about all aspects of data center storage. He is the primary contributor to Storage Switzerland and a heavily sought-after public speaker. With over 25 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS and SAN, virtualization, cloud and enterprise flash. Prior to founding Storage Switzerland he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.
