Solving DR’s Two Biggest Failures – Documentation and Testing with Orchestration

Posted on September 24, 2019 by George Crump

During a disaster, there is a lot that can go wrong, and IT needs to expect the unexpected. In most cases, though, it is not the unexpected that causes a disaster recovery (DR) effort to fail. The primary reasons that disaster recoveries fail are a lack of documentation and a lack of testing. The need to prepare is more critical than ever because, thanks to cyber-attacks like ransomware, there is no such thing as a geographically safe data center. IT needs to continuously test various disaster scenarios as well as document the DR process.

The Criticality of Documentation

Documentation is vital to the DR process. IT should aim to create documentation so clear and concise that any IT professional, including one from outside the organization, can follow it and recover critical systems quickly. Outside IT personnel may be necessary if the organization’s core team is unable to make it to the DR site. The DR documentation should also carefully note application interdependencies and establish a recovery order so that when IT recovers the applications themselves, all the components they need to operate correctly are already available.
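As a rough illustration of how documented interdependencies translate into a recovery order, the sketch below uses a topological sort from Python's standard library. The application names and their dependencies are hypothetical, invented purely for this example; a real inventory would come from the organization's own DR documentation.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each application maps to the components
# that must already be running before it can be recovered.
dependencies = {
    "dns": [],
    "database": ["dns"],
    "auth": ["dns", "database"],
    "web_app": ["auth", "database"],
}


def recovery_order(deps):
    """Return applications ordered so every dependency is recovered first."""
    # static_order() raises graphlib.CycleError if the documented
    # dependencies are circular, which is itself a documentation bug.
    return list(TopologicalSorter(deps).static_order())


order = recovery_order(dependencies)
print(order)
```

With this particular inventory only one valid order exists, so the database is always recovered before the applications that rely on it.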

DR Documentation Needs to Live

In most cases, DR documentation is written once and never updated again. The problem is that with each change made to the environment, the original DR documentation becomes less accurate. IT needs to make updating the DR documentation a continuous part of its change control workflow.

The Criticality of Testing

Most IT professionals won’t argue against the need for testing, but most will claim they don’t have the time and resources to test as frequently as they should. If a formal DR testing schedule exists at all, it is often once or twice per year, which isn’t frequent enough given the typical rate of change in an organization’s data center. When a large-scale disaster strikes, tensions are high, and the disaster may be affecting IT personnel personally. Testing develops muscle memory so that applications and data sets can return to service even if IT is distracted. The frequency of testing needs to increase from once or twice per year to once or twice per month.

Documentation and Testing Work Together

DR tests often have failures within them. IT needs to document why those failures occur and how to correct them, then add that information to the DR documentation. If documentation is continuously updated, then IT can improve DR times with each successive test while also reducing the number of surprises. Updating documentation during DR testing eliminates having the same “surprise” occur on successive DR tests.

DR Documentation and Testing Versus Reality

The goals of DR documentation and testing are noble, but unfortunately, reality eventually sets in, and IT is left doing the best it can with available resources. Documentation is usually the first casualty, as keeping it up to date is a time-consuming and tedious task. There is also seldom anyone specifically assigned to maintaining documentation. The next casualty is testing. An initially zealous testing plan of once or twice per month usually falls victim to other, more pressing tasks; after all, there is no apparent disaster on the horizon. As a result of outdated documentation and only occasional testing, most disaster recovery efforts eventually devolve into a series of best efforts, where in the case of a real disaster, IT needs to scramble to pull victory from the jaws of defeat.

Using Orchestration to Overcome DR Documentation and Testing Realities

Given the harshness of reality, IT needs to look to technology to help it continue to meet the DR needs of the organization and to make sure that when disaster strikes, recovery looks like a logical process instead of a fire drill. A DR orchestration system enables IT to set specific workflows for the disaster recovery process. Administrators can link in application interdependencies and make sure that recoveries happen in the correct order. The software can also perform soft tests of the DR process and alert IT administrators to potential problems before they spend time testing or going through an actual disaster.
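To make the idea of a soft test more concrete, here is a minimal sketch of a pre-flight check that inspects a workflow definition without recovering anything. It is not taken from any particular orchestration product: the `RecoveryStep` fields, the 24-hour backup-age limit, and the step names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class RecoveryStep:
    """One step of a hypothetical DR workflow."""
    name: str
    depends_on: list = field(default_factory=list)
    backup_age_hours: float = 0.0       # age of the newest recoverable copy
    max_backup_age_hours: float = 24.0  # assumed limit, for illustration only


def soft_test(steps):
    """Return a list of problems found without performing any recovery."""
    problems = []
    known = {step.name for step in steps}
    for step in steps:
        # A dependency that is not itself in the workflow cannot be recovered.
        for dep in step.depends_on:
            if dep not in known:
                problems.append(f"{step.name}: unknown dependency '{dep}'")
        # A backup older than the limit would not meet the recovery objective.
        if step.backup_age_hours > step.max_backup_age_hours:
            problems.append(
                f"{step.name}: newest backup is {step.backup_age_hours:g}h old "
                f"(limit {step.max_backup_age_hours:g}h)"
            )
    # Circular dependencies make a valid recovery order impossible.
    graph = {step.name: step.depends_on for step in steps}
    try:
        list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        problems.append("circular dependency: " + " -> ".join(exc.args[1]))
    return problems


steps = [
    RecoveryStep("database", backup_age_hours=30),
    RecoveryStep("app_server", depends_on=["database", "message_queue"]),
]
for problem in soft_test(steps):
    print(problem)
```

Here the check would flag the stale database backup and the undocumented `message_queue` dependency before anyone spends time on a full test.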

How DR Orchestration Can Improve Success while Lowering Costs

With automation, the DR process becomes a literal push-button process. Once the DR button is pressed, the orchestration solution executes the workflow. During a test, if an error occurs or if a change requires reprioritizing the application recovery order, IT merely makes changes to the orchestration workflow. In doing so, the DR documentation is also automatically updated.
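One way to picture documentation that updates itself is to treat the workflow definition as the single source of truth and render the runbook from it. The sketch below assumes a simple dictionary format for the workflow; the step names and actions are hypothetical, and actual orchestration products will have their own formats.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: editing this structure is the only
# change IT makes; the runbook text is regenerated from it.
workflow = {
    "database": {"depends_on": [], "action": "Restore latest replica at DR site"},
    "app_server": {"depends_on": ["database"], "action": "Boot VM and verify DB connection"},
    "web_frontend": {"depends_on": ["app_server"], "action": "Start service and update DNS"},
}


def render_runbook(workflow):
    """Render numbered, human-readable recovery steps in dependency order."""
    # Assumes every dependency is itself defined as a step in the workflow.
    graph = {name: step["depends_on"] for name, step in workflow.items()}
    lines = []
    for number, name in enumerate(TopologicalSorter(graph).static_order(), start=1):
        step = workflow[name]
        requires = ", ".join(step["depends_on"]) or "nothing"
        lines.append(f"{number}. {name}: {step['action']} (requires: {requires})")
    return "\n".join(lines)


runbook = render_runbook(workflow)
print(runbook)
```

Because the runbook is derived from the same definition the orchestration executes, a reprioritized recovery order shows up in the documentation the next time it is rendered.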

DR Orchestration Enables Scale

A disaster does not consist of one recovery; it is a series of potentially hundreds of restorations, depending on application criticality, disaster type, and the length of time that the primary data center is unavailable. Managing the many recoveries that need to occur, along with the permutations of restorations based on the type of disaster, is almost impossible to keep track of manually, let alone keep updated. DR orchestration enables a small number of IT personnel to initiate recovery at scale from a wide variety of backup storage types.

Conclusion

DR documentation and DR testing are still fundamental elements of a successful disaster recovery process. A lack of time or resources doesn’t make them any less critical. In fact, it makes them more so. IT needs to leverage technology to assist them in meeting DR requirements so that recovering from a disaster is an organized, tested and documented workflow instead of a fire drill.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
