During a disaster, a lot can go wrong, and IT needs to expect the unexpected. In most cases, though, it is not the unexpected that causes a disaster recovery (DR) effort to fail. The primary reasons that disaster recoveries fail are a lack of documentation and a lack of testing. The need to prepare is more critical than ever because, thanks to cyber-attacks like ransomware, there is no such thing as a geographically safe data center. IT needs to continuously test various disaster scenarios and keep the DR process documented.
The Criticality of Documentation
Documentation is vital to the DR process. IT should aim for documentation so clear and concise that any IT professional, including one from outside the organization, can follow it and recover critical systems quickly. Outside IT personnel may be necessary if the organization’s core team is unable to make it to the DR site. The DR documentation should also carefully note application interdependencies and establish a recovery order so that when IT recovers the applications themselves, all the components they need to operate correctly are already available.
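One way to keep the recovery order consistent with the documented interdependencies is to derive it from them rather than maintain it by hand. The following is a minimal sketch, assuming the dependencies are recorded as a simple mapping; the application names and the `recovery_order` helper are hypothetical and stand in for whatever the organization actually runs.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+

# Hypothetical dependency map: each application lists what must be running first.
DEPENDENCIES = {
    "dns": [],
    "directory": ["dns"],
    "database": ["directory"],
    "app-server": ["database", "directory"],
    "web-frontend": ["app-server"],
}


def recovery_order(dependencies):
    """Return an order in which every dependency is recovered before its dependents."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency in DR plan: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, name in enumerate(recovery_order(DEPENDENCIES), start=1):
        print(f"{step}. recover {name}")
```

A circular dependency raises an error instead of producing a misleading order, which is the kind of problem that is better found while writing the documentation than during a recovery.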
DR Documentation Needs to Live
In most cases, DR documentation is written once and never updated again. The problem is that with each change made to the environment, the original DR documentation becomes less accurate. IT needs to make updating the DR documentation a continuous part of its change control workflow.
The Criticality of Testing
Most IT professionals won’t argue the need for testing, but most will claim they don’t have the time and resources to test as frequently as they should. If a formal DR testing schedule exists at all, it often calls for tests once or twice per year, which isn’t frequent enough given the typical rate of change in an organization’s data center. When a large-scale disaster strikes, tensions are high, and the disaster may be affecting IT personnel personally. Testing develops muscle memory so that applications and data sets can return to service even if IT is distracted. The frequency of testing needs to increase from once or twice per year to once or twice per month.
Documentation and Testing Work Together
DR tests often surface failures. IT needs to document why those failures occur and how to correct them, then add that information to the DR documentation. If documentation is continuously updated, IT can improve DR times with each successive test while also reducing the number of surprises. Updating documentation during DR testing keeps the same “surprise” from recurring on successive DR tests.
DR Documentation and Testing Versus Reality
The goals of DR documentation and testing are noble, but unfortunately, reality eventually sets in, and IT is left doing the best it can with available resources. Documentation is usually the first casualty because keeping it up to date is a time-consuming and tedious task. There is also seldom anyone specifically assigned to maintaining documentation. The next casualty is testing. An initially zealous testing plan of once or twice per month usually falls victim to other, more pressing tasks; after all, there is no apparent disaster on the horizon. With outdated documentation and only occasional testing, most disaster recovery efforts eventually devolve into a series of best efforts, where in a real disaster, IT needs to scramble to snatch victory from the jaws of defeat.
Using Orchestration to Overcome DR Documentation and Testing Realities
Given the harshness of reality, IT needs to look to technology to help it continue to meet the DR needs of the organization and to make sure that when disaster strikes, recovery looks like a logical process instead of a fire drill. A DR automation system enables IT to set specific workflows for the disaster recovery process. Administrators can link in application interdependencies and make sure that recoveries happen in the correct order. The software can also perform soft tests of the DR process and alert IT administrators to potential problems before they spend time on a full test or go through an actual disaster.
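To illustrate what a soft test might check, here is a minimal sketch that validates a recovery workflow without restoring anything. The `Step` structure, its fields, and the `soft_test` function are assumptions made for this example and do not describe any particular orchestration product.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recovery step in a hypothetical DR workflow."""
    name: str
    backup_source: str                      # illustrative label, e.g. "offsite-copy"
    depends_on: list = field(default_factory=list)


def soft_test(workflow):
    """Return a list of problems found by inspecting the workflow definition only."""
    problems = []
    known = {step.name for step in workflow}
    already_recovered = set()
    for step in workflow:
        if not step.backup_source:
            problems.append(f"{step.name}: no backup source defined")
        for dep in step.depends_on:
            if dep not in known:
                problems.append(f"{step.name}: depends on unknown step '{dep}'")
            elif dep not in already_recovered:
                problems.append(f"{step.name}: runs before its dependency '{dep}'")
        already_recovered.add(step.name)
    return problems
```

An empty list means the definition is internally consistent; it does not prove the restores themselves will succeed, which is why soft tests complement full DR tests rather than replace them.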
With automation, the DR process becomes a literal push-button process. Once the DR button is pressed, the orchestration solution executes the scripted workflow. During a test, if an error occurs or if a change requires reprioritizing the application recovery order, IT merely makes changes to the orchestration workflow. Because the workflow itself describes the recovery, the DR documentation is updated automatically as well.
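The idea that documentation follows from the workflow can be sketched as a runbook generated directly from the workflow definition. This is a hypothetical illustration only; the dictionary keys and the `render_runbook` function are invented for the example and are not any vendor's format.

```python
def render_runbook(workflow, revision):
    """Render a plain-text runbook from an ordered list of workflow steps."""
    lines = [f"DR runbook (workflow revision {revision})"]
    for number, step in enumerate(workflow, start=1):
        # Steps with no prerequisites are documented as requiring "nothing".
        needs = ", ".join(step["depends_on"]) or "nothing"
        lines.append(
            f"{number}. Recover {step['name']} from {step['source']} (requires: {needs})"
        )
    return "\n".join(lines)


# Hypothetical two-step workflow used only to show the shape of the input.
EXAMPLE_WORKFLOW = [
    {"name": "database", "source": "replica", "depends_on": []},
    {"name": "app-server", "source": "nightly backup", "depends_on": ["database"]},
]

if __name__ == "__main__":
    print(render_runbook(EXAMPLE_WORKFLOW, revision=7))
```

Regenerating the runbook whenever the workflow changes keeps the written procedure and the executed procedure from drifting apart.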
DR Orchestration Enables Scale
A disaster does not consist of one recovery; it is a series of potentially hundreds of restorations, depending on application criticality, disaster type, and the length of time that the primary data center is unavailable. The many recoveries that need to occur, along with the permutations of restorations based on the type of disaster, are almost impossible to keep track of manually, let alone keep up to date. DR orchestration enables a small number of IT personnel to initiate recovery at scale from a wide variety of backup storage types.
DR documentation and DR testing are still fundamental elements of a successful disaster recovery process. A lack of time or resources doesn’t make them any less critical. In fact, it makes them more so. IT needs to leverage technology to assist it in meeting DR requirements so that recovering from a disaster is an organized, tested, and documented workflow instead of a fire drill.