Disaster Recovery Planning (DRP) doesn’t need to be the lengthy, complicated ordeal many make it out to be. In fact, DRP needs to become a more nimble process that is completed quickly, easily updated and frequently tested. The first step in creating a DRP that works in the real world is not planning – it is assessing. While most planners will try to create an inventory of the applications that need to be part of the DRP, they miss an important step in the pre-plan process: assessing what the organization’s DR capabilities are. They need to perform a DR gap analysis.
There are essentially three abilities needed to execute a DRP: capture, movement and recovery. An organization needs a solution for each, typically a combination of hardware and software. Most organizations have existing solutions that will meet, at some level, those needs. IT planners need to inventory those existing solutions and assess their ability to meet the expectations and requirements of the organization. The DR gap analysis is the measurement of where those existing solutions can get the organization, what is potentially possible with new solutions and where the organization needs to or expects to be.
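To make the gap analysis concrete, here is a minimal sketch in Python. The 0-10 scoring scale, the `Capability` class, the `gap_report` function and all of the sample scores are hypothetical, invented purely for illustration; an organization would substitute whatever measures it actually uses.

```python
from dataclasses import dataclass


@dataclass
class Capability:
    """One of the three DRP abilities, scored on a hypothetical 0-10 scale."""
    name: str
    existing: int    # what the current solutions deliver today
    achievable: int  # what new solutions could plausibly deliver
    required: int    # what the organization needs or expects


def gap_report(capabilities):
    """Return (name, gap with existing tools, gap left after new tools)."""
    report = []
    for cap in capabilities:
        current_gap = max(0, cap.required - cap.existing)
        residual_gap = max(0, cap.required - cap.achievable)
        report.append((cap.name, current_gap, residual_gap))
    return report


# Sample scores are made up for the example.
capabilities = [
    Capability("capture", existing=4, achievable=8, required=7),
    Capability("movement", existing=6, achievable=9, required=6),
    Capability("recovery", existing=3, achievable=6, required=8),
]

for name, current_gap, residual_gap in gap_report(capabilities):
    print(f"{name}: gap today={current_gap}, gap after new solutions={residual_gap}")
```

In this made-up example, movement already meets the requirement, capture can be closed with new tooling, and recovery still falls short even after an upgrade, which is the kind of finding that should trigger a conversation about expectations.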
A popular data protection vendor’s marketing phrase is, “it’s all about recovery.” But the truth is that recovery can’t happen if there is no data capture or if that capture isn’t happening frequently enough to meet the demands of the organization. Data capture tools include traditional once-a-night backup applications, snapshot-replication solutions, and modern backup applications with block-level incremental backup capabilities.
Each of the data capture methods has different capabilities in terms of how frequently it can capture information and how long it can retain it. For example, traditional backup typically protects data only once or twice per day but has very long retention capabilities. Modern, block-level incremental backup solutions can capture data multiple times per day and can retain information for a reasonable time before requiring another full backup. Snapshot-replication solutions can capture data almost continuously and instantly but have relatively short retention capabilities.
Each of the data capture methods has a cost associated with it, both in the software that drives the process and in the hardware that supports it. Typically, traditional backup is the least expensive and snapshot-replication the most, with modern backup somewhere in the middle.
Planners need to assess their existing capture capabilities for each of these data capture types (they may not have all three) and set service capabilities in terms of how much data each solution can capture and how frequently it can capture it.
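One way to express capture frequency as a service capability is the worst-case window of data at risk between captures. The sketch below assumes evenly spaced captures; the `inventory` figures, the capacities and the function names are hypothetical and only illustrate the arithmetic.

```python
def worst_case_loss_hours(captures_per_day: float) -> float:
    """Longest gap between evenly spaced captures, in hours."""
    if captures_per_day <= 0:
        raise ValueError("captures_per_day must be positive")
    return 24.0 / captures_per_day


# Hypothetical inventory: method -> (captures per day, protected capacity in TB)
inventory = {
    "traditional backup": (1, 50.0),
    "block-level incremental backup": (6, 20.0),
    "snapshot-replication": (96, 5.0),
}


def meets_requirement(method: str, max_loss_hours: float) -> bool:
    """True if the method's worst-case capture gap fits the stated tolerance."""
    captures_per_day, _capacity_tb = inventory[method]
    return worst_case_loss_hours(captures_per_day) <= max_loss_hours


for method, (captures_per_day, capacity_tb) in inventory.items():
    hours = worst_case_loss_hours(captures_per_day)
    print(f"{method}: up to {hours:g} hours of data at risk, {capacity_tb:g} TB protected")
```

Comparing each result with what an application owner says they can tolerate shows quickly which applications belong on which capture method.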
The next capability to assess is the organization’s ability to move data. IT planners should consider two points of movement. First is the movement of data to a system at a disaster recovery site. That DR site needs to be far enough away from the primary site that the same disaster will not impact both sites. The secondary site could be a site the organization owns, space it rents from a co-location facility or capacity it rents from a cloud provider.
The distance requirement also means that how data moves to that site becomes critical. If data will move electronically, then it has to be replicated asynchronously to the second site, since the latency over that distance generally rules out synchronous replication. That means the capture solution needs to be able to identify data that is newly added or changed and move only that data.
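As an illustration of identifying new or changed data, the sketch below compares per-block hashes of the previous and current copies and returns only the blocks that differ. This is just one possible approach and not a description of any particular product; many solutions track changes as they happen rather than comparing hashes. The block size and function names are assumptions chosen for readability.

```python
import hashlib

BLOCK_SIZE = 4  # tiny for illustration; real systems use far larger blocks


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """SHA-256 digest of each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Map block index -> new contents for blocks that were added or changed.

    Blocks removed from the end of the data are not reported in this sketch.
    """
    old_hashes = block_hashes(old, block_size)
    changes = {}
    for index, digest in enumerate(block_hashes(new, block_size)):
        if index >= len(old_hashes) or old_hashes[index] != digest:
            start = index * block_size
            changes[index] = new[start:start + block_size]
    return changes


previous = b"AAAABBBBCCCC"
current = b"AAAAXXXXCCCCDDDD"
print(changed_blocks(previous, current))  # {1: b'XXXX', 3: b'DDDD'}
```

Only two of the four blocks would cross the link to the DR site here, which is what makes replication over a long-distance connection practical.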
The other capability is a secondary storage system that is on-premises. Most disasters are not the massive natural disasters that make headlines. Instead, they are smaller, typically impacting only the organization itself. For example, failure of the primary storage system could cause the organization’s applications to be shut down, or a ransomware attack could cause all of the organization’s data to be encrypted and locked. A secondary system allows for rapid recovery without having to fail over to the disaster recovery site.
Restoration is the acid test of a DRP: will all the planning and protection steps work? Restoration has two parts. The first is initial recovery from a disaster and the second is fail-back, which is how an organization recovers back to its primary data center when the time is right. Restoration capabilities are an aspect of the capture solutions.
For example, snapshot-replication solutions typically make live, native copies of data, so no data transformation or even restoration is required. Most modern backup solutions, because of the way changed-block incremental backup stores data, can create a recovery volume directly on the backup system; there is a minor transformation required but, again, no data movement. Most legacy backup solutions store data in an optimized format to reduce storage costs. They need to not only transform data but also restore that data to another storage system.
The fail-back process depends on the state of the original primary data center and how long the organization has been operating from the secondary site. It is also important to remember that while the organization is at the DR site, the data protection (capture) process must start all over again, with protected copies being moved somewhere other than the DR site.
Generally, the original primary site will be in one of two states: a total loss with no data available, or in perfect shape because the disaster never actually impacted the data center. In the first case, deciding what data to send back to the primary site is easy. It all has to be sent back. What’s more difficult is how to send all that data back. In most cases, it will require a bulk transport of data via plane or truck.
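A quick calculation shows why bulk transport often wins for a full fail-back. The sketch below estimates how long a network transfer would take and compares it with a shipping time. The 80% link efficiency, the 72-hour shipping figure and the function names are assumptions for illustration only.

```python
def network_transfer_hours(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Hours to send data_tb (decimal terabytes) over a link_mbps (megabits/s) link."""
    if data_tb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed or efficiency")
    bits = data_tb * 1e12 * 8                       # terabytes -> bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600     # seconds -> hours


def prefer_bulk_transport(data_tb: float, link_mbps: float, shipping_hours: float) -> bool:
    """True when physically shipping the data is faster than sending it over the link."""
    return network_transfer_hours(data_tb, link_mbps) > shipping_hours


# Hypothetical fail-back: 100 TB over a 1 Gbps link versus a 72-hour shipment.
hours = network_transfer_hours(100, 1000)
print(f"Network transfer: {hours:.0f} hours ({hours / 24:.1f} days)")
print("Ship it" if prefer_bulk_transport(100, 1000, 72) else "Send it over the wire")
```

Under these assumptions the network transfer takes well over a week, so the truck wins. With only a terabyte or so to return, the same calculation favors the wire.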
In the second case, the data center is intact. The only data that needs to be sent back is the data that changed while the organization was operating out of the secondary site. The problem is identifying the changed data and moving it. Most snapshot-replication options can reverse the process, scan what is at the original data center and update it with changed data. Most backup applications, modern or traditional, do not have this capability. The problem is that the secondary site, while it is running, becomes a separate backup instance that does not know of the first site. The organization has to be very careful in the execution of backups at the secondary site, be prepared to move data back to the primary site manually, or completely recover the primary site even though most of the data is intact.
Each of the required DRP abilities, as well as the capture methods, deserves a more detailed discussion. We will go into that detail in future columns. We also cover the complete DRP process in our Disaster Recovery Planning Workshops. By attending these workshops, you will learn the DRP system that we’ve used around the world to help organizations meet and keep up with their disaster recovery requirements. They are “vendor-free,” meaning you don’t have to sit through a bunch of vendor presentations in between learning our disaster recovery techniques.