One of the benefits of virtualization has been the ability to recover applications and servers more rapidly. Recently, some backup software vendors have added the ability to recover a virtual machine directly from the backup image, a technique commonly called recovery in place. While this concept sounds ideal, it is not without its challenges, and an alternative recovery strategy like changed block recovery may be a better option.
What is Recovery In Place?
Almost all virtual machine (VM) specific backup solutions use disk as the primary backup target. This allows them to complete backup jobs of VMs very quickly. Many of these applications can also perform changed block backups, where only the blocks of data that have changed since the last backup are transferred to the disk device, making network backups even faster.
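The idea behind a changed block backup can be sketched in a few lines of Python. This is only an illustration: the 4-byte block size, the function names, and the use of SHA-256 fingerprints to detect change are all assumptions made for the example, and real products may detect changed blocks differently, for instance by relying on change tracking provided by the hypervisor.

```python
import hashlib

BLOCK_SIZE = 4  # bytes per block; tiny on purpose so the example stays readable


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a disk image into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_digests(data: bytes) -> list[str]:
    """Fingerprint every block so blocks can be compared cheaply."""
    return [hashlib.sha256(block).hexdigest() for block in split_blocks(data)]


def changed_block_backup(disk: bytes, last_digests: list[str]) -> dict[int, bytes]:
    """Return only the blocks whose fingerprint differs from the last backup."""
    changed = {}
    for index, block in enumerate(split_blocks(disk)):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(last_digests) or digest != last_digests[index]:
            changed[index] = block
    return changed


monday = b"AAAABBBBCCCCDDDD"   # hypothetical disk contents at the last backup
tuesday = b"AAAAXXXXCCCCDDDD"  # only the second block has changed since then

increment = changed_block_backup(tuesday, block_digests(monday))
print(increment)  # {1: b'XXXX'}
```

Only one of the four blocks crosses the network in this example, which is where the speed of a changed block backup comes from.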
Many of these VM specific backup solutions have built on this configuration to create a recovery in place feature, which enables a VM to be restarted directly from the backup target. Data does not need to be transferred back to the original host. The typical process has three steps. First, the backup target device is mounted as a VMware data store. Next, the VM image on that disk is snapshotted to preserve the backup data. Finally, the VM is launched and services begin.
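The reason for the snapshot step is easier to see in a small model. The sketch below is a simplified, in-memory illustration and not any vendor's implementation; the class names `BackupImage` and `SnapshotOverlay` are hypothetical. Writes made by the running VM land in a separate delta, so the backup copy itself is never modified.

```python
class BackupImage:
    """Read-only blocks held on the backup target (hypothetical model)."""

    def __init__(self, blocks: list[bytes]):
        self._blocks = tuple(blocks)  # immutable: the backup must not change

    def read(self, index: int) -> bytes:
        return self._blocks[index]


class SnapshotOverlay:
    """Copy-on-write layer: writes land here, reads fall through to the backup."""

    def __init__(self, base: BackupImage):
        self.base = base
        self.delta: dict[int, bytes] = {}

    def write(self, index: int, block: bytes) -> None:
        self.delta[index] = block

    def read(self, index: int) -> bytes:
        if index in self.delta:
            return self.delta[index]
        return self.base.read(index)


# 1. "Mount" the backup target, 2. snapshot the image, 3. run the VM on the overlay.
backup = BackupImage([b"boot", b"apps", b"data"])
running_vm = SnapshotOverlay(backup)
running_vm.write(2, b"new!")

print(running_vm.read(2))  # b'new!'
print(backup.read(2))      # b'data'
```

Every read that misses the delta is served by the backup device, which is why the I/O characteristics of that device matter once the VM is running on it.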
The Challenges with Recovery In Place
Recovery in place sounds like an ideal solution for a failed VM or corrupted data set, but there are problems with it that the backup manager needs to be aware of. First, disk backup appliances (the most common VM backup target) were not designed to host VM data. Backup jobs are typically very large sequential write workloads, whereas VMs typically exhibit small and very random I/O patterns. Also, backup jobs must continue for the rest of the environment, so the performance of the recovered VM can be further affected by those inbound jobs.
The second problem is the “fail-back” of the VM. Eventually, the goal will be to return the VM to its primary storage location. VMware does an excellent job of this data movement, via its Storage vMotion feature, by migrating the VM in real time from secondary storage to primary storage. The problem is that while this transfer is taking place, performance will once again suffer. The performance impact is further compounded because the entire VM needs to be transferred over to primary storage, not just the parts of the VM that failed.
What Fails when a VM Fails?
When a VM fails, it is rare that the entire VM becomes corrupted. When that does happen, the cause is typically something more severe, like a site failure or disk array failure, and a recovery in place solution would not be suitable at all. Tools like Site Recovery Manager and storage based replication are better suited to these large scale return to operation situations.
What typically fails when one VM fails is not the VM itself or even the operating system within the VM. It is typically the data within that VM. A good example is a database corruption or application code failure. And usually it is not the entire database or code that needs to be replaced but a subsection of it. In this case, recovering the whole VM or forcing users to log in to a new VM may be overkill and may be more time consuming than just recovering the component that failed. Certainly the effort required in deciding when to fail-back and managing users through that process is significant.
In data protection it is ideal to only have to recover what needs to be replaced. Anything more than that is wasting time and resources. Vendors like EMC with their Avamar product are taking technology similar to changed block backups and applying it to recovery, essentially creating changed block recovery.
What is Changed Block Recovery?
Recovery requests typically come in the form of “Recover this VM to how it looked at this point in time”. Changed block recovery is the process of analyzing which blocks of data need to be recovered in order to meet the recovery request. Only the components of the VM that have changed since that point in time need to be recovered.
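A minimal sketch of that analysis follows. It assumes the VM's current blocks and the blocks of the requested point in time can be compared directly; the function name and sample data are hypothetical, and a real product would more likely compare fingerprints or change records than raw blocks.

```python
def changed_block_recovery(
    current: list[bytes], target: list[bytes]
) -> tuple[list[bytes], list[int]]:
    """Bring `current` back to `target`, writing only the blocks that differ.

    Returns the restored blocks and the indexes that had to be rewritten.
    """
    restored = list(current)
    rewritten = []
    for index, wanted in enumerate(target):
        if index >= len(restored):
            restored.append(wanted)    # block is missing on the live disk
            rewritten.append(index)
        elif restored[index] != wanted:
            restored[index] = wanted   # block changed after the point in time
            rewritten.append(index)
    del restored[len(target):]  # drop blocks that did not exist at that point in time
    return restored, rewritten


point_in_time = [b"boot", b"apps", b"db-1", b"logs"]  # state held in the backup
corrupted_vm = [b"boot", b"apps", b"????", b"logs"]   # one damaged block

restored, rewritten = changed_block_recovery(corrupted_vm, point_in_time)
print(rewritten)                  # [2]
print(restored == point_in_time)  # True
```

In this example three of the four blocks are left untouched on primary storage, and only the damaged block is written.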
Similar to changed block backups, the amount of data to be transferred is greatly reduced with changed block recovery. However, because this is recovery, the technique may be even more valuable. First, disk writes are generally slower than disk reads, so the fewer writes performed, the sooner the server can return to operation. Second, there is more time pressure on a recovery than a backup; users are “standing around waiting”.
The Changed Block Recovery Advantage
Changed block recovery has two significant advantages over recovery in place. First, the recovery is made directly to the storage system the VM is designed to run on. There is no secondary fail-back process that needs to be performed. When recovery is complete, the job is done. Second, only the data that’s needed to bring the VM to the right point in time is transferred, so there is a minimal amount of data movement.
So while changed block recovery may take a few minutes longer than restarting the VM in place, once that recovery is done, it is done. In the recovery in place model, the entire VM has to be transferred back, and until it is, performance is likely to be impacted.
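The difference in data movement can be put into rough numbers. The figures below (a 200 GB VM, 4 GB of changed blocks, 100 MB/s of sustained throughput) are purely illustrative assumptions, not measurements, and the calculation covers only transfer time, not the other work involved in either approach.

```python
def transfer_seconds(gigabytes: float, mb_per_second: float) -> float:
    """Time to move a given amount of data at a sustained throughput."""
    return gigabytes * 1024 / mb_per_second


VM_SIZE_GB = 200       # assumed size of the whole VM
CHANGED_GB = 4         # assumed amount of data that actually needs restoring
THROUGHPUT_MB_S = 100  # assumed sustained throughput to primary storage

# Recovery in place: the whole VM must eventually be moved back during fail-back.
fail_back = transfer_seconds(VM_SIZE_GB, THROUGHPUT_MB_S)

# Changed block recovery: only the changed blocks are written to primary storage.
changed_only = transfer_seconds(CHANGED_GB, THROUGHPUT_MB_S)

print(fail_back)     # 2048.0 seconds, roughly 34 minutes
print(changed_only)  # 40.96 seconds
```

Under these assumptions the fail-back moves fifty times as much data, and the VM runs with reduced performance for the whole of that transfer.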
Recovery in place is a valuable and attractive option for data centers with virtualized servers but, like any other technology, it is not without its weaknesses. Changed block recovery addresses many of these weaknesses with potentially less downtime and less chance for recovery errors.
EMC Data Domain is a client of Storage Switzerland