A Product Analysis of Dell Rapid Recovery
It’s time for a new type of backup and recovery because the old ways of backup simply do not meet the current needs of many businesses. The problem starts with the inefficiency of how backups are typically done, and ends with how those backups are completely unable to meet modern-day recovery requirements.
Consider the typical backup system that backs up each server at an appointed time each night. If it’s an incremental backup, the client backs up the entire contents of every file or database record that a user has created or modified since the last backup. The amount of data that actually changes each night is typically less than 1 percent, yet the size of most incremental backups is more like 10 to 20 percent. This is because traditional backup agents only know how to back up an entire file or database record if any part of that file or record changes. (Change one letter in one word in a 150MB PowerPoint file, and the entire 150MB gets backed up.)
Then there is the incredible inefficiency of the occasional full backup. While some backup systems do not perform full backups on unstructured data (e.g. files), all traditional backup systems perform occasional full backups on structured data (e.g. databases). Unfortunately, 80 to 90 percent of what is in a full backup is already in the backup system. It is simply copying the same data again because it knows no other way.
Typical incremental backups and the occasional full backup place a significant I/O burden on the clients being backed up, the networks that transfer the data, and the servers and storage systems that hold and process it. Backing up this way also creates an incredible amount of duplicate data. It is estimated that for every gigabyte of primary data, there are 20 gigabytes of backup copies. This is why the entire deduplication industry exists. Without this incredible level of inefficiency, there would be no reason for deduplication.
The real inefficiencies come, however, when one looks at the typical recovery process. Consider a server whose storage array suffers a catastrophic failure. If the identification of the problem requires an onsite visit from the storage vendor, things are going to take a while. While the customer may have a four-hour response time with the vendor, that typically only addresses the initial response. It may be many more hours before the storage array is actually repaired, and it may be even longer if the repair kicks off a rebuild of a large RAID array.
Several hours later, the system administrator can finally start the actual restore, which could take a few minutes, several hours, or even a day or more. Assuming the restore is successful, the system can finally be put back into commission. If the restore doesn’t work, the system administrator has to restart the entire restore.
The first issue with this recovery scenario is the company’s recovery time objective, or RTO, which is how long the company says an application can be down. Considering the time it takes to diagnose and repair the problem, possibly rebuild the RAID array and perform the actual restore, the total downtime will typically be far longer than any agreed-upon RTO.
The second, and perhaps bigger, issue is the company’s recovery point objective, or RPO, which measures how much data can be lost in a recovery and is usually expressed in hours. The smallest RPO that can be met with nightly backup is 24 hours, but the reality can be closer to 96 hours. Consider what happens if a fire strikes Monday afternoon and destroys a server and its backup tapes before they are taken offsite. The latest available offsite backups were created Thursday night! Assuming that backup worked, the company will lose all data created on Friday, Saturday, Sunday, and most of Monday. If Thursday night’s backups didn’t work, the company could lose more than 96 hours of data. It is likely that many companies’ RPOs are much shorter than 24 to 96 hours. The IT industry is rife with stories of people who have lost their jobs due to the inability to recover.
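The arithmetic behind the fire scenario above can be sketched in a few lines of Python. The dates and times are assumptions chosen only so that the weekdays line up (a 22:00 backup on a Thursday and a 15:00 fire the following Monday), and `data_loss_window` is a hypothetical helper, not part of any backup product.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_backup: datetime, failure: datetime) -> timedelta:
    """Return how much work is lost when restoring from last_good_backup."""
    if failure < last_good_backup:
        raise ValueError("failure time precedes the backup time")
    return failure - last_good_backup


# Assumed calendar: Thursday 3 March 2016 at 22:00 and Monday 7 March 2016 at 15:00.
last_offsite_backup = datetime(2016, 3, 3, 22, 0)
fire = datetime(2016, 3, 7, 15, 0)

lost_hours = data_loss_window(last_offsite_backup, fire).total_seconds() / 3600

# If Thursday's backup failed, the newest usable copy is from Wednesday night.
wednesday_backup = last_offsite_backup - timedelta(days=1)
lost_hours_if_thursday_failed = (
    data_loss_window(wednesday_backup, fire).total_seconds() / 3600
)

print(f"Lost with Thursday's backup:  {lost_hours:.0f} hours")
print(f"Lost if Thursday's failed:    {lost_hours_if_thursday_failed:.0f} hours")
```

Under these assumed times the loss is 89 hours with a good Thursday backup and 113 hours without one, which is in line with the figures quoted above and far beyond an RPO of a few hours.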
What would happen if drivers treated flat tires the way IT treats servers? To start, they would not carry a spare. They would pull over to the side of the road, see that they have a flat and then call roadside assistance. They would then wait a few hours for roadside assistance to come by, after which the roadside assistance representative would verify that there is a flat. However, the representative might not have a tire that fits the car in his truck, resulting in another delay. Once another roadside assistance truck arrives and puts on the tire, it would be up to the driver of the car to put air into the tire using their own hand pump. Many, many hours later, the car would finally be on its way with a new tire. How ridiculous does this sound? This is how typical restore operations happen.
This doesn’t have to be the case. Dell Data Protection | Rapid Recovery, formerly known as AppAssure, creates one full backup, and from that point on everything is a block-level incremental, meaning that each backup contains only the blocks that changed since the last backup. In addition, it creates backups throughout the day, as often as every five minutes. Instead of backing up 10 to 20 percent of the data every day, this system transfers and stores less than one percent of the total size of the backup data each day. In addition to making backups much easier on the backup client, backup network, backup server, and backup storage, backing up throughout the day also allows for a very tight RPO. It is possible to restore servers to the latest backup, taken just minutes before an outage, instead of 12 to 24 hours before.
Rapid Recovery integrates with the Windows Volume Shadow Copy Service (VSS), the supported way to back up applications in a Windows environment. Before Rapid Recovery takes a snapshot, its Rapid Snap for Applications technology interfaces with VSS, which in turn tells each VSS-supported application to bring its data to a consistent state before the snapshot is taken. This ensures that when the Dell Rapid Recovery snapshot is taken, it is application-consistent.
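To make the difference between file-level and block-level incrementals concrete, here is a minimal Python sketch. It is an illustration only: it finds changed blocks by comparing hashes, whereas the text does not say how Rapid Recovery tracks changes, and the 4 KiB block size, the 1 MiB stand-in file, and the helper names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Hash each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Return indexes of blocks in `new` that differ from, or do not exist in, `old`.

    Blocks that were truncated away from `old` are not reported in this sketch.
    """
    old_hashes = block_hashes(old, block_size)
    new_hashes = block_hashes(new, block_size)
    return [
        idx
        for idx, digest in enumerate(new_hashes)
        if idx >= len(old_hashes) or old_hashes[idx] != digest
    ]


# A 1 MiB stand-in for the large presentation file, with a single byte edited.
original = bytes(1024 * 1024)
edited = bytearray(original)
edited[500_000] = 0x41
edited = bytes(edited)

file_level_bytes = len(edited)               # a file-level agent copies the whole file
dirty = changed_blocks(original, edited)
block_level_bytes = len(dirty) * BLOCK_SIZE  # a block-level agent copies dirty blocks only

print(f"File-level incremental:  {file_level_bytes} bytes")
print(f"Block-level incremental: {block_level_bytes} bytes ({len(dirty)} block)")
```

In this toy case a one-byte edit costs a full 1,048,576-byte copy at file level but a single 4,096-byte block at block level, which is the effect the paragraph above describes.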
The Rapid Recovery backup server allows a backup recovery point to be mounted as a file system, enabling the administrator to directly recover the desired files and folders back to the system quickly and efficiently. In addition to file level recovery, Rapid Recovery snapshots also provide bare metal recovery (BMR) by restoring the entire system back to a previous state. The true wonder, however, happens when Rapid Recovery integrates BMR into its Live Recovery technology, which allows system administrators to temporarily use the backup system as their production system. Since the Rapid Recovery server will most likely be using secondary storage (e.g. SATA drives) as its target, it might not be as performant as the production system; however, it will allow the business to continue and will give IT time to properly diagnose and repair the system.
The Live Recovery approach has major advantages over the traditional restore process. First, it allows the application to resume normal operations within minutes of an outage instead of having to wait for a diagnosis and repair. A traditional restore forces IT to repair the system as quickly as possible, often in a haphazard way, because the restore process cannot even begin until the system is repaired. Once the system is repaired, IT must still complete the long restore process before the business can resume operations. With Live Recovery, business can resume as soon as the Live Recovery copy is mounted as the production system. This makes the repair of the production system much easier to do, as it removes the time pressure from the process. Once the system is repaired, Rapid Recovery will restore any data that changed while the system was down and return it to operation. Finally, it means that there will be minimal data loss, since the recovery process is using a backup that was taken only a few minutes before the outage. (Rapid Recovery supports up to 288 snapshots per day, which is a snapshot every five minutes.)
Consider again the flat tire analogy. The modern method to help prevent getting stranded with a flat tire is to use run-flat tires, which allow the vehicle to continue being driven after a puncture or other loss of tire pressure. Punctures and other tire failures can cause accidents and injuries, and frequently create a dangerous situation while the driver attempts to pull over to the side of the highway to change a flat. The run-flat design gives the driver a backup solution that keeps the vehicle moving safely until the driver can find a repair shop and properly fix the problem. This is how the backup process should work, and this is how it does work with Rapid Recovery.
Dell is now releasing Rapid Recovery 6.0, which adds some important features. Perhaps the most important update is that it no longer requires the installation of an agent in a VM in order to back it up. Dell Rapid Snap for Virtual technology supports using the VMware API for Data Protection (VADP) to back up VMware VMs instead (as of this writing, Hyper-V is next on the list).
In addition to being able to run recovery versions of a VM from the Rapid Recovery server, customers of Dell Rapid Recovery 6.0 will also be able to run a recovery version of a VM in the cloud, regardless of whether the protected system was a VM or a physical machine when it was backed up. This creates a number of recovery scenarios that simply aren’t possible with traditional backup.
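The comparison above reduces to simple arithmetic, sketched below in Python. Only the figure of 288 snapshots per day comes from the text; every duration is a made-up illustrative value, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def snapshot_interval_minutes(snapshots_per_day: int) -> float:
    """Minutes between snapshots when they are spread evenly over a day."""
    return MINUTES_PER_DAY / snapshots_per_day


def traditional_downtime_hours(diagnose: float, repair: float,
                               rebuild: float, restore: float) -> float:
    """The restore cannot start until diagnosis, repair, and rebuild are done."""
    return diagnose + repair + rebuild + restore


def live_recovery_downtime_hours(mount: float) -> float:
    """Business resumes once the recovery copy is mounted; repair happens later."""
    return mount


# 288 snapshots per day is the figure quoted in the text.
interval = snapshot_interval_minutes(288)

# Illustrative durations in hours (assumptions, not measurements).
traditional = traditional_downtime_hours(diagnose=4, repair=6, rebuild=8, restore=5)
live = live_recovery_downtime_hours(mount=0.25)

print(f"Snapshot interval:    {interval:.0f} minutes")
print(f"Traditional downtime: {traditional} hours")
print(f"Live Recovery:        {live} hours")
```

The point of the sketch is structural rather than numerical: in the traditional path every phase adds to the outage, while in the Live Recovery path the repair phases happen after the business is already running again.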
Finally, Dell has improved the Rapid Recovery interface, making it much easier to use and understand. There is a new dashboard with widgets to provide a quick view into the Rapid Recovery environment. There are also filters to allow administrators to more easily locate particular systems that are being protected.
Dell Rapid Recovery provides the typical features necessary in a backup system, but where it truly stands out is when the Live Recovery feature is activated. It allows customers to immediately resume business operations after a major failure, and this is no small feat. This is how the backup process should work.