There are two ways to improve your recovery point objective (RPO): adjust the objective, or adjust your techniques to accomplish that objective. Technically, only the first method changes (i.e., improves) your RPO; the second method helps you better achieve your RPO.
Improving your RPO by changing it
Each application should have an RPO associated with it, and this RPO should be determined by the appropriate business unit calculating the cost of lost data. Suppose a business unit has determined that the company will lose $10,000 for every minute of data that is lost, so they’ve specified an RPO of one minute.
Now suppose the IT department determines that a system capable of meeting an RPO of one minute costs $100,000. They also determine that a system meeting a five-minute RPO costs only $10,000, and one meeting a one-hour RPO costs only $1,000. The IT department then presents these choices to the business unit and allows them to make a business decision.
One choice is to spend $100,000 on an RPO of one minute and theoretically lose only $10,000 in an outage (i.e., one minute of lost data). Another choice is to spend $10,000 on the five-minute RPO and possibly lose $50,000 during an outage (i.e., five minutes of lost data). Finally, they can choose to spend only $1,000 and possibly lose $600,000 (i.e., 60 minutes of lost data) in an outage.
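The arithmetic behind those three choices can be laid out in a short Python sketch. The dollar figures come from the example above; the function and field names are made up for illustration, and the "cost plus one outage" total is only one assumed way to compare the options, not a rule for making the decision.

```python
# Figures from the example above: $10,000 lost per minute of lost data.
LOSS_PER_MINUTE = 10_000

# (RPO in minutes, cost of a system able to meet that RPO)
OPTIONS = [(1, 100_000), (5, 10_000), (60, 1_000)]


def summarize(options, loss_per_minute=LOSS_PER_MINUTE):
    """Return one row per option with the potential loss in a single outage."""
    rows = []
    for rpo_minutes, system_cost in options:
        loss = rpo_minutes * loss_per_minute  # data lost if the RPO is just met
        rows.append({
            "rpo_minutes": rpo_minutes,
            "system_cost": system_cost,
            "loss_per_outage": loss,
            # Assumption: compare options by system cost plus ONE outage.
            "cost_plus_one_outage": system_cost + loss,
        })
    return rows


for row in summarize(OPTIONS):
    print(row)
```

How many outages to plan for, and how likely they are, is exactly the judgment the business unit has to make; the sketch only makes the trade-off visible.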
What if you can’t change the RPO?
You need to figure out how to lose less data during a recovery. It’s important to understand that the RPO is just that – an objective. The only way to improve your RPO is to do what is suggested above: get the business unit to change it. We like to use the term recovery point reality (RPR) to refer to the actual amount of data that you would lose during a recovery. What you’re really asking is, “How can I improve my RPR?”
Interestingly enough, there is a relationship between RTR (recovery time reality) and RPR. If it takes you 12 hours to back up a server or VM, the best RPO you can meet is 12 hours. (And that’s only possible if you back it up twice a day.) This is because a consistency point was made at the beginning of the backup, and you will lose any data written since that consistency point. Although the backup may have finished at 5 AM, it started at 5 PM. You will lose all data since 5 PM. Therefore, in many backup architectures, increasing backup speed may allow you to lower your RPR. But honestly, trying to lower your RPR by increasing the speed of a traditional backup is fighting a losing battle. You’re constantly at war with physics.
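The 5 PM to 5 AM example can be checked with a small Python sketch. It assumes a simplified model in which a backup is only usable once it has finished and its consistency point is its start time; the dates and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss(failure_time, backups):
    """Return how much data is lost if the system fails at failure_time.

    backups is a list of (start, finish) pairs. Only backups that finished
    by failure_time are usable, and each one recovers data as of its START.
    Returns None if no backup had finished yet.
    """
    usable_starts = [start for start, finish in backups if finish <= failure_time]
    if not usable_starts:
        return None
    return failure_time - max(usable_starts)


# A 12-hour backup that starts at 5 PM and finishes at 5 AM the next day.
start = datetime(2024, 1, 1, 17, 0)
backups = [(start, start + timedelta(hours=12))]

print(data_loss(datetime(2024, 1, 2, 5, 0), backups))  # 12:00:00
print(data_loss(datetime(2024, 1, 2, 9, 0), backups))  # 16:00:00
```

Even a failure at the very moment the backup completes loses 12 hours of data, and the loss keeps growing until the next backup finishes.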
The other way to lower your RPR is to take advantage of database transaction logs and redo logs. These are created throughout the day and can be used to recover to a point in time later than when the backup was taken. The key to using these to lower your RPR is to ensure that they get replicated off-site throughout the day. They are a great tool, but they can’t help you if they’re only available on the machine that died.
The most common way people try to lower their RPR to an hour or less is with snapshots and replication, also known as near-continuous data protection (near-CDP). This may be provided by your operating system, such as VSS snapshots in Windows, or by your storage system, such as the snapshot facilities built into most modern storage arrays. Either way, the key is to get an application-consistent snapshot on the source system and then replicate that snapshot to the destination system. Depending on the capabilities of the snapshot system, you may be able to meet an RPO of only a few minutes.
Meeting an RPO of less than a minute is only possible with a continuous data protection (CDP) system. That is, every single write is synchronously copied to another storage system that will be used in time of recovery. Synchronous replication, of course, requires locating the target system very close to the source system. This is why most people opt for asynchronous replication, which allows them to move the target system farther away. Of course, as soon as you do this, you are acknowledging that you will not meet an RPO of zero. Your RPR will vary with the bandwidth available for replication at any given point during the day. The truly paranoid perform synchronous replication to a local system, which then performs asynchronous replication to a remote system. This provides a combination of an RPR of zero for most recoveries and an RPR of something close to zero in a full disaster.
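To see why asynchronous replication gives an RPR that moves with write volume and bandwidth, consider the toy Python model below. It is only a sketch under stated assumptions: writes are counted in whole megabytes per minute, the link has a fixed capacity, data is replicated in the order it was written, and every number is invented for illustration.

```python
from collections import deque


def replication_lag(writes_mb, bandwidth_mb_per_min):
    """Return, for each minute, the age in minutes of the oldest write
    that has not yet been fully replicated (0 means fully caught up)."""
    queue = deque()  # entries are [minute_written, mb_still_to_send]
    lags = []
    for minute, written in enumerate(writes_mb):
        if written > 0:
            queue.append([minute, written])
        budget = bandwidth_mb_per_min  # link capacity for this minute
        while queue and budget > 0:
            sent = min(budget, queue[0][1])
            queue[0][1] -= sent
            budget -= sent
            if queue[0][1] == 0:
                queue.popleft()  # this minute's writes are safely remote
        lags.append(minute - queue[0][0] + 1 if queue else 0)
    return lags


# Hypothetical workload: a 300 MB burst in minute 2 on a 100 MB/min link.
print(replication_lag([50, 50, 300, 50, 50, 50], 100))  # [0, 0, 1, 2, 2, 1]
```

In this model a single burst leaves the remote copy up to two minutes behind, and it is still catching up three minutes later. With synchronous replication the lag would be zero by definition, at the price of keeping the target close to the source.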
StorageSwiss Take
Besides negotiating for a longer RPO, the best thing you can do to improve your RPR is to move off of traditional backups that back up the entire system every time. They simply take too long and require too many resources to perform continuously throughout the day. By design, this means you are accepting an RPR of at least 24 hours during a disaster, since you will only back up once a day. Snapshots and replication can significantly improve your RPR. For an RPR of close to zero, however, your only choice is CDP. But since CDP systems are both rare and expensive, they should be your choice of last resort.