Recovery Time Objective (RTO) is the target for how long it should take to recover an application or server so users can log in and get back to work. IT is under constant pressure to reduce this time. There are two broad methods you can use to improve your RTO: adjust the objective itself, adjust your techniques for accomplishing that objective, or both. Technically, the first method changes your RTO; the second method helps you to better achieve your RTO.
Meeting Your RTO By Changing It
Each application should have an RTO associated with it, and this RTO should be determined by the appropriate business unit calculating the cost of downtime. Suppose a business unit has determined that the company will lose $10,000 for every minute that a given application is down, so they’ve specified an RTO of one minute.
Now suppose the IT department determines that a system capable of meeting an RTO of one minute costs $100,000. They also determine that a system with a five-minute RTO costs only $10,000. Finally, suppose they determine that a one-hour RTO costs only $1,000. The IT department then presents these choices to the business unit and allows them to make a business decision.
One choice is to spend $100,000 on an RTO of one minute and theoretically lose only $10,000 in an outage (i.e. one minute of downtime). The other choice is to spend $10,000 on the five-minute RTO and possibly lose $50,000 during an outage (i.e. five minutes of downtime). Finally, they can choose to spend only $1,000 and possibly lose $600,000 (i.e. 60 minutes of downtime) in an outage.
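The arithmetic behind these three choices can be laid out in a few lines of Python. This is only a sketch: the dollar figures come from the example above, while the function names and the idea of adding the system cost to the loss from a single outage are assumptions made for illustration, not a formal costing method.

```python
# Figures from the example above; names and structure are illustrative.
DOWNTIME_COST_PER_MINUTE = 10_000  # dollars lost per minute of outage

# (RTO in minutes, cost in dollars of a system able to meet that RTO)
OPTIONS = [(1, 100_000), (5, 10_000), (60, 1_000)]


def outage_loss(rto_minutes, cost_per_minute=DOWNTIME_COST_PER_MINUTE):
    """Loss from one outage that lasts exactly as long as the RTO."""
    return rto_minutes * cost_per_minute


def total_exposure(rto_minutes, system_cost, outages=1):
    """System cost plus downtime losses over a number of outages.

    Assumption: every outage lasts exactly the RTO and nothing else
    (insurance, reputation, penalties) is counted.
    """
    return system_cost + outages * outage_loss(rto_minutes)


for rto, system_cost in OPTIONS:
    print(
        f"RTO {rto:>2} min: system ${system_cost:>7,}, "
        f"loss per outage ${outage_loss(rto):>7,}, "
        f"total after one outage ${total_exposure(rto, system_cost):>7,}"
    )
```

Under the single-outage assumption, the combined figures are $110,000, $60,000, and $601,000, so the five-minute option has the lowest total. With no outage at all, the one-hour option is the cheapest, which is why the expected number of outages matters to the business decision.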
What if they were able to find an insurance policy to cover the event of the $600,000 outage? A business person might choose the thousand dollar option that gives them a one hour RTO and let the insurance company take the risk. This is the easiest way to improve your RTO: get the business unit to change it.
What is Your Recovery Time Reality (RTR)?
When you ask how to improve your RTO, what you actually want is a faster recovery. It’s important to understand that the RTO is just that: an objective. The only way to improve your RTO is to do what is suggested above: get the business unit to change it. We like to use the term recovery time reality (RTR) to refer to the amount of time it actually takes to perform a recovery. What you’re really asking is “How can I improve my RTR?”
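One way to make the distinction concrete is to record how long your recovery tests actually take and compare that with the objective. The sketch below assumes you have a handful of measured recovery durations; the numbers are hypothetical, and treating the worst observed test as the RTR is a conservative choice made here for illustration, not a standard definition.

```python
from datetime import timedelta


def recovery_time_reality(test_durations):
    """Return the longest recovery observed across the recovery tests.

    Assumption: the slowest test is the honest measure of what you can
    deliver, so it is used as the RTR.
    """
    if not test_durations:
        raise ValueError("need at least one measured recovery")
    return max(test_durations)


def meets_rto(rtr, rto):
    """True when the measured recovery time is within the objective."""
    return rtr <= rto


# Hypothetical results from three recovery tests of one application.
measured = [
    timedelta(minutes=42),
    timedelta(minutes=55),
    timedelta(minutes=47),
]
rto = timedelta(minutes=30)  # hypothetical objective set by the business unit

rtr = recovery_time_reality(measured)
print(f"RTO {rto}, RTR {rtr}, objective met: {meets_rto(rtr, rto)}")
```

If the check fails, you have the two choices described earlier: renegotiate the objective, or change how you recover.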
Tape vs. Disk (Recovery Edition)
The first thing that most companies can do to improve their recovery time is to use disk as their main recovery mechanism. Tape has a lot of things going for it, and I am far from being a tape basher. But when it comes to recovering dozens or hundreds of applications from a disaster, it’s really hard to beat disk. A large RAID array can simultaneously feed dozens or hundreds of recoveries without breaking a sweat. More importantly, each of those recoveries can run at whatever speed its target device supports.
The problem with tape is that to get full performance from it you have to keep it streaming. That means that the receiving server and its storage must be able to keep pace. If they cannot, the tape drive will start to adjust its speed and, when it cannot slow down enough, repeatedly stop, reposition, and restart, a phenomenon known as shoe-shining.
Most backup systems that support tape use multiplexing, which interleaves multiple backups in order to keep the tape drive streaming. During a large recovery, however, you are reading all of those streams and throwing away most of them. This means that the speed of a recovery from tape will be dictated more by your multiplexing settings than anything else. Disk backups aren’t multiplexed and do not have this issue.
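A rough estimate shows how much multiplexing can cost you at restore time. The sketch below assumes the streams were interleaved evenly, so that only one part in N of what the drive reads belongs to the backup being restored. The drive speed, multiplexing factor, and backup size are hypothetical, and real products and drives will behave less neatly.

```python
def restore_rate_mb_s(drive_rate_mb_s, multiplex_factor):
    """Approximate useful restore rate for ONE stream on a multiplexed tape.

    Assumption: streams are interleaved evenly, so 1/multiplex_factor of
    the data read from tape belongs to the stream being restored.
    """
    if multiplex_factor < 1:
        raise ValueError("multiplex_factor must be at least 1")
    return drive_rate_mb_s / multiplex_factor


def restore_hours(size_gb, drive_rate_mb_s, multiplex_factor):
    """Hours needed to restore size_gb of one stream (1 GB = 1000 MB)."""
    rate = restore_rate_mb_s(drive_rate_mb_s, multiplex_factor)
    return size_gb * 1000 / rate / 3600


# Hypothetical numbers: a 300 MB/s drive and a 1,000 GB backup.
for factor in (1, 4, 8):
    hours = restore_hours(1000, 300, factor)
    print(f"multiplexing {factor}: about {hours:.1f} hours")
```

With these made-up figures, a multiplexing factor of 8 stretches a restore of under an hour to more than seven hours, even though the drive streams at full speed the whole time.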
It’s All About Location
The next thing that you can do to lower your recovery time is to make sure that your recovery media – be it disk or tape – is co-located with your recovery servers. Cloud backup is an awesome thing, but if the only copy of your data is in the cloud and you are recovering servers in your data center, you will find yourself banging your head against the wall as the download of all your data takes days or even weeks. Either figure out how to recover applications in the cloud, or figure out how to get a local copy of your backup data in your data center.
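It is worth doing the download math before a disaster rather than during one. The sketch below estimates transfer time from the amount of data and the link speed. The data size, link speeds, and the 80% efficiency factor are hypothetical values chosen for illustration; substitute your own measurements.

```python
def download_days(data_tb, link_mbps, efficiency=0.8):
    """Days needed to download data_tb terabytes over a link_mbps link.

    Assumptions: 1 TB = 10**12 bytes, link speed is in megabits per
    second, and only `efficiency` of the nominal bandwidth is usable.
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400


# Hypothetical data center: 50 TB of backup data held only in the cloud.
for mbps in (100, 1000, 10_000):
    print(f"{mbps:>6} Mbps: about {download_days(50, mbps):.1f} days")
```

With these example numbers, 50 TB takes nearly six days over a 1 Gbps link and close to two months at 100 Mbps, which is the "days or even weeks" problem in concrete terms.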
The Best Recovery is No Recovery
The best thing you can do to improve your recovery speed is to not do a recovery at all. This includes making sure that you use highly available storage and highly available compute nodes for applications that need high availability. But it also means examining modern data protection mechanisms that can do what we call recovery in place; vendors refer to this as instant recovery or boot from backup. That is, the vendors provide a feature that allows you to run your servers or data stores directly from the backup, instead of having to wait for a restore before you can use your systems again. This is relatively new functionality and must be tested extensively, but it is the quickest way to recover your data in most scenarios.
Summary
The only thing you can do to improve your RTO is to negotiate a more realistic objective with the business unit of each application. Have an open dialogue with them about the cost of what they are asking for and make sure that it is in alignment with what they expect. Once you’ve agreed upon an objective, the best thing you can do from a recovery perspective is to switch to disk as your primary protection mechanism. Secondly, make sure that your recovery data is located in the same place you plan to do a recovery. Finally, examine modern data protection methods such as instant recovery that can allow you to meet almost any RTO.