When using backup as a service (BaaS) or disaster recovery as a service (DRaaS), the perception is that backup performance is about the same across solutions, since they all rely on the Internet for transfer and they all use similar techniques, such as deduplication, compression, and bandwidth optimization, to get the most out of that Internet connection. But how those techniques are implemented can vary widely between vendors, and it is critical to understand how they work in order to select the best-performing BaaS or DRaaS solution.
Compression and Deduplication
Compression is the age-old practice of finding and eliminating repetitive patterns within a data stream. It looks only at an individual file or data stream and attempts to find patterns within it that can be replaced with shorter representations. A simple example: a file full of white space (i.e. blocks containing no actual data) can be compressed easily by recording how much white space it contains and where it is located, rather than actually storing the white space.
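The white-space example above is easy to demonstrate. This minimal sketch uses Python's standard `zlib` compressor on a block of zero bytes; the repeated pattern is recorded once instead of being stored verbatim, so the compressed form is a tiny fraction of the original:

```python
import zlib

# A 64 KiB block of "white space" (all zero bytes). The compressor
# replaces the long run of identical bytes with a short description
# of the pattern, rather than storing every byte.
block = b"\x00" * 65536
compressed = zlib.compress(block)

print(len(block), "->", len(compressed))  # e.g. 65536 -> well under 1% of the original

# Decompressing restores the original block exactly (lossless).
assert zlib.decompress(compressed) == block
```

Real backup streams are far less uniform than a block of zeros, so ratios vary with the data, but the principle is the same: patterns within a single stream are replaced with something smaller.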
Although deduplication is often confused with compression, it is actually very different. Deduplication looks at a block of data and attempts to determine whether that block, or a portion of it, has been seen before by the deduplication system. Most deduplication is done via hashing. Any given chunk of data, whether a few bytes or many gigabytes, can be represented by a single value known as a hash.
(For example, a paragraph of text like the one above converts into a single SHA-256 hexadecimal hash such as: 8d8242fe45b0213196941382ab9650db2c0e9854b852bdd1da46bcda47bc2d23.)
The hash cannot be reverse-engineered into its contents; it can only be used for comparison. Take two segments of data and calculate their hashes; if the hashes are the same, the contents are, for all practical purposes, the same. If a file, or a portion of a file, in one backup is the same as something that has already been stored from another backup, the hash values will match and a very large chunk of data can be replaced by a very small hash. This not-so-simple technique can reduce the amount of data sent across the Internet by two orders of magnitude, which is why so many consider deduplication to be table stakes for cloud backup products.
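The hash-comparison workflow described above can be sketched in a few lines. This is a deliberately simplified model, not any vendor's implementation: the `backup` function and its `store` dictionary are hypothetical names standing in for the client and the provider's hash index, and real systems chunk data at variable boundaries rather than per whole block:

```python
import hashlib

def backup(blocks, store):
    """Transmit only blocks whose SHA-256 hash has not been seen before."""
    sent = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in store:
            store[digest] = block   # new data: the full block crosses the wire
            sent.append(digest)
        # otherwise: only the tiny hash is needed to reference the stored copy
    return sent

store = {}
first = backup([b"block-A", b"block-B", b"block-A"], store)
second = backup([b"block-A", b"block-C"], store)

print(len(first), len(second))  # 2 blocks sent the first time, only 1 the second
```

The second backup sends a single new block; everything it shares with the first backup is replaced by hash lookups, which is where the large reductions in transmitted data come from.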
The final backup speed optimization technique is really a family of techniques known as bandwidth optimization. They include choosing the optimal protocol for any particular operation, a decision that must be made at design time with the knowledge that the transfer will happen over a WAN connection. Network traffic can also be shaped and prioritized differently based on the kind of connection it is using. Finally, a technique known as forward error correction (FEC) can be used to minimize retransmission of data that is corrupted or lost in transit. Error-correcting codes sent along with the data can be used to reconstruct corrupt or missing data, similar to the way RAID arrays reconstruct missing data from failed drives. WAN optimization is a science unto itself and can yield incredible results when the application sending the data applies these techniques with knowledge of the type of data being sent and the connection over which it will be sent.
Sending backup data across the Internet is not something to be taken lightly. First, the data must be reduced as much as possible using data reduction techniques such as compression and deduplication. Second, the data must be sent using bandwidth optimization techniques, including traffic shaping, protocol optimization, and forward error correction. Each of these techniques can have a significant impact on backup speed and network cost, and using all of them together can yield tremendous benefits, provided the techniques are implemented well. Don't assume that all implementations of these efficiency and optimization techniques are created equal.