Eventually, most organizations will run out of backup storage, so understanding how backup storage scales is critical to creating a long-term data protection strategy. It is not as simple as converting to a scale-out architecture, because deduplication adds a layer of complexity that makes a scale-out cluster more challenging.
The most common approach used in backup appliances is the scale-up approach. A scale-up backup appliance consists of a server, typically running some version of Linux, and the deduplication software. A small appliance will use internal storage that is directly connected to the server via the PCI bus. Eventually, most customers will need more storage than can fit inside a physical server, so the more typical appliance will have one or more shelves of RAID-based storage attached to the server via Fibre Channel or SAS.
The advantage to this approach is that all data sent to this appliance will be deduplicated against all other data sent to this appliance. It doesn’t matter what backup client the data came from or when it was backed up. Any new data will be compared against all previous data stored in the appliance, which increases the deduplication ratio and reduces the amount of storage necessary to store your backup data.
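The single-pool behavior described above can be illustrated with a minimal sketch. The class below is a toy, not any vendor's implementation: it uses fixed-size chunks and SHA-256 fingerprints for simplicity (real appliances typically use variable-length chunking), and every chunk, from any client and any backup, is checked against the same fingerprint index.

```python
import hashlib

CHUNK_SIZE = 4096  # toy fixed-size chunking; real appliances use variable-length chunks


class DedupeStore:
    """Toy single-pool deduplication store: one fingerprint index for all clients."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes (the single dedupe pool)
        self.raw_bytes = 0  # total bytes ingested before deduplication

    def ingest(self, data: bytes) -> list:
        """Store a backup stream, keeping only chunks not already in the pool."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store only previously unseen chunks
            recipe.append(fp)
            self.raw_bytes += len(chunk)
        return recipe  # fingerprints needed to rebuild this stream on restore

    def dedupe_ratio(self) -> float:
        stored = sum(len(c) for c in self.chunks.values())
        return self.raw_bytes / stored if stored else 0.0


store = DedupeStore()
store.ingest(b"A" * 8192)                 # first backup: two identical chunks
store.ingest(b"A" * 8192 + b"B" * 4096)   # second backup shares its first two chunks
print(round(store.dedupe_ratio(), 2))     # prints 2.5
```

Because there is only one index, the second backup's repeated chunks consume no new storage, which is exactly why a larger shared pool yields a higher deduplication ratio.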
The disadvantage to this approach is that you can only attach so much storage behind an individual server, especially if all of it is one large dedupe pool. Eventually, you will need to purchase another server with its own storage in order to scale beyond the capabilities of the original server. The problem with this, of course, is that the second server will typically have no knowledge of the backups stored on the first server and will not deduplicate against those backups. Some scale-up systems have been able to create an HA pair of servers that deduplicate against each other, but even this approach eventually has a limit, roughly twice the limit of a single server.
The scale-out approach is less common in backup appliances due to the complications involved in deduplicating data in a multinode architecture, but such appliances do exist. When examining a scale-out backup appliance, it is important to understand whether or not it has a true global deduplication pool, and the impact of managing what is now a very complex deduplication process.
Some scale-out products are scale-out only in that they are centrally managed; each node in the system is essentially a deduplication island. Data is only deduplicated against data that was sent to that node; it is not compared against data sent to other nodes. A few scale-out backup appliances have the ability to globally deduplicate data across multiple nodes so that any data sent to any node is compared against data sent to all other nodes. This would be the preferred approach, but it is also the most technically complicated.
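One common way a cluster can achieve a truly global pool is to route each chunk to a node based on its fingerprint, so that identical chunks always land on the same node no matter which node received the backup. The sketch below is a simplified illustration of that idea using a basic hash ring; it is an assumption about one possible design, not how any particular product works, and real systems use many virtual nodes per physical node plus rebalancing logic.

```python
import hashlib
from bisect import bisect_right


class GlobalDedupeCluster:
    """Toy fingerprint-routed cluster: a chunk's hash determines its owning node,
    so duplicate chunks always meet on the same node -- one logical, global pool."""

    def __init__(self, nodes):
        # one ring position per node (real systems use many virtual nodes each)
        self.ring = sorted((int(hashlib.sha256(n.encode()).hexdigest(), 16), n)
                           for n in nodes)
        self.stores = {n: set() for n in nodes}  # node -> fingerprints it holds

    def owner(self, fp: str) -> str:
        """Find the node that owns this fingerprint's position on the ring."""
        keys = [k for k, _ in self.ring]
        idx = bisect_right(keys, int(fp, 16)) % len(self.ring)
        return self.ring[idx][1]

    def ingest_chunk(self, chunk: bytes):
        """Route a chunk to its owning node; return (node, was_it_new)."""
        fp = hashlib.sha256(chunk).hexdigest()
        node = self.owner(fp)
        is_new = fp not in self.stores[node]
        self.stores[node].add(fp)
        return node, is_new


cluster = GlobalDedupeCluster(["node-a", "node-b", "node-c"])
n1, new1 = cluster.ingest_chunk(b"payroll-block")  # arrives via one client
n2, new2 = cluster.ingest_chunk(b"payroll-block")  # same data via another client
print(n1 == n2, new1, new2)  # prints: True True False
```

The cost of this design is the complexity the article alludes to: every ingest may involve a network hop to the owning node, and adding or removing nodes forces fingerprints to be rebalanced across the ring.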
Object Storage: Simple Scale-Out, Without Deduplication
In our entry “What are Your Backup Storage Options?” we made the case that hardware-based deduplication is not as critical a feature as it used to be. But the need to scale backup capacity will always be with us. It might make more sense, then, to focus on an object storage system, known for its scalability, and leave data efficiency to the backup software application.
Both the scale-up and scale-out approaches have their advantages and disadvantages in all areas of storage. In the backup target appliance market, however, the issue of deduplication demands a closer examination of how a given product works. If you are considering a scale-up approach, make sure you know the limits: how large can the appliance you are purchasing grow before you must purchase a second one? If you are considering a scale-out approach, make sure you understand exactly how deduplication works in that system; do not assume it is global across all nodes. Knowing how your product works is the best way to use it properly.
Sponsored by Cloudian