Deduplication can be deployed as either a “target” based solution, where backups are pushed over the network to an appliance that performs the deduplication, or a “source” based approach, where deduplication takes place at the client or server level. Some vendor offerings provide a hybrid of both source and target based deduplication to give the end user the benefits of each. As might be expected, the decision to adopt a target, source or hybrid based approach depends largely on the business use case, the characteristics of the underlying data and where the end user is in their backup technology lifecycle. This article examines all three approaches to data “dedupe” and provides guidance on how to determine which is the best fit for your requirements.
Target Based Deduplication
Deduplication was the tipping point technology that made disk based backup a reality, and target based data deduplication was the first entry into that market. As a pioneering technology, it faced stiff competition from the traditional tape backup marketplace. When deduplication appeared on the market approximately 10 years ago, disk was primarily being used as a cache or staging area, integrated with tape devices via virtual tape library (VTL) technology.
Initial target based deduplication product offerings were relatively small in overall density and performance. However, the messaging around data deduplication (20:1 data reduction ratios) on disk based appliances proved very compelling for enterprises weary of managing the operational complexities of a tape based and/or VTL infrastructure. Early adoption led to ever increasing market penetration, which eventually resulted in the near full displacement of VTL.
Why the rapid adoption of target based data deduplication, which eventually became known as dedupe? In addition to the desire to migrate away from tape and all its issues (operational overhead, risk, etc.), deploying a target based dedupe solution is nondisruptive and relatively straightforward. A customer does not have to radically change how they do backups. Rather than directing their backup workload streams to a tape library, they can simply re-point their backups to a “target” appliance, where all the data deduplication takes place under the hood. As long as the backup application supported disk based backup targets, customers could use most of the deduplication appliances that were coming to market.
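To make "under the hood" more concrete, here is a minimal sketch of what a target appliance does with an incoming backup stream. It is an illustration only, not any vendor's implementation: the `DedupeTarget` class, the fixed 4 KB chunk size and the use of SHA-256 fingerprints are all assumptions made for the example (commercial appliances typically use more sophisticated, often variable-size, chunking).

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class DedupeTarget:
    """Hypothetical stand-in for a target appliance that receives full backup streams."""

    def __init__(self):
        self.chunks = {}        # fingerprint -> chunk bytes, each unique chunk stored once
        self.recipes = {}       # backup name -> ordered list of fingerprints
        self.logical_bytes = 0  # total bytes the backup application has sent

    def ingest(self, name, stream):
        """Split the stream into chunks and store only chunks not seen before."""
        recipe = []
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.recipes[name] = recipe
        self.logical_bytes += len(stream)

    def restore(self, name):
        """Rebuild a backup from its recipe."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def physical_bytes(self):
        return sum(len(c) for c in self.chunks.values())

    def ratio(self):
        """Logical bytes received divided by physical bytes stored."""
        return self.logical_bytes / self.physical_bytes()


# Two nightly full backups that differ in a single chunk.
monday = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
tuesday = monday[:7 * CHUNK_SIZE] + bytes([99]) * CHUNK_SIZE

target = DedupeTarget()
target.ingest("monday", monday)
target.ingest("tuesday", tuesday)
print(len(target.chunks), "unique chunks stored for 16 chunks received")
print("dedupe ratio:", round(target.ratio(), 2))
```

Note that the backup application still sends every byte of both full backups over the network. The savings appear only in what the appliance stores, which is why no change to the backup application is needed.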
In addition, since target dedupe systems are backup application agnostic, there is no need to change backup applications, make major modifications to the underlying network infrastructure or change operational processes. Users also have the option to “tape out” from the dedupe appliance to a physical tape library. This is useful for those customers that have a requirement to maintain a copy of backup data for long periods of time.
Deduplication also enhanced disaster recovery. By leveraging data dedupe with nightly replication, IT users can efficiently get backup data offsite without the risk of data physically leaving the four walls of the data center on courier vehicles. In fact, many users have opted to migrate long-term archiving on tape from their primary data center to the replica target data center (or co-lo facility) so that tapes never have to leave their premises. This further mitigates the risk of lost tapes, lost data and any embarrassing press releases from such an event.
Target Dedupe Use Cases
As previously discussed, target based deduplication is a great way to augment existing tape based backup infrastructures due to its simple integration with existing backup applications. So if an end user is happy with their current backup application, a target based approach may be very compelling as it does not necessitate a significant change of the backup infrastructure or the relearning of a new backup application.
One of the strongest use cases for target based dedupe is protecting large Oracle or SQL database environments (in excess of 2 TB). Since each of these applications can natively back up its data to a target based appliance, there is little or no integration barrier to protecting these environments rapidly. Some deduplication appliance manufacturers even offer direct integration with database backup tools like Oracle RMAN. This helps drive out some of the inefficiencies in the data center, as both backup administrators and DBAs can share a common pool of backup resources and achieve reductions where there may be an overlap in backup processes.
Another benefit of database target dedupe is a reduction in the consumption of primary storage resources. Many DBAs utilize a disk to disk to tape backup scheme to protect log files, tablespaces, etc. on primary storage that has no deduplication capability. Between periodic snapshots and direct dumps to disk, there tends to be an overconsumption of expensive disk resources when protecting database environments. By migrating database backups to a dedupe appliance, scarce primary storage resources can be reclaimed for production applications. What’s more, the dedupe effect for database environments can range from a 5:1 to a 20:1 reduction in the physical disk or tape space required to perform normal backup operations, potentially generating a return on investment.
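As a rough worked example of what the 5:1 to 20:1 range means in capacity terms, the sketch below computes the physical space needed to retain a set of full backups. The 2 TB database size echoes the figure used above; the retention of 30 full copies and the `physical_tb` helper are assumptions made purely for illustration, and real ratios depend heavily on the data and its change rate.

```python
def physical_tb(full_backup_tb, retained_copies, ratio):
    """Physical capacity (TB) needed to hold the retained backups at a given dedupe ratio."""
    logical_tb = full_backup_tb * retained_copies  # what would be stored without dedupe
    return logical_tb / ratio


# Assumed scenario: a 2 TB database with 30 retained full backups (60 TB logical).
for ratio in (5, 20):
    needed = physical_tb(2, 30, ratio)
    print(f"{ratio}:1 -> {needed} TB physical for 60 TB of logical backup data")
```

Under these assumptions, the same 60 TB of logical backup data occupies 12 TB at 5:1 and 3 TB at 20:1.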
In short, target based deduplication provides an easy entry point for customers interested in leveraging the efficiencies of dedupe to reduce backup windows, enhance data protection and improve overall operational efficiencies in the data center without making major changes to their environment.
Source Based Dedupe
Source based deduplication generally consists of backup application software with embedded deduplication at the client layer and some form of disk storage to serve as the repository for backup data. As the name implies, source based data deduplication takes place at the source, where data originates: at the server or application layer. Source based deduplication offerings place a lightweight backup agent on the virtual or physical server and then back up only the unique data segments that have changed since the prior backup job. The data segments are then sent over the LAN and/or WAN to a disk based storage grid for protection.
The advantages of source based dedupe are short backup windows and a large reduction in the volume of LAN/WAN traffic generated during the backup window. Instead of pushing a full backup over the wire from a media server each night, unique deduplicated backup segments trickle between the application hosts and the backend dedupe storage grid. The data transfers are extremely efficient, since for unchanged data they amount to little more than a metadata handshake. Think of source side dedupe as an incremental forever backup (only more efficient) with the benefit of producing a full backup each night.
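The exchange between agent and storage grid described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a real product's protocol: `SourceAgent`, `StorageGrid`, the fixed 4 KB segment size and the SHA-256 fingerprints are all hypothetical names and choices made for the example.

```python
import hashlib

SEGMENT_SIZE = 4096  # assumed fixed segment size, chosen only for illustration


def segments(data):
    return [data[i:i + SEGMENT_SIZE] for i in range(0, len(data), SEGMENT_SIZE)]


def fingerprint(segment):
    return hashlib.sha256(segment).hexdigest()


class StorageGrid:
    """Hypothetical backend: stores each unique segment once, plus a recipe per backup."""

    def __init__(self):
        self.store = {}    # fingerprint -> segment bytes
        self.recipes = {}  # backup name -> ordered list of fingerprints

    def missing(self, fingerprints):
        # Metadata-only exchange: report which fingerprints the grid has never seen.
        return {fp for fp in fingerprints if fp not in self.store}

    def commit(self, name, fingerprints, new_segments):
        self.store.update(new_segments)
        self.recipes[name] = list(fingerprints)

    def restore(self, name):
        # Every recipe describes a complete backup, so any night restores as a full.
        return b"".join(self.store[fp] for fp in self.recipes[name])


class SourceAgent:
    """Hypothetical lightweight agent running on the protected host."""

    def __init__(self, grid):
        self.grid = grid

    def backup(self, name, data):
        segs = segments(data)
        fps = [fingerprint(s) for s in segs]
        wanted = self.grid.missing(fps)  # ask before sending any segment data
        payload = {fp: s for fp, s in zip(fps, segs) if fp in wanted}
        self.grid.commit(name, fps, payload)
        return sum(len(s) for s in payload.values())  # segment bytes sent over the wire


grid = StorageGrid()
agent = SourceAgent(grid)
night1 = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(8))
night2 = night1[:7 * SEGMENT_SIZE] + bytes([99]) * SEGMENT_SIZE  # one segment changed

print("night 1 sent:", agent.backup("night1", night1), "bytes")
print("night 2 sent:", agent.backup("night2", night2), "bytes")
print("night 2 restores as a full backup:", grid.restore("night2") == night2)
```

In this sketch the second night sends a single changed segment, yet the recipe stored on the grid still describes the whole data set. That is the sense in which source side dedupe behaves like an incremental forever backup that yields a full backup each night.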
In general, source based dedupe is a great solution for environments with a low daily data change rate. Strong use cases for source based dedupe solutions include data on the edge, such as remote offices and laptops in the field (great for C level execs), and file heavy environments (NAS, VMware). In the case of remote offices, the investment in source based dedupe can often be justified by the cost avoidance attained by eliminating the legacy remote office backup infrastructure: backup server hardware, software, tape libraries, tape media, courier expenses, etc.
Some manufacturers even provide a virtual edition copy of their dedupe software to run inside virtual machines or at the hypervisor layer. So in many edge offices, local restores can be performed off the virtual machine local disk while backups can efficiently traverse the WAN and be stored in the data center for DR purposes.
Adopting a source based deduplication solution may involve introducing new backup software and hardware into the data center. Some technology suppliers have integrated source based deduplication into their legacy backup applications; however, feature bolt-ons often don’t provide the same level of efficiency as deduplication systems built from the ground up. In short, it is worth considering replacing your backup application if it is near the end of its useful life and falls short on extended features like dedupe.
Hybrid Dedupe
Some manufacturers now tout the ability to deliver source and target based deduplication under a single management framework. In this scenario, all backup workloads are controlled and optimized from a single console, and a common disk storage appliance maintains all the backup data. While this seems like the logical progression for architecting deduplication into data center environments, it has not been deployed on the same scale to date as homogeneous source or target based solutions. Over the next year, the integration for supporting a hybrid source and target environment under a common management and storage platform will become tighter and be closer to production ready.
Deduplication, whether source or target, has become widespread across enterprises of all sizes. The best way to apply dedupe is to examine the use cases based on the data attributes in your environment and assess whether a wholesale change of the entire backup infrastructure is in order or whether an additive component is the best approach.