Backing Up Distributed Data

The modern data center is no longer confined to four walls. Instead, data is highly distributed across remote branch offices, end-user laptops and even devices like smartphones and tablets. The data protection process, and the hardware and software that run it, need to handle the challenges of diverse data types (databases, files) and a broad distribution of those data assets (mobile, remote office, data center and DR site).

The Distributed Backup Roadblocks

The primary challenge to distributed data backup is that the interconnectivity between the various locations where data resides is generally more limited than the bandwidth available within the data center itself. In addition, the connectivity is much more variable in quality: WAN and Internet connections may not always be active, or may vary in their responsiveness. Consequently, bandwidth and its relative unpredictability have become a major roadblock that needs to be addressed.

A second aspect of this challenge is that data transfers may range from a few small files on an endpoint device, to dozens of large databases in a branch office, to the replication of massive repositories of central data center information to a remote DR location.

Another major challenge with distributed data backup is the wide variety of applications that are typically deployed to move this information. For example, there are often specific software tools that perform data center to data center backup, others that back up remote virtualization clusters and still others that back up endpoint devices like laptops, tablets and smartphones.

These distributed backup applications are often incompatible with the central data center backup application. This incompatibility increases software licensing costs, backup management costs and even backup data storage costs.

Step One: Understand Your Data

The first step in protecting the distributed enterprise is to define what the architecture needs to look like for the data protection goals to be achieved. Identifying the characteristics of your important data can help. Understanding where that data is located, where it moves to and how often it changes are all important aspects of protecting the broader enterprise.
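
As an illustration, a simple inventory that captures where each data set lives, where its protected copy needs to go and how fast it changes can drive these decisions. The sketch below is hypothetical; the class and field names are assumptions, not part of any particular backup product.

```python
# Hypothetical inventory of data assets used to reason about protection policy.
from dataclasses import dataclass

@dataclass
class DataAsset:
    name: str             # e.g. "branch-office file share"
    location: str         # "data center", "branch office", "endpoint", ...
    moves_to: str         # where the protected copy needs to end up
    daily_change_mb: int  # rough rate of change; drives bandwidth planning

assets = [
    DataAsset("ERP database", "data center", "DR site", 5_000),
    DataAsset("branch file share", "branch office", "primary data center", 300),
    DataAsset("sales laptops", "endpoint", "primary data center", 50),
]

# Sort by change rate to see which assets put the most pressure on the WAN.
for asset in sorted(assets, key=lambda a: a.daily_change_mb, reverse=True):
    print(f"{asset.name}: ~{asset.daily_change_mb} MB/day, {asset.location} -> {asset.moves_to}")
```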

It is also important to segment the remote and branch offices from the data center and protect them first as entities, so that a high quality, efficient backup can be performed. This is typically a local backup that facilitates fast local recovery and even bare metal restore in the event of a system failure. At the same time, these local backups need to be replicated into a single backup storage target at the primary data center for disaster recovery in the event an entire site is lost. Finally, you need to understand where the disaster recovery copy of the consolidated data set will be stored, and this needs to be centrally managed to reduce management overhead and simplify restores. For example, are there two primary data centers that need to be cross-replicated, or is there a single data center that needs to have all its data sent to a separate location?
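
The flow described above can be summarized as "protect locally first, then replicate centrally." The following sketch uses invented helper functions purely to show the ordering; it does not reflect the interface of any specific backup product.

```python
# A minimal sketch of "protect locally, then replicate to the central/DR target".
# backup_locally() and replicate() are hypothetical helpers, not product APIs.

def backup_locally(site: str, local_target: str) -> None:
    # Local backup: enables fast local recovery and bare metal restore.
    print(f"backing up {site} to local target {local_target}")

def replicate(source: str, destination: str) -> None:
    # Replication to the central target: protects against losing the whole site.
    print(f"replicating {source} to {destination}")

def protect_site(site: str, local_target: str, central_target: str) -> None:
    backup_locally(site, local_target)
    replicate(source=local_target, destination=central_target)

protect_site("branch-office-01", "branch-01-backup", "primary-dc-dr-repository")
```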

Step Two: Establish a Deduplication Standard

For each of these data locations (remote office, branch office, primary data center and DR site), deduplication is a critical aspect of the distributed backup process. The next step is to establish a deduplication standard and avoid having multiple deduplication operations in place. Part of this process is to understand where in the data path you want deduplication to occur. Choices include the application source, the backup server or the target storage. It is important not to fall into the trap of a one-size-fits-all approach to deduplication. Each of these deduplication locations has an advantage depending on the application, the data location or the device it is stored on.
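
Wherever it runs, a deduplication process does essentially the same work: split the data into chunks, fingerprint each chunk, and store only the chunks it has not seen before. The sketch below uses fixed-size chunks and SHA-256 hashes for brevity; real products typically use variable, content-defined chunking, so this is only an illustration of the principle.

```python
# Simplified illustration of deduplication: chunk, hash, store unique chunks only.
import hashlib

CHUNK_SIZE = 4096
store = {}  # hash -> chunk: the deduplicated chunk store

def dedupe(data: bytes) -> list:
    """Return the list of chunk hashes (the "recipe") that represents this data."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:   # only previously unseen chunks consume capacity
            store[digest] = chunk
        recipe.append(digest)
    return recipe

dedupe(b"A" * 8192 + b"B" * 4096)   # first backup: 3 chunks, 2 unique
dedupe(b"A" * 8192 + b"C" * 4096)   # second backup shares the "A" chunks
print(len(store), "unique chunks stored for 6 chunks of backup data")  # prints 3
```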

The first aspect of establishing a deduplication standard is to make sure it has the global capabilities required for distributed data movement. Some technologies may be able to deduplicate data within the remote location but cannot share deduplication information with the central location or DR site, and as a result they lose all of their bandwidth efficiency.

It is very common for the central location to already have a copy of the data in the branch office. For example, a PowerPoint file created at the remote office may have already been emailed to someone in the central office during the day, or it may have been uploaded to a SharePoint server. In either case, the central location already has the data and has backed it up. As a result, there is no need for the backup storage system in the remote location to retransmit it, but few deduplication technologies have this level of data awareness.
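
The bandwidth saving this kind of data awareness enables can be sketched as a simple hash check before transfer: the remote site asks the central site whether it already holds a chunk's fingerprint and only ships the chunk if it does not. The class and method names below are invented for illustration and are not the StoreOnce protocol.

```python
# Hedged sketch: check the central store for a chunk's hash before sending it.
import hashlib

class CentralStore:
    def __init__(self):
        self.chunks = {}  # hash -> chunk

    def has(self, digest: str) -> bool:
        return digest in self.chunks

    def put(self, digest: str, chunk: bytes) -> None:
        self.chunks[digest] = chunk

def replicate_chunk(chunk: bytes, central: CentralStore) -> int:
    """Return the number of data bytes actually sent across the WAN."""
    digest = hashlib.sha256(chunk).hexdigest()
    if central.has(digest):      # e.g. the PowerPoint already arrived by email
        return 0                 # only the small hash query crossed the WAN
    central.put(digest, chunk)
    return len(chunk)

central = CentralStore()
presentation = b"quarterly results deck" * 100
central.put(hashlib.sha256(presentation).hexdigest(), presentation)  # already backed up centrally
print(replicate_chunk(presentation, central), "bytes sent")          # prints 0
```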

The second aspect of establishing a deduplication standard is the ability to deploy the technology flexibly. In other words, it has to be modular. A problem with some deduplication technologies is that they cannot be deployed in the variety of configurations required by distributed backup.

While some companies claim to offer a distributed solution, it takes them two or more incompatible technologies to actually accomplish that objective. As mentioned above, the remote office is just one example of data distribution. There are also smaller offices that can't justify a dedicated backup appliance and, of course, laptops and other endpoints. To be effective and truly global, the deduplication algorithm needs to be flexible and highly modular.

Deduplication Sprawl

The lack of modularity has led to deduplication sprawl. Backup storage appliances typically do not fit every use case, so multiple appliances have to be purchased for different reasons. In addition, backup software vendors have started to implement deduplication within their applications, but these applications can't synchronize with each other or with the backup hardware, so deduplication effectiveness is siloed within those applications.

The net effect is deduplication sprawl. This complicates an already complex problem and results in sub-optimal utilization of bandwidth. It also leads to some distributed data centers having many different backup solutions, all performing independent deduplication, resulting in redundant data constantly being sent across the WAN and consuming more storage capacity than would otherwise be necessary.

The greatest impact of deduplication sprawl may be on the IT staff. Each of these silos needs to be learned, managed and monitored, which takes time, something the IT staff does not have to spare.

Deduplication Modularization

What’s needed instead are modularized deduplication solutions like HP’s StoreOnce deduplication technology. This allows a common deduplication algorithm to be installed within the backup application, such as HP Data Protector, as well as on physical and virtual backup storage appliances.

A common deduplication platform allows deduplication metadata to be globally available. Deduplication metadata is the data that tracks which data sub-sets have already been stored by each deduplication process. Sharing this metadata allows the backup application and appliances to maintain an awareness of each other and avoid sending redundant data over the network connection. This results in more efficient replication of data between the various distribution points and the main data center, as well as to the disaster recovery location. It also allows for greater overall storage efficiency, which reduces storage capacity requirements and costs.
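
Conceptually, shared deduplication metadata behaves like a global index of chunk fingerprints that every participant consults and updates. The sketch below, using assumed names rather than any HP StoreOnce API, shows how a backup application and two appliances could skip transfers for data any one of them has already protected.

```python
# Minimal sketch of shared deduplication metadata: a global index of chunk
# hashes consulted by every node (backup application, physical or virtual
# appliance). Names are illustrative only.
import hashlib

global_index = set()   # the shared deduplication metadata

class DedupeNode:
    def __init__(self, name: str):
        self.name = name

    def protect(self, data: bytes) -> None:
        digest = hashlib.sha256(data).hexdigest()
        if digest in global_index:
            print(f"{self.name}: chunk already known globally, transfer skipped")
            return
        global_index.add(digest)   # advertise the chunk to every other node
        print(f"{self.name}: sent {len(data)} bytes")

payload = b"same database page" * 512
DedupeNode("backup application (source dedupe)").protect(payload)
DedupeNode("virtual appliance at branch office").protect(payload)   # skipped
DedupeNode("physical appliance at DR site").protect(payload)        # skipped
```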

The appliances could be used initially to support disparate applications and the unique use cases that appear in the distributed enterprise. This could be a first step: centralizing and optimizing the storage of protected data to minimize the amount of data transferred between instances. It should dramatically reduce the cost of capacity allocated to the backup process, especially in the primary data center.

Efficiently centralizing the physical storage of backup data should also provide some management improvements. While not as effective as a single data protection application for the enterprise, having all the data within a single deduplication infrastructure does have benefits.

Application Centralization

While 100% attainment is unlikely, the more the data center can centralize on a single data protection application, the better. But if this centralization point can't take advantage of the foundation laid by the above backup storage consolidation, then that entire effort is lost.

One of the advantages of HP Data Protector is its ability to leverage HP StoreOnce technology. It can either use the existing appliances or run the StoreOnce intelligence integrated into its own code. This flexible deployment allows the deduplication work to be implemented where it makes the most sense, all while leveraging the initial groundwork accomplished during backup appliance consolidation.
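
One way to think about "implemented where it makes the most sense" is as a per-workload policy that records whether deduplication should run at the source or at the target. The option names below are invented for illustration and do not correspond to actual HP Data Protector settings.

```python
# Hypothetical per-workload policy for where deduplication runs.
DEDUPE_POLICIES = {
    "branch-file-server": "source",     # thin WAN link: deduplicate before transfer
    "data-center-databases": "target",  # fast LAN: offload the work to the appliance
    "laptops": "source",                # intermittent connectivity: send as little as possible
}

def dedupe_location(workload: str) -> str:
    # Default to target-side deduplication for anything not explicitly listed.
    return DEDUPE_POLICIES.get(workload, "target")

for workload in ("branch-file-server", "data-center-databases", "laptops", "new-app"):
    print(workload, "->", dedupe_location(workload))
```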

This strategy also accommodates the inevitable unique use cases that are sure to appear in the distributed enterprise, such as an application-specific backup deployed because of an application requirement or user preference. That application's data can still be backed up to one of the deduplication appliances and fully leverage the global deduplication intelligence.

Conclusion

The first step in solving the distributed backup challenge is to eliminate the redundant use of bandwidth and storage capacity by implementing a granularly scalable deduplication technology. But this first step should be compatible with the next step: application consolidation. Application consolidation, while a more encompassing task, brings added backup management savings, but it needs to leverage, not replace, a modular deduplication strategy.

HP is a client of Storage Switzerland

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
