What is Copy Data?

Posted on October 15, 2014 by George Crump

Copy Data is the term used to describe the copies of primary data made for data protection, testing, archives, eDiscovery and analytics. The typical focus of copy data is data protection: recovering data when something goes wrong. The problem is that each type of recovery requires a different copy of data. Recovery from corruption requires snapshots. Recovery from server failure requires disk backup. Protection from disk backup failure requires tape. Finally, recovery from a site disaster requires that all these copies be off-site. Add to the data protection copy problem all the copies being made for test/development, archives, eDiscovery and now analytics. The end result: copy data is about much more than data protection, and providing the capacity to store and manage all these copies has become a significant challenge for the data center.

If the demand for primary storage capacity is growing, copy data capacity demands are exploding. Storage Switzerland believes the “copy” data set is outpacing primary storage by as much as 20X. The challenge for the IT planner is how to get a handle on this copy data. They have to make sure that the costs of the copy data infrastructure do not outpace the costs of primary storage.

Where does Copy Data come from?

Data centers have always had to manage copy data. In the past when tape was the backup for all data, off-site protection required a copy of those tapes. This produced an immediate 2X growth in copy data. The tape backup also stored many versions of these backups for some period of time. As a result, the tape capacity required was close to 4X primary storage. Also, most backup administrators run full backups once a week, once a month, or even once a quarter. As a result, the 4X could grow to 24X for a two-year retention period. The only saving grace for tape was that its cost per GB was, and still is, very low compared to disk.
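The tape multipliers above come down to simple arithmetic. The sketch below is one possible reading, not a stated method: it assumes each retained full backup is roughly the size of primary storage with no compression, and that a monthly full schedule kept for two years is what produces 24 copies. The function names and parameters are illustrative.

```python
def retained_full_copies(fulls_per_year: int, retention_years: float) -> int:
    """Number of full backups held at once (assumption: each is ~1X primary)."""
    return round(fulls_per_year * retention_years)


def tape_multiplier(fulls_per_year: int, retention_years: float,
                    offsite_copy: bool = False) -> int:
    """Tape capacity as a multiple of primary storage.

    Assumption: duplicating every tape for off-site storage doubles the total.
    """
    copies = retained_full_copies(fulls_per_year, retention_years)
    return copies * 2 if offsite_copy else copies


# Hypothetical schedule: one full backup per month, kept for two years.
monthly_two_years = tape_multiplier(fulls_per_year=12, retention_years=2)
print(f"Monthly fulls, two-year retention: {monthly_two_years}X primary storage")
```

Changing `fulls_per_year` to 52 or 4 shows how strongly the full backup schedule drives the tape footprint under these assumptions.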

Next, organizations began to look at disk to support faster application recovery. The first forms of this were synchronous disk mirroring and asynchronous data replication. An application could use these copies for rapid recovery with minimal data loss, but the downside was that they required expensive primary storage. This type of copy represented 4X growth, and 6X growth when replicated off-site for disaster recovery.

Data centers then started to look at disk-based backup to improve data protection performance. At first these were raw disk systems with no data efficiency capabilities. These systems then added technologies like deduplication to reduce costs, but they still had to start with a baseline of disk capacity, which typically adds another 2X. Unlike tape, deduplication does reduce full backup redundancy, but it still stores the unique data between those backups. As a result, deduplicated backup copy data can be as much as 10X the size of primary storage. If replicated for DR, that jumps to 20X.

Recovery requirements and demands only continued to increase. Even with disk backup, transferring that data over the network took too long. The next step was for IT planners to move to storage systems with snapshot capabilities, which became broadly available at about the same time that backup disk with deduplication came to market.

Snapshots allow many point-in-time views of primary storage. When first taken, a snapshot should not immediately require a 2X storage increase like a mirrored copy does. Over time, however, as the active data set changes and the storage system has to maintain the older view, data growth occurs. As a result, the storage system has to maintain two or more copies of any active data set. Even so, this copy method is one of the most efficient available to the data center.
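To make the snapshot behavior concrete, here is a minimal sketch of copy-on-write, one common way snapshots are implemented. The article does not say which technique any particular storage system uses, so treat this as an illustration only. The `Volume` class and its method names are hypothetical.

```python
from typing import Optional


class Volume:
    """Toy block volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks: dict[int, bytes]):
        self.blocks = dict(blocks)   # live data: block number -> contents
        self.snapshots: list[dict[int, Optional[bytes]]] = []  # old blocks saved per snapshot

    def snapshot(self) -> int:
        """Take a point-in-time view. Nothing is copied at this moment."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block_no: int, data: bytes) -> None:
        """Overwrite a block, first saving its old contents for the newest snapshot."""
        if self.snapshots:
            newest = self.snapshots[-1]
            if block_no not in newest:
                # None records that the block did not exist when the snapshot was taken.
                newest[block_no] = self.blocks.get(block_no)
        self.blocks[block_no] = data

    def read_snapshot(self, snap_id: int, block_no: int) -> Optional[bytes]:
        """Read a block as it looked when snapshot snap_id was taken."""
        for saved in self.snapshots[snap_id:]:
            if block_no in saved:
                return saved[block_no]
        return self.blocks.get(block_no)  # unchanged since the snapshot

    def snapshot_blocks_used(self) -> int:
        """Extra blocks consumed to preserve older views."""
        return sum(1 for s in self.snapshots for v in s.values() if v is not None)


vol = Volume({0: b"alpha", 1: b"beta"})
snap = vol.snapshot()
print("extra blocks right after snapshot:", vol.snapshot_blocks_used())
vol.write(0, b"ALPHA")
print("extra blocks after one overwrite:", vol.snapshot_blocks_used())
print("snapshot still sees:", vol.read_snapshot(snap, 0))
```

In this sketch the snapshot costs nothing until a block is overwritten, and only changed blocks consume extra space, which is why growth tracks the change rate and the number of snapshots retained.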

In general, the more snapshots that are created and the longer they are retained, the more primary storage capacity will be consumed. To make matters worse, some storage systems also require that the storage administrator hard provision a snapshot reserve, which can be as much as 30% of total production capacity. Despite the efficiencies of snapshots, IT planners should assume a 1.5X capacity requirement, above the total capacity of the system, for snapshot data space.

Copy Data Caused By More Than Just Data Protection

The above examples revolve around the data protection process. But processes other than data protection create copy data. A long-time creator of copy data is the test/development (test/dev) process. It needs near real-time copies of production data: the closer the data is to real-time production, the higher the quality of the test results. It is common for test/dev to count on several copies of production data so that various iterations can be verified. IT planners should assume that test/dev data accounts for as much as 6X the size of primary storage.

The other non-data protection process that creates a need for copy data is data analytics. Although many organizations are only beginning to experiment with analytic processing, it is rapidly becoming a growing source of copy data. It is hard to quantify just how much analytics will cause the copy data set to grow. However, it is safe to assume that analytics may require at least another 2X added to the production data set.
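A planner can turn the multipliers quoted in this article into a rough capacity estimate. The sketch below is a back-of-the-envelope tally, not a sizing tool: the figures are the article's upper-bound estimates, they are not strictly additive in practice, and the 100 TB primary capacity and the choice of categories are hypothetical.

```python
# Upper-bound multipliers of primary storage, as quoted in this article.
COPY_DATA_MULTIPLIERS = {
    "tape backup (two-year retention)": 24.0,
    "mirroring and replication (with off-site)": 6.0,
    "deduplicated disk backup (with DR replica)": 20.0,
    "snapshots": 1.5,
    "test/dev": 6.0,
    "analytics": 2.0,
}


def copy_data_tb(primary_tb: float, categories: list[str]) -> dict[str, float]:
    """Estimated copy data capacity in TB for each selected category."""
    return {name: primary_tb * COPY_DATA_MULTIPLIERS[name] for name in categories}


# Hypothetical environment: 100 TB of primary storage using three copy types.
in_use = ["snapshots", "test/dev", "analytics"]
estimate = copy_data_tb(primary_tb=100, categories=in_use)

for name, tb in estimate.items():
    print(f"{name:>10}: {tb:7.1f} TB")
print(f"{'total':>10}: {sum(estimate.values()):7.1f} TB")
```

Even this partial selection puts copy data at several times the primary capacity, which is the point the multipliers are making.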

The Cost of Copy Data

The cost of maintaining copy data can be significant. First, there is the obvious cost of just acquiring the raw capacity. But this is just the tip of a titanic-sized iceberg. The bigger problem is the management of these processes. Most of these copies are on different storage silos that are all managed by different processes.

Just from the examples above, copy data could use at least six different storage systems. It can also mean six different processes to create that data. With so many processes running in parallel, the chance of human error rises significantly. Each of these separate processes typically needs a direct interface with the primary data set to make its copies, and this can lead to a loss in application performance and potentially even data corruption.

Insight is another big challenge. All these processes count on working with the most recent copy of production data. How can systems managers be assured that they are always getting the most recent image? Also, how does the storage management team know that the test/dev team will prune excess copies when their work is complete?

Copy Data Management is Copy Data Convergence

An investment in copy data management can have a big impact on the cost of storage in the data center. And unlike other storage management processes, it does so with no impact or change to production storage. A copy data management solution converges all of the above copies into a single tree that can have many branches.

Companies like Catalogic Software are providing solutions that build on the efficiency of snapshots, but do so across storage systems. Essentially, they ‘catalog’ all files, snapshots, vaults and mirrors, whether they are on local, remote or cloud storage. This helps eliminate a key vulnerability of snapshot technology: the outright failure of the storage system and the resulting loss of access to snapshot copies. In addition, copy management solutions add in concepts from backup to provide data cataloging, which allows for rapid data search and retrieval. This second copy, updated in near real time, can then be used to capture a snapshot and/or be replicated many times over to feed the various copy data management processes outlined above. Another unique attribute of a copy data management system is its ability to allow these secondary snapshots to be writable, which is a requirement for test/dev or analytic environments.
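The idea of a single cataloged tree with writable branches can be sketched as a small data structure. This is purely illustrative and is not Catalogic's API or any vendor's implementation: `CopyRecord`, `CopyCatalog`, the field names and the sample copy IDs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CopyRecord:
    copy_id: str
    source: str                   # production data set this copy derives from
    kind: str                     # e.g. "snapshot", "vault", "mirror", "clone"
    location: str                 # e.g. "local", "remote", "cloud"
    created: datetime
    parent: Optional[str] = None  # copy this one was branched from, if any
    writable: bool = False


class CopyCatalog:
    """Tracks every copy of a data set, wherever it is stored."""

    def __init__(self) -> None:
        self._records: dict[str, CopyRecord] = {}

    def register(self, record: CopyRecord) -> None:
        self._records[record.copy_id] = record

    def latest(self, source: str) -> Optional[CopyRecord]:
        """Most recent read-only copy of a production data set."""
        candidates = [r for r in self._records.values()
                      if r.source == source and not r.writable]
        return max(candidates, key=lambda r: r.created, default=None)

    def branch(self, parent_id: str, copy_id: str, created: datetime) -> CopyRecord:
        """Record a writable clone of an existing copy, e.g. for test/dev."""
        parent = self._records[parent_id]
        clone = CopyRecord(copy_id, parent.source, "clone", parent.location,
                           created, parent=parent_id, writable=True)
        self.register(clone)
        return clone

    def branches_of(self, parent_id: str) -> list[CopyRecord]:
        """Copies branched from a given parent, useful for pruning stale ones."""
        return [r for r in self._records.values() if r.parent == parent_id]


catalog = CopyCatalog()
catalog.register(CopyRecord("snap-001", "erp-db", "snapshot", "local", datetime(2014, 10, 1)))
catalog.register(CopyRecord("snap-002", "erp-db", "snapshot", "cloud", datetime(2014, 10, 8)))

newest = catalog.latest("erp-db")
dev_copy = catalog.branch(newest.copy_id, "dev-001", datetime(2014, 10, 9))
print(newest.copy_id, [r.copy_id for r in catalog.branches_of(newest.copy_id)])
```

With one catalog in place, the two insight questions raised earlier have a single place to be answered: `latest` identifies the most recent image, and `branches_of` shows which test/dev copies still exist and may need pruning.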

The result is a secured storage system that leverages the efficiencies of snapshots to feed all the processes that require access to copy data. A copy data management solution could lead to a dramatic reduction of secondary data and improve operational efficiencies.

Summary

An increasing number of processes in the data center use copy data, leading to unabated data growth. The cost to procure and maintain copy data storage is out of control. Moreover, the current copy data “infrastructure” is often just a hodge-podge of technology thrown at the problem in hopes it will stem the tide. In reality, it makes matters worse.

It’s time for IT to take a step back and consider a holistic solution to the problem: one where a single process interfaces with production storage and then feeds all the consumers of copy data. Doing so will create a more stable production environment and a more cost-effective secondary environment that provides higher quality data.

Sponsored by Catalogic Software

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
