The Big Data Archive

Posted on October 10, 2011 by George Crump

“Big Data” is often thought of as a specialized use case involving machine generated data, typically associated with web search logs, satellite imagery or other sensor data, on which analytics are performed to enable some sort of decision support application. While this is an important example of a Big Data project, another is the collection of human generated data that also needs to be retained, organized and made readily available for data mining or compliance reasons. An archive is needed to store both machine and human generated data, and a Big Data storage system can be the ideal choice.

Human generated data is essentially file data created from office productivity applications. These files are the contracts, designs, proposals, video, audio, images and analytical summary data that drive the organization. Also included in this category would be data files generated by multi-media tools such as video camcorders, mobile phone cameras, notebook PC microphones, etc. Just like machine generated data, this data has value and a potentially higher compliance requirement, at least from a litigation perspective. But unlike machine generated data, which often goes straight to low-cost archival storage, this human generated file data is frequently stored on expensive, high performance primary storage over its full lifespan.

At the point of creation and during early modification these human generated files can justify being stored on the more expensive primary storage location where rapid access is not only essential but expected. Over time though, most file based data rapidly loses its need for immediacy and could be more appropriately stored on something cost effective but maybe not as responsive as primary storage. However, unlike old copies of a database, which need to remain somewhat accessible for either mining and compliance or just to ease the minds of users, these older files can be put onto a secondary tier of storage.

This secondary storage area is an ideal use case for a disk archive tier, something designed specifically to store this type of data cost effectively. Again, this data will be retained because the organization “has” to, but also because it “wants” to. Companies will mine this data to provide insight to support future decisions. This mining requirement means that archiving alone is not enough for the organization. It needs all the capabilities of a Big Data storage infrastructure, but in a more capacity-centric form.

The Value of a Big Data Archive

A Big Data Archive brings three specific value areas to the enterprise. First, similar to a classic archive, it should allow for the reduction of primary storage consumption and support growth. According to studies, well over 80% of file data on most primary storage systems is not in active use and is therefore wasting this high cost resource’s performance abilities. If this data were moved to a secondary, high capacity storage area, but still one with moderate performance, most additional file access could be done without impact to the users. This could have a significant, positive impact on the IT budget.
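As a rough illustration of how that inactive share might be measured on a given file system, the following Python sketch walks a directory tree and reports what fraction of the stored bytes has not been accessed within a chosen window. The function name inactive_fraction, the 180-day default and the reliance on file access times are assumptions made for this example; they are not taken from any particular product, and access times are not updated reliably on every file system.

```python
import os
import time


def inactive_fraction(root, max_idle_days=180, now=None):
    """Return the fraction of bytes under `root` not accessed in `max_idle_days`.

    Hypothetical helper: it relies on st_atime, which may be stale on volumes
    mounted with options such as noatime or relatime.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    total = idle = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
            if st.st_atime < cutoff:
                idle += st.st_size
    return idle / total if total else 0.0
```

A result close to 0.8 on a primary file share would be in line with the studies mentioned above, and the idle files would be the natural candidates for a move to the archive tier.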

Secondary disk tiers have been available for years, as have software products to classify and move that data. The cost savings on primary storage alone motivated many users to move to a two-tier storage infrastructure. But many other data centers were not so inclined. Big Data Archive brings three more motivation points to the equation that should encourage all data centers to adopt this multi-tier approach to storage.

The first point is that organizations are beginning to understand the value of this data and to acknowledge there’s a real desire to retain, categorize and, in the future, mine this information to help make better business decisions or speed product development. They are coming to realize that archiving makes practical sense in the data center and that its shortcomings are being eliminated by Big Data storage architectures.

The second motivational point is the need for compliance. Organizations and litigators are beginning to understand that retention is more than just making sure email is saved or that it can be found (discovered). Retention means keeping all the files that exist in relation to a case as well. In the past this meant providing boxes of paper documents. Today most documents are digital and are never printed. Retaining electronic documents is not only important; the electronic copy may be the only evidence of that information that can be retained.

Finally, a Big Data Archive is complementary to, and may even be part of, what was previously considered a separate project. This makes the cost to add a Big Data Archive to a current big data project minimal, or may allow it to be the foundational component in a future big data project. In short, by leveraging both initiatives, costs can be contained and ROI can be realized sooner.

As a result a Big Data Archive has unique requirements that a simple second tier of storage, or even a basic archive solution, typically cannot meet. Whether from machine or human generated data sources, a Big Data Archive must match the compliance capabilities of disk archiving while meeting requirements like dense scaling, high throughput and fast retrieval.

Requirements For The Big Data Archive

Density Scaling

Legacy second tier disk systems and even archive systems both have scaling issues when measured against the Big Data Archive challenge. The requirement to scale to Petabytes is now the starting point for many of these systems. This quickly eliminates single box architectures.

Even legacy scale out storage architectures may not be suitable for the Big Data Archive challenge. These systems were designed to add nodes rapidly and as a result their capacity per node is limited and they quickly consume available data center floor space. The modern Big Data Archive will need a very dense architecture to maximize capacity on a per node basis and not waste that floor space. In these environments storage (disk drives) has practically become less expensive than the sheet metal (the other components in each node) that surrounds them. This makes it critical to use each node to its full potential before adding another.
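The density argument comes down to simple arithmetic. The sketch below compares how many nodes and rack units two node designs would need to reach one petabyte of raw capacity. The drive counts, drive sizes and rack heights are entirely hypothetical and do not describe any real product; protection overhead is ignored.

```python
import math


def nodes_for_capacity(target_tb, drives_per_node, tb_per_drive):
    """Nodes needed to reach `target_tb` of raw capacity (no protection overhead)."""
    return math.ceil(target_tb / (drives_per_node * tb_per_drive))


# Hypothetical node designs: (drives per node, TB per drive, rack units per node)
designs = {
    "low-density": (12, 2, 2),
    "high-density": (36, 2, 4),
}

for label, (drives, tb, rack_units) in designs.items():
    n = nodes_for_capacity(1000, drives, tb)  # 1 PB taken as 1,000 TB
    print(f"{label}: {n} nodes, {n * rack_units} rack units")
```

With these made-up figures the denser design reaches the same capacity with a third of the nodes (14 instead of 42) and less rack space, so far less of the budget goes to the components that surround the drives.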

High Throughput

Big Data Archives must also have the ability to ingest large amounts of data quickly. Legacy archive solutions were designed to have data trickle into them over the course of time. Big Data Archives may store very large numbers of different sized files on an ongoing basis. There can be millions of small files being archived from a traditional Big Data project, or relatively few, very large rich media files being archived from user projects.

In both cases the ingestion of these files requires that the receiving nodes encode the data and then distribute the resulting segments to the other nodes in the cluster. This background work could cripple legacy archive solutions whose nodes are typically interconnected via a 1GbE infrastructure. Instead, a higher speed backbone is required so that additional throughput can be maintained. Solutions like Isilon’s NL Scale Out NAS connect via an internal InfiniBand backbone for very high throughput performance, enabling them to sustain ingest rates that match the requirements of a Big Data Archive.
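To get a feel for why the interconnect matters, the following Python sketch estimates how long it would take to move a given amount of data across a single link at line rate. It is a best-case, back-of-the-envelope calculation rather than a model of any real cluster: the function name transfer_hours, the efficiency factor and the 10 Gb/s and 20 Gb/s comparison speeds are assumptions chosen for illustration.

```python
def transfer_hours(data_tb, link_gbps, efficiency=1.0):
    """Hours to move `data_tb` terabytes over a link of `link_gbps` gigabits/s.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gb = 10**9 bits).
    `efficiency` is a hypothetical factor for protocol and encoding overhead.
    """
    bits = data_tb * 10**12 * 8
    seconds = bits / (link_gbps * 10**9 * efficiency)
    return seconds / 3600


for gbps in (1, 10, 20):  # 1GbE versus two hypothetical faster backbones
    print(f"{gbps:>2} Gb/s: {transfer_hours(100, gbps):.1f} hours for 100 TB")
```

Even at full line rate, 100 TB needs more than nine days to cross a single 1 Gb/s link, before any of the encoding and redistribution traffic between nodes is counted.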

Fast Retrieval

Retrieval is also different for the Big Data Archive than it is for the traditional archive storage system. It may need to produce thousands or millions of files very quickly or in some cases it may be desirable to actually perform the search and analysis on the Big Data Archive itself.

Traditional archive architectures and legacy second tier storage systems are typically found lacking when asked to provide data quickly as the capacity scales over 1 PB. It’s important to remember that archive systems were designed to provide performance better than the platform they were replacing, which for most was optical disk.

Big Data Archives operate against a different standard. They need to provide consistent performance that’s comparable to most primary storage systems, no matter what the capacity level. Again, Isilon’s NL series surpasses this expectation and provides near primary storage performance but with the throughput and density that Big Data Archiving requires.

Protection & Disaster Recovery

Protecting 1PB+ environments requires a change in thinking. Nightly backups are no longer realistic, not only because of the size of the archive but also because of the amount of data that can be ingested at any time. If a large archive job is submitted, and then later a catastrophic failure occurs, a significant amount of data could be permanently lost. For example, in the case of machine generated sensor data, there may be no way to ever recover it.

Data protection needs to be integrated into the Big Data Archive and then augmented. First, the system should have no single points of failure and the users should be able to set the data protection level by data type. This would accommodate unrecoverable data, like point-in-time sensor data, which might need a higher level of redundancy than traditional file data.
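Setting protection by data type can be pictured as a simple policy table. The Python sketch below is only an illustration of the idea: the data type names, the failures_tolerated levels and the policy_for helper are all hypothetical and do not correspond to the configuration interface of any specific storage system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionPolicy:
    failures_tolerated: int   # simultaneous drive or node failures the data must survive
    replicate_offsite: bool   # whether the data is also sent to a second location


# Hypothetical data types and levels, for illustration only.
POLICIES = {
    "sensor": ProtectionPolicy(failures_tolerated=3, replicate_offsite=True),       # cannot be regenerated
    "office_file": ProtectionPolicy(failures_tolerated=2, replicate_offsite=True),
    "derived_report": ProtectionPolicy(failures_tolerated=1, replicate_offsite=False),  # can be rebuilt
}
DEFAULT_POLICY = ProtectionPolicy(failures_tolerated=2, replicate_offsite=True)


def policy_for(data_type):
    """Look up the protection policy for a data type, falling back to a default."""
    return POLICIES.get(data_type, DEFAULT_POLICY)
```

The point of the table is that unrecoverable data gets the strongest setting by default, while data that can be regenerated does not consume redundancy it does not need.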

Next, the data needs to be transferred in real time to a second location via built-in replication tools. That data again needs to be prioritized based on whether it can be replaced.
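One way to picture that prioritization is a queue that always hands the replication process irreplaceable data first. The Python sketch below uses the standard heapq module; the ReplicationQueue class, its method names and the two priority levels are assumptions made for this example, not a description of how any vendor's replication tool works.

```python
import heapq
import itertools


class ReplicationQueue:
    """Orders pending replication jobs so irreplaceable data is sent first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def add(self, path, replaceable):
        priority = 1 if replaceable else 0  # lower number is replicated sooner
        heapq.heappush(self._heap, (priority, next(self._counter), path))

    def next_job(self):
        """Return the next path to replicate, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

If a contract document and a sensor capture are queued in that order, with only the contract marked replaceable, next_job returns the sensor capture first.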

Finally, there are always some organizations that will want to move to an alternate device altogether, even tape, in case of a regional disaster. The Big Data Archive should have the ability to add “copy out” performance when needed. As an example, Isilon can add a class of nodes called “backup accelerators” to its cluster that are specifically designed to move data to another device. This allows the other nodes to continue to deliver high throughput and fast retrieval while the cluster gets its data copied to alternate storage devices.

Summary

The Big Data Archive can be a component of a larger Big Data project or it can be an archive designed specifically for Big Data. In either case leveraging that investment to also include human generated data that needs to be stored for mining or compliance reasons is an excellent way to achieve a greater ROI on the Big Data project. It can also help discover new ways to make better decisions by retaining and analyzing existing information.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
