Hadoop is an open source software framework from the Apache Software Foundation that uses a distributed compute infrastructure to handle large batch analytics jobs on very large data sets. It does this by breaking those jobs down into a number of smaller component tasks, each of which can be processed simultaneously by separate compute engines. While Hadoop brings new capabilities to large-scale data analytics, the cost of storage has the potential to overshadow those gains.
How Hadoop works
Hadoop runs on clusters of compute and storage modules or nodes, typically using low-cost x86 hardware with internal storage capacity. Instead of consolidating processing power and running data from a large, shared repository with a centralized compute engine, Hadoop ‘brings the compute to the data’ by locating storage and computing power together in each Hadoop cluster node. This eliminates the problems created by storage latency during analysis and best supports a distributed, parallel processing topology.
Where Hadoop is used
Applications of this technology are common in very large batch processing environments like consumer behavior data mining or time-based research projects, where the effectiveness of the analysis can be governed by the sizes of the data sets analyzed. These use cases typically follow a ‘capture-store-analyze’ workflow where data must be processed into an appropriate format and then stored, often in a data warehouse, until it’s analyzed.
In addition to analysis, Hadoop is also being used for Extract-Transform-Load (ETL) applications, the pre-processing of raw data for these data warehouses that support traditional relational database analytics. In these use cases the captured data formats can vary widely, as do the analytics, often requiring a different transformation step for each data source. Hadoop’s distributed architecture and flexibility support the dynamic nature of ETL processes that may be different for each data source.
Hadoop uses the MapReduce framework to parse (Map) large jobs into smaller sub-jobs that can be processed in parallel on distributed nodes and then recombined (Reduce) into a single output for the job. MapReduce is managed by a master node in the cluster called the “JobTracker”. As a distributed architecture, Hadoop requires a distributed file system, the “Hadoop Distributed File System” (HDFS), which is managed by another master node called the “NameNode”.
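The Map and Reduce phases can be illustrated with the classic word-count example. This is a single-process sketch of the pattern only, not the actual Hadoop API: in a real cluster, each chunk would live on a different node and the grouping ("shuffle") would happen over the network.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: each node turns its local chunk of text into (word, 1) pairs.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(mapped_pairs):
    # Reduce: pairs are grouped by key and their counts summed,
    # recombining the partial results into a single output.
    totals = defaultdict(int)
    for word, count in mapped_pairs:
        totals[word] += count
    return dict(totals)

# Three chunks standing in for data blocks on three cluster nodes.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
counts = reduce_phase(mapped)
print(counts["the"])  # 3
```

Because each Map call touches only its own chunk, the work parallelizes naturally; only the final Reduce needs to see results from all nodes.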
Hadoop clusters are designed to be fault-tolerant and need some form of data resiliency, so HDFS by default creates two additional copies of each data block. This 3-copy scheme means that, essentially, the user must store 3x as much data as they need to process. Said another way, only 1/3 of the data they’re required to store in a Hadoop cluster is of interest; the rest is redundant. This can cause significant issues in Hadoop environments.
Cost of capacity
Most Hadoop clusters handle very large data sets, a situation that amplifies the overhead created by this 3x redundancy. The obvious impact is on storage costs, which can become enormous at the petabyte scale common in many Hadoop environments. Three copies of 3TB is only 9TB, but three copies of 300TB is 900TB! And there’s an opportunity cost here as well.
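The arithmetic behind the 3x overhead is straightforward and can be captured in a few lines. A minimal sketch (the function names are illustrative, not from any Hadoop tool):

```python
def raw_capacity_needed(usable_tb, copies=3):
    """Raw storage required when every byte is stored `copies` times.
    HDFS defaults to 3 copies of each block."""
    return usable_tb * copies

def usable_fraction(copies=3):
    """Fraction of cluster capacity holding unique data (1/3 for HDFS)."""
    return 1 / copies

# The article's examples under HDFS 3-copy replication:
print(raw_capacity_needed(3))    # 9 (TB)
print(raw_capacity_needed(300))  # 900 (TB)
print(usable_fraction())         # ~0.33 of the cluster holds unique data
```

The point is that the overhead scales linearly with the data set: every additional terabyte of data to analyze costs three terabytes of raw capacity.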
Users don’t have unlimited resources, and this storage cost can limit how large their Hadoop clusters can be and, ultimately, the sizes of the data sets they can handle. Due to this inefficiency, users often have to ‘pick and choose’ which data they process, a limitation that can hurt the accuracy of their analyses, since in many Hadoop use cases the goal is to process more data, not less. The more data that can be processed, the more accurate the resulting decisions can be. Object storage systems that use a data protection method called “information dispersal” can offer a solution to the data redundancy problem created by HDFS.
Replication is the standard method storage systems use to provide resiliency, but it’s not the only way to ensure that storage component and subsystem failures don’t put data at risk. Some object storage systems, like Cleversafe’s dsNet, can provide even better resiliency than replication, without the overhead of two extra copies. These systems use information dispersal, which combines erasure coding (forward error correction) with the physical distribution, or ‘dispersion’, of data.
They first split a data object into multiple component blocks. Then, somewhat like a parity calculation, each block is expanded with additional information to create a more resilient superset of data. Finally, the coded data is divided into multiple segments that are written to different physical nodes, typically in a scale-out storage architecture.
Using a mathematical algorithm, operators can then recreate the original data set from fewer than the original number of segments. Compared with traditional disk arrays using RAID, which can survive the loss of only one or two disk drives, information dispersal can tolerate many more concurrent failures within a single storage system. And it does so with far less than the 300% raw capacity consumed by HDFS. In fact, object storage systems like this typically require only 50-60% capacity overhead.
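The principle can be sketched with the simplest possible erasure code: single XOR parity. This toy example splits data into four blocks plus one parity segment and rebuilds any one lost segment from the survivors. Production dispersal codes (such as the Reed-Solomon family used in real systems) generate several coded segments and tolerate multiple simultaneous losses; this sketch only shows why losing a segment doesn't lose data.

```python
from functools import reduce

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split data into k equal blocks and append one XOR parity segment.
    Overhead here is 1/k (25% for k=4), versus 200% for three full copies."""
    if len(data) % k:
        data += b"\x00" * (k - len(data) % k)  # pad to a multiple of k
    size = len(data) // k
    blocks = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(_xor, blocks)
    return blocks + [parity]  # k+1 segments, written to k+1 nodes

def recover(segments, lost_index):
    """Rebuild one missing segment by XOR-ing all the survivors."""
    survivors = [s for i, s in enumerate(segments) if i != lost_index]
    return reduce(_xor, survivors)

segments = encode(b"hadoop object store", k=4)
assert recover(segments, 2) == segments[2]  # a lost data block is rebuilt
assert recover(segments, 4) == segments[4]  # the parity segment is, too
```

A real k-of-n dispersal scheme generalizes this: with, say, 10 data segments and 6 coded segments, any 10 of the 16 suffice to reconstruct the object, giving the roughly 60% overhead cited above while tolerating six concurrent node failures.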
Object Storage running Hadoop
An object storage system with information dispersal can be used to store data and run Hadoop on that data within the storage system itself. But this requires more than just ‘Hadoop compatible’ storage. The object storage nodes need to be integrated with Hadoop, provide HDFS-compatible storage, and supply the CPU power to run the MapReduce process as well. This means the object storage nodes actually make up the Hadoop cluster itself. An example of one of these systems is Cleversafe’s dsNet Analytics.
Eliminating the 3-copy storage overhead of HDFS directly reduces the amount of storage capacity consumed in the cluster. This can have an immediate impact on storage costs, but it can improve the quality of analysis as well. By removing that inefficiency, Hadoop-integrated object storage lets users process more data within a given Hadoop cluster size, generating better results, because analytics are often more accurate when run on larger data sets. As an example, if the data being analyzed is time-based, processing a larger data set can mean including a longer timeframe in the sample – using data collected over several years instead of months.
Hadoop-integrated object storage systems can provide another benefit as well, this time in reliability. HDFS consolidates the metadata handling for its distributed file system on the NameNode, creating a single point of failure. Companies can address this to some extent by duplicating the NameNode, but this adds cost and complexity to the cluster. Object-based architectures using information dispersal store the HDFS metadata in the same manner as the data objects themselves. This places the metadata in the same highly reliable storage pool with no single point of failure, providing a more reliable file system without the need for redundant NameNodes.
The Apache Hadoop project offers some powerful technology for analyzing very large data sets in a cost-effective open-source framework. It does this by breaking large analytics projects down into manageable pieces and leveraging a distributed, scale-out architecture. But Hadoop’s native file system, HDFS, has an inherent inefficiency – it creates two additional copies of each file. This increases storage costs and effectively reduces the size of the data set that can be analyzed. Object storage systems that are integrated with Hadoop and feature information dispersal offer a solution to this problem. They can eliminate the overhead of redundant copies and provide better data resiliency than HDFS, while enabling Hadoop clusters to support even larger data sets.