When organizations begin to explore a Hadoop initiative, their goal is to create an environment that can help them make better decisions. The more data the Hadoop environment has access to, and the faster that data can be processed, the more accurate those decisions can become. Of course, all of that data has to be stored, and stored cost effectively. Object storage may be the ideal platform on which to host a Hadoop environment.
The primary tenet of Hadoop architectures is to bring the compute to the data, not to transfer the data to the compute. This tenet creates two challenges, both of which an object storage based Hadoop environment can solve. The first is the batch extract and ingest stage that moves data into Hadoop before a processing job can run. Hadoop provides a file system with access via an S3 interface, which eliminates that stage and gives the Hadoop Distributed File System (HDFS) direct access to S3 compatible object storage systems like Cloudian’s HyperStore. The Hadoop infrastructure can therefore assume that it is running in its normal configuration. All the special capabilities that object storage brings happen behind the scenes.
The Data Lake Challenge
The second challenge that Hadoop presents to the organization is collecting all the data into a single repository. As mentioned above, the more data that can be held within the confines of the Hadoop infrastructure, the more accurate its analysis can become. This requires a cost effective storage infrastructure that is accessible through a wide variety of protocols, something the industry has dubbed a “data lake”.
The problem is that it is difficult to create a data lake out of Hadoop’s default configuration. First, it does not offer data access through more standard enterprise protocols like NFS and CIFS. Second, from a data protection perspective, Hadoop in its default configuration makes three copies of all data. This means that in an environment where there are five petabytes (PB) of information to be processed, 15PB of space needs to be allocated. Even though Hadoop is designed to use commodity hard disks, at a 3X consumption rate its storage requirements can become quite expensive.
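The 3X arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation; the function name and its parameters are hypothetical and are not part of Hadoop.

```python
def raw_capacity_pb(usable_pb, copies=3):
    """Raw disk capacity (PB) needed when every block is stored `copies` times.

    `copies=3` mirrors the default three-copy behavior described in the text.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return usable_pb * copies


if __name__ == "__main__":
    # The article's example: 5 PB of data under three-way replication.
    print(raw_capacity_pb(5))  # 15
```

Every additional petabyte of source data adds three petabytes of raw disk, which is why the copy count dominates the cost of the cluster.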
Another issue with creating a Hadoop data lake is object scalability and durability. Native HDFS cannot scale to trillions of data objects and does not support data center to data center replication. As a result, the data has to be kept in a different pool of storage for safekeeping, and only a subset of it is moved into the HDFS data lake.
Object Storage Perfects the Data Lake
Object storage addresses the challenges of creating a data lake. Some object storage solutions support access to the object storage cluster through a variety of protocols, including native object, NFS and CIFS, and handle a variety of data sizes. This means that no matter what creates the data, or how it is transferred, it can land on the object storage cluster. And, again, since solutions like Cloudian provide the S3 support that HDFS needs, HDFS can now be included in an object storage based data lake.
The other data lake challenge, cost, is also resolved by an object storage cluster like Cloudian’s. These systems can also use commodity servers as storage nodes and commodity hard disk drives inside them. But unlike a default Hadoop architecture, they don’t require the 3X copy strategy.
Instead, some object storage systems leverage a capability called erasure coding. Think of erasure coding as a more efficient alternative to RAID, designed for scale out storage architectures. Where RAID is limited to a drive level of granularity, erasure coding works below the object layer and creates its parity information there. This means that a drive failure does not force a beginning to end recreation of the entire drive, only of the data that was actually stored on it. Erasure coding also allows for a more tunable data protection layer that can be matched to the value of the data.
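To make the idea of object-level parity concrete, here is a minimal Python sketch. It uses single XOR parity, the simplest possible erasure code, as a stand-in; production object stores typically use stronger codes that survive several simultaneous losses. The function names and the 4 data + 1 parity layout are assumptions chosen for illustration, not Cloudian's implementation.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj: bytes, k: int) -> list[bytes]:
    """Split one object into k equal data fragments plus one XOR parity fragment."""
    frag_len = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(frag_len * k, b"\0")  # zero-pad so fragments are equal length
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def rebuild(frags: list[bytes | None]) -> list[bytes]:
    """Recreate at most one missing fragment (marked None) from the survivors."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost fragment")
    repaired = list(frags)
    if missing:
        survivors = [f for f in frags if f is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def overhead(k: int, m: int) -> float:
    """Raw-to-usable capacity ratio for k data fragments and m parity fragments."""
    return (k + m) / k


if __name__ == "__main__":
    data = b"sensor readings for the data lake"
    stored = encode(data, k=4)  # imagine each fragment on a different drive
    stored[2] = None            # one drive fails
    repaired = rebuild(stored)
    assert b"".join(repaired[:4])[:len(data)] == data
    print(overhead(4, 1), "x raw capacity, versus 3.0 x for three copies")
```

Only the fragments of objects that actually lived on the failed drive need to be recomputed, and changing the data and parity counts is what makes the protection level tunable.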
Another advantage of an object storage based data lake is that it addresses the scalability and durability challenges of HDFS. With an object storage system like Cloudian’s HyperStore, Hadoop can access a data lake without scalability limitations. The object storage data lake can also provide data integrity checking as well as data center to data center replication.
Finally, an object storage based data lake provides a REST based API. This allows data lake administrators to automate data management, since data can be accessed and manipulated in a scripted or programmatic fashion.
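As a rough illustration of scripted data management, the Python sketch below builds two S3-style REST requests: one that lists the objects under a prefix and one that deletes a single object. It assumes the data lake exposes the S3-compatible interface mentioned earlier. The endpoint, bucket and key names are hypothetical, and the requests are only constructed, not signed or sent; a real deployment would also need authentication headers.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical endpoint, used purely for illustration.
ENDPOINT = "https://objectstore.example.com"


def list_request(bucket: str, prefix: str = "") -> Request:
    """Build an S3-style GET request that lists objects under a key prefix."""
    query = urlencode({"list-type": "2", "prefix": prefix})
    return Request(f"{ENDPOINT}/{bucket}?{query}", method="GET")


def delete_request(bucket: str, key: str) -> Request:
    """Build an S3-style DELETE request for a single object."""
    return Request(f"{ENDPOINT}/{bucket}/{quote(key)}", method="DELETE")


if __name__ == "__main__":
    # Unsigned requests: printed here rather than sent.
    for req in (
        list_request("sensor-data", "logs/2015/"),
        delete_request("sensor-data", "logs/2015/old.csv"),
    ):
        print(req.get_method(), req.full_url)
```

Because each operation is an ordinary HTTP request, the same pattern can be wrapped in a scheduled script to expire, copy or tag data without manual intervention.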
Better Metadata Protection
The other advantage of combining object storage and Hadoop is that it provides better protection for the master node’s data. As we outlined in our article “What is Hadoop?”, the master node stores all the metadata for the Hadoop infrastructure and the results of the MapReduce output. In the default Hadoop configuration, considerable steps have to be taken to maintain high availability of the Hadoop master.
When Hadoop is combined with object storage, the master’s metadata and MapReduce results are automatically protected by the same architecture. If the master node fails, it simply needs to be restarted on another node, where it can access the same data as the original master.
Object Storage Before Hadoop
Object storage provides so much value to the Hadoop architecture that it is reasonable to state that data centers should start by building their object storage cluster first. They can use it as an initial storage platform for all the data they are collecting from sensors and for output from database reports, and even use it to replace file servers.
This gives them the platform for their data lake so that when they are ready to invest in Hadoop, the data needed to perform an analysis is already centrally located and ready for access. With object storage, they even have the pre-positioned compute power needed to drive the Hadoop cluster.
Sponsored by Cloudian
EMC’s data lake approach with Isilon provides multi-protocol access, including HDFS.