Hadoop and OpenStack for Enterprise Big Data – Moving from Erector Set to Ready To Go

Big Data analytics is more than just an interesting IT initiative; these infrastructures are starting to reshape how businesses develop new products, respond to customer needs and achieve new levels of efficiency. The problem is that most Hadoop environments, the technology typically used for these projects, are essentially ‘do-it-yourself’ solutions: an ‘Erector Set’ from which an infrastructure must be built. For more businesses to take advantage of the data they are collecting, the development environment needs to become much more of a ready-to-go platform. This allows the IT team to implement an analytics project more quickly, enabling the organization to decrease time to value on its big data investment.

The early analytics adopters have been mostly organizations with significant online presences, companies like Facebook, Amazon®, Google and Twitter that by their very nature (millions of concurrent users) collect enormous amounts of data continuously and have the infrastructures to monetize those data points. Other early adopters have been more traditional organizations such as UPS®, FedEx® and Walmart where, given the number of devices they need to track, immediate value can be derived from the investment. All of these organizations could justify the investment in time and people to put Hadoop and its associated infrastructure Erector Sets together.

That said, even these companies, especially the latter group, could benefit from a turnkey, platform approach to implementing a complex analytics infrastructure. While the components for the build-it-yourself method are economical, the time and skill sets required to assemble them are not. Given the cost savings, improved customer satisfaction and increased efficiency that big data analytics can deliver, organizations need a solution that simply needs to be ‘plugged in’, saving time and producing value much more quickly.

The Components of the Hadoop / Big Data Infrastructure

Many Hadoop architectures are based on a hyper-converged model. These scale-out designs consist of a number of server-class systems or “nodes” clustered into a common pool of compute, network and storage resources. Depending on the design, this allows an application to scale horizontally within the architecture while still having node-level access to data for faster processing. At the core of the architecture is either Hadoop or OpenStack, the choice of which will impact the specific cluster file system used.

Because these hyper-converged architectures leverage a multi-tiered, build-it-yourself model, IT professionals need to select which type of servers, what networking, and which types and brands of storage to use. In the OpenStack case they will also need to select the underlying file system. Even as a more platform-based approach emerges, it is important that the IT professional understand each of these tiers and its significance to the overall operation of the cluster.

Compute Tier

The compute tier is critical for a number of reasons. The most obvious is that this tier is responsible for running the actual jobs that make up the analytics process. But this is also the tier that runs the Hadoop or OpenStack framework, the cluster management software and the storage services. Depending on the application and the design of the cluster, the compute power that can be applied to any given application is limited to the performance of a given node. Some applications, like Hadoop, can sometimes be designed to spread that processing across multiple servers. But other times total output is dependent on a single NameNode server that controls all processing and appends new data. No matter the configuration, per-node processing performance is likely to be a concern for most data centers.

The second consideration is the internal storage capacity of the servers that make up the compute tier, since this local capacity will be the storage infrastructure that supports the analytics environment. Again, a basic goal of many analytics processing designs is to keep the data local to the node doing the analysis. As a result, the raw capacity of these nodes needs to be large enough to support these jobs. Some of this capacity will also be used for data protection, so additional overhead should be factored into the raw capacity requirements. IT planners should look for configurations that can store roughly 50TB of raw capacity per node.
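As a rough illustration of factoring protection overhead into raw capacity (the 50TB figure comes from the text; the three-way replication factor is an assumed example), usable capacity is simply raw capacity divided by the number of copies stored:

```python
def usable_capacity_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Estimate usable capacity when each block is stored replication_factor times."""
    return raw_tb / replication_factor

# A node with ~50TB raw under three-way replication yields roughly 16.7TB usable.
print(round(usable_capacity_tb(50, 3), 1))
```

Planners sizing a cluster would run this per node and multiply by node count to check the design against expected data growth.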

Storage Tier

In a hyper-converged Hadoop implementation, the compute tier houses the physical storage media, while the storage software does the actual data management. There are two primary expectations of this tier: that the data be protected and that the storage perform acceptably for a given workload.

Protection in hyper-scale environments typically comes in two forms: replication and erasure coding. Both methods allow the use of very high capacity hard disk drives, since they can rebuild data at a sub-drive level and are designed for multi-node architectures.

Replication makes a user-defined number of copies of a data set on other nodes, with the copy count depending on the data’s criticality. Since this is typically set at three or more copies, capacity consumption can be an issue for some environments. Alternatively, erasure coding is a parity-based protection scheme that uses less capacity for protection (typically around 30% overhead for a scheme with resiliency comparable to three-way replication). But the data is no longer local to the node, so all data accesses must come across the network.
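To make the capacity trade-off concrete, here is a minimal sketch. The 10+3 erasure-coding layout is an assumed example chosen because it tolerates three lost shards, roughly comparable to keeping three replicas:

```python
def replication_overhead(copies: int) -> float:
    """Extra capacity consumed beyond the original data, as a fraction of data size."""
    return float(copies - 1)  # e.g. 3 copies -> 2.0 (200% overhead)

def erasure_coding_overhead(data_shards: int, parity_shards: int) -> float:
    """Parity capacity as a fraction of data size for a k+m erasure-coding scheme."""
    return parity_shards / data_shards  # e.g. 10 data + 3 parity -> 0.3 (30% overhead)

print(replication_overhead(3))         # three-way replication: 200% extra capacity
print(erasure_coding_overhead(10, 3))  # 10+3 erasure coding: 30% extra capacity
```

The arithmetic shows why erasure coding is attractive at scale, while the text’s caveat still applies: the capacity savings come at the cost of network hops on every read.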

Another requirement of the storage tier is the ability to accept data from a variety of input streams. Depending on the environment this can include social media, web logs and click streams, sensor data from the Internet of Things, email and document data, as well as legacy data warehouse information. Hadoop and the analytics processing applications will work across these data sets to deliver the desired results to the organization. Each of these input streams will typically access the storage tier in a different way. As a result, it is important that the storage tier support protocols like NFS, REST APIs and OpenStack Swift.
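As one illustration of REST-style ingest, the sketch below builds (but does not send) an OpenStack Swift object-upload request. The endpoint, account, container and token values are placeholders, not a real deployment:

```python
from urllib.request import Request

def swift_put_request(endpoint: str, account: str, container: str,
                      obj: str, token: str, body: bytes) -> Request:
    """Build a Swift-style PUT request for /v1/{account}/{container}/{object}."""
    url = f"{endpoint}/v1/{account}/{container}/{obj}"
    return Request(url, data=body, method="PUT",
                   headers={"X-Auth-Token": token})

# Hypothetical example: landing a day's click-stream log into a container.
req = swift_put_request("http://storage.example.com:8080", "AUTH_demo",
                        "clickstreams", "2016-01-01.log", "token123", b"...")
print(req.get_method(), req.full_url)
```

An NFS stream, by contrast, would simply write files into a mounted path; supporting both lets each source feed the same storage tier in its native way.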

Network Tier

Networking is often an overlooked part of the hyper-scale design, but one that’s critical to the success or failure of the project for several important reasons. First, any multi-node cluster generates a lot of network traffic, especially a hyper-scale environment that can easily surpass 50 nodes. The inter-node communications alone in clusters this large are extensive and require a deterministic, low-latency network infrastructure for cluster stability. Added to the basic cluster communications are the other demands of a hyper-converged architecture. These include the rapid ingest of large data sets, often consisting of millions of small files, the constant replication of data for protection, and the massive data transfer between nodes when a drive or node fails.

Software Tier

The software itself will consist of a variety of components, most notably Hadoop and its ecosystem. But the software tier will also need to run specific analytics applications like Pentaho, Tableau and Datameer. These analytics applications, combined with tools that support data blending between traditional data architectures and modern ones (e.g., NoSQL databases such as MongoDB), are key to successful Big Data deployments. At the same time, the platform should be able to quickly host web applications, which have become popular due to the widespread availability and convenience of using a web browser as a client.

The Value of an Integrated Platform

The components of a Hadoop / Big Data analytics platform are numerous. Assembling all these parts and making sure that they interoperate cohesively is no small feat, and for many organizations it becomes THE barrier to entry. The effort required to pull all the components together and to integrate them with the organization’s various data sources and big data project goals can be extensive and very time consuming, often taking over a year to deploy and reach an operable state. And in most cases, no two Hadoop infrastructures are the same.

The enterprise needs a fast “time to value” solution that eliminates much of the upfront integration and assembly, allowing it to start analyzing data within weeks instead of years; in other words, a platform approach. This platform should be able to become the centerpiece for storing the unstructured data used in analytics processing while still providing access to legacy data warehouse systems. Enterprises essentially need to create a “data lake” that can be fed from multiple sources independently but analyzed universally by various applications.

The Components Are Key to the Platform

As stated above, even when an enterprise has decided on a platform approach to the analytics problem, understanding the components is still key to selecting the right platform. The platform needs to have plenty of per-node processing power and plenty of per-node capacity. This not only allows better local processing of data, but also limits the number of nodes that need to be purchased. Some platforms are guilty of “node sprawl” where either the processing power or capacity of each node is too small and nodes have to be added prematurely. There has to be a balance between maximizing each node to its fullest and providing the ability to scale to a high number of nodes.
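The “node sprawl” trade-off can be sketched with simple arithmetic. All figures below (a 500TB data set, three-way replication, and the two per-node capacities) are assumed examples for illustration:

```python
import math

def nodes_required(dataset_tb: float, node_raw_tb: float,
                   replication_factor: int = 3) -> int:
    """Nodes needed to hold a data set once protection copies are included."""
    return math.ceil(dataset_tb * replication_factor / node_raw_tb)

# 500TB of source data under three-way replication:
print(nodes_required(500, 50))  # on 50TB-raw nodes: 30 nodes
print(nodes_required(500, 16))  # on 16TB-raw nodes: 94 nodes -> "node sprawl"
```

Tripling per-node capacity cuts the node count by roughly two-thirds, which is exactly the balance between dense nodes and scale-out headroom the text describes.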

Data protection is also critical. As the analytics platform evolves into a data lake, it may become the sole repository for some data sets. Protection and retention of that data become the responsibility of the platform. Again, a replication strategy is ideal since full instances of data are redundantly stored on different nodes, which complements both the need for local processing and data availability.

It is also critical that the platform leverage a high-performance network that can handle inter-node and inter-application communication as well as rapidly shift data locations when required. As an analytics cluster scales toward 100 nodes, network communication becomes a critical factor in terms of both performance and reliability. This is not the part of the infrastructure to skimp on or to cover with the cheapest possible networking option. Instead, platform vendors should provide tier-one networking with low latency, high performance and, of course, reliability.

Taking the Platform Further

While the analytics platform described above is a great start, most organizations are going to need more from providers of these platforms. There are specific applications in verticals like telecom, healthcare, IT, surveillance, oil & gas, automotive and others that can deliver enhanced analytics for these industries and simplify the implementation as well. The platform provider needs to have relationships with experts in these markets to complete the platform picture. Doing so will allow for a nearly turnkey implementation and allow organizations to derive immediate value from their big data platform investment.


A big data project can deliver a potentially larger ROI than any IT project an organization has ever invested in or ever will. It is that much of a difference maker and can reshape an entire organizational IT strategy. And “data” isn’t just part of the name; data is the heart and soul of any big data project. It is easier than ever to gather that data. But the big challenge, the one that analytics addresses, is understanding what all this collected data is trying to say. Hadoop tries to correlate all this data to reveal potential new strategies and directions that were not apparent in the past.

The IT infrastructure that stores and enables this data to be processed is essentially the ‘circulatory system’ that feeds the Hadoop and analytics solutions. But building that infrastructure has been the barrier to entry for most organizations beyond the extremely large and data rich. A platform approach becomes the great equalizer for organizations that can’t access the resources or the skills, or don’t have the time to create an analytics solution that will start to deliver value. The key though is to find a platform that does not compromise on the components and instead delivers a better experience with higher performance and higher data reliability.

Sponsored by Hitachi Data Systems

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
