Breaking Down Hadoop, Spark, Cassandra Silos – DriveScale Briefing Note

The storage architecture of most next generation applications, like Hadoop, Spark and Cassandra, leverages local, direct attached storage to avoid excessive storage traffic on the network and keep costs down. While this architecture does accomplish its goals, it also re-creates the problem most common with any direct attached storage architecture: lack of efficiency. But shared storage is not the answer for many of these environments either, because it increases costs and network traffic. Modern data centers need a cost effective, efficient solution that strikes a balance and still meets the performance and data locality demands of these environments.

The Direct Attached Storage Problem

Several problems emerge with a direct attached storage approach. Most data centers will have more than one of these modern applications, and as a result they end up with multiple clusters, each with its own dedicated storage. Invariably, one cluster is constantly running out of capacity while others have plenty, and there is no way to redistribute storage between them.

Another challenge comes when a request for another cluster or node is made. Each request has to be serviced separately and often leads to a new work order and a new cluster. Nothing about the process is delivered “as a service.”

Finally, the direct attached storage model means the organization is always buying compute and storage at the same time, which leads to further resource inefficiencies. In many cases, IT has to add nodes to meet the capacity demands of the environment, not because it needs more compute power. In others, the environment needs more compute power but not more capacity. The two resources rarely scale in lockstep. The result is that the typical modern application infrastructure utilizes less than 30% of its available resources.
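To make the coupled-scaling problem concrete, here is a minimal sketch in Python. The workload figures and per-node sizes are hypothetical, chosen only for illustration; they are not taken from DriveScale or from any measured environment. The sketch shows how the larger of the two demands dictates the node count and leaves the other resource underused.

```python
import math

def nodes_needed_coupled(compute_cores, capacity_tb, cores_per_node, tb_per_node):
    """Nodes to buy when every node bundles a fixed amount of compute and storage.

    The node count is driven by whichever resource runs out first.
    """
    by_compute = math.ceil(compute_cores / cores_per_node)
    by_capacity = math.ceil(capacity_tb / tb_per_node)
    return max(by_compute, by_capacity)

def utilization(demand, nodes, per_node):
    """Fraction of the purchased resource that the workload actually uses."""
    return demand / (nodes * per_node)

# Hypothetical capacity-heavy workload: 160 cores and 1,200 TB,
# served by nodes that each provide 32 cores and 48 TB.
nodes = nodes_needed_coupled(compute_cores=160, capacity_tb=1200,
                             cores_per_node=32, tb_per_node=48)
cpu_util = utilization(160, nodes, 32)
cap_util = utilization(1200, nodes, 48)

print(f"nodes purchased: {nodes}")
print(f"compute utilization: {cpu_util:.0%}")
print(f"capacity utilization: {cap_util:.0%}")
```

With these assumed numbers, capacity alone forces the purchase of 25 nodes even though 5 would cover the compute demand, so most of the purchased compute sits idle.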

The Shared Storage Problem

Although an increasing number of these modern applications support shared storage, this type of storage brings its own problems, most noticeably increased expense and network traffic loads. In some cases, the increased demand on network bandwidth can be overcome, but doing so further increases expense and complexity.

There is also a concern over quality of service: ensuring that each node in the cluster consistently gets the performance it requires. Since most shared storage systems have every node, or at least most nodes in the cluster, sharing the same LUN or volume, achieving consistent quality of service is difficult without extensive planning.

Striking a Balance with DriveScale

DriveScale provides a software composable architecture that binds storage in the form of JBOD (just a bunch of disks) storage shelves to diskless (or disk-lite) compute nodes, maintaining the data locality required by these workloads while providing a more agile and efficient infrastructure. The solution consists of a DriveScale adaptor, which acts as a SAS-to-Ethernet bridge connecting the JBOD drives and compute nodes within each rack, along with software to discover, compose and monitor the storage and compute resources and the clusters that are created from those shared pools. To the application stack, the “composed” nodes look and perform like a standard server, but unlike direct attached storage, the drives and compute nodes can be re-assigned as needed to other clusters. The result is a highly scalable architecture with minimal network load and very low cost, but with increased efficiency since storage capacity isn’t permanently locked to a specific node or cluster.
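The idea of composing nodes from a shared pool can be illustrated with a toy model. The sketch below is not DriveScale's software or API; the class, method names and drive identifiers are all hypothetical. It only shows the bookkeeping concept: drives live in a pool, are bound to a node when a cluster is composed, and return to the pool for reuse when released.

```python
class DrivePool:
    """Toy model of a rack-level pool of JBOD drives (hypothetical, for illustration)."""

    def __init__(self, drive_ids):
        # Map each drive to the node it is bound to, or None if unassigned.
        self.assignment = {drive: None for drive in drive_ids}

    def free_drives(self):
        """Drives not currently bound to any node."""
        return [d for d, node in self.assignment.items() if node is None]

    def compose(self, node, count):
        """Bind `count` free drives to `node` and return their ids."""
        free = self.free_drives()
        if len(free) < count:
            raise ValueError(f"only {len(free)} free drives, {count} requested")
        chosen = free[:count]
        for drive in chosen:
            self.assignment[drive] = node
        return chosen

    def release(self, node):
        """Return all of a node's drives to the pool."""
        released = [d for d, n in self.assignment.items() if n == node]
        for drive in released:
            self.assignment[drive] = None
        return released


# Eight drives in one hypothetical JBOD shelf.
pool = DrivePool([f"jbod0-slot{i}" for i in range(8)])

hadoop_drives = pool.compose("hadoop-node-1", 6)          # Hadoop node takes six drives
pool.release("hadoop-node-1")                             # cluster is torn down
cassandra_drives = pool.compose("cassandra-node-1", 4)    # capacity is reused elsewhere

print(len(pool.free_drives()), "drives still free")
```

With direct attached storage, the six drives in the first step would stay locked inside the Hadoop server; in the pooled model they become available to another cluster as soon as they are released.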

DriveScale also separates the scaling of compute from the scaling of capacity. IT can order nodes with boot drives only, then attach them to the required amount of capacity. Servers that are compute focused tend to be less expensive and take up less space than servers that need to accommodate many storage devices. Similarly, DriveScale allows many more drives to be attached to nodes than could physically fit in a server chassis.

IT can use DriveScale’s on-premises management UI to easily compose nodes and clusters on-the-fly, or access and automate the same functionality from their favorite tools by leveraging a RESTful API. There’s also a cloud-based software component called DriveScale Central that gives IT a view of deployments across multiple data centers, as well as a place to get the latest updates and view log files and documentation.

From a performance perspective, the DriveScale solution adds very little latency. In performance benchmark testing, it has been shown to add only about 1% to the overall latency of the connectivity, coming within a percentage point or two of internal, direct attached drives.

In the latest release, DriveScale adds encryption of data in flight and at rest, a critical requirement for some organizations. In addition to encryption, the company is also adding data disposition and shredding, so that after a cluster is deleted, all traces of the data it used can be removed.

StorageSwiss Take

There are a lot of good reasons to leverage shared storage in modern architectures, but the cost and bandwidth demands it introduces make many modern application infrastructure designers try to avoid it. DriveScale strikes a balance that provides near-DAS performance and even lower costs, but still delivers the resource efficiency of shared storage.

Eight years ago George Crump founded Storage Switzerland with one simple goal: to educate IT professionals about all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With 25 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS and SAN, Virtualization, Cloud and Enterprise Flash. Prior to founding Storage Switzerland he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration and product selection.
