Rethinking the Storage Controller for Unstructured Data

Conventional wisdom suggests that most business data is stored in database applications. In reality, unstructured data comprises approximately 70-80% of the total data in a typical environment and, according to some industry sources, is growing five times as fast. File-based data is also increasing in value, as this information is often mined for decision support in business analytics systems and is sometimes required for legal discovery as well. As a result, mid-sized or Tier-2 data centers have a pressing need to effectively manage and centrally control their unstructured data repositories.

The Unstructured Data Problem and Opportunity

Unstructured data generally consists of user-created files like emails, word processing documents, SharePoint data, PDF files and spreadsheets, as well as digital pictures, graphics and videos. It also includes other information that users generate, such as mobile phone GPS records, social media content, and photos and videos from cell phone cameras.

In order to monetize their information assets and gain a competitive advantage, organizations are increasingly leveraging these growing sources of unstructured data to help facilitate intelligent, real-time business decisions. For example, a traveler searching for low-cost airfares online can simultaneously be presented with a myriad of car rentals, lodging choices and Groupon deals for a variety of restaurants in their destination city.

Similarly, a business may be able to extract valuable insights into the seasonal buying patterns of its clients to make better decisions about product inventories in warehouse locations or retail outlets. In short, the applications for leveraging unstructured data are practically limitless. Consequently, the ability to retain and access this information to support current and future business decisions is essential for competing in the global, web-enabled marketplace.

Challenges for Tier-2 Data Centers

Mid-sized data center environments are experiencing the growing pains associated with absorbing the vast amounts of unstructured data creeping into their storage infrastructures. Unfortunately, they don't have the staff or the budgets to continuously upgrade their storage platforms to include the kinds of enterprise features they may need, like data deduplication, continuous data protection (CDP) and high-availability clustering. Nor can they arbitrarily purge data to make up for the shortfall in available capacity.

As a result, traditional storage systems simply cannot meet the scale-out storage requirements of Tier-2 data center environments. Adding NAS arrays in a serial fashion results in a sprawl of storage 'islands' in the data center, increasing both capital and operational expenses.

Furthermore, since there is no global file system maintaining a centralized index of where data resides across the underlying hardware arrays, it's practically impossible to run the real-time analytical queries needed to support the business decision-making processes mentioned earlier.

Scale-Out NAS

Scale-out NAS systems address many of the issues associated with NAS sprawl since they enable multi-PB, single-file-system deployments. While effective, these solutions are generally not optimal for Tier-2 data centers because their initial storage footprint can actually be greater than the capacity requirements of these mid-sized organizations. In other words, the 'buy-in' required for most scale-out NAS solutions is too high for many Tier-2 environments.

Object Storage Systems

Object storage is highly efficient in that it's typically deployed on lower-cost, commodity storage platforms and often uses a more sophisticated data protection scheme than RAID, called erasure coding. This technology stripes data and parity fragments across multiple nodes, providing a high level of redundancy while consuming far less capacity overhead than conventional RAID or mirroring. Object storage also delivers a very low cost per GB, making it particularly well suited to the storage capacity and I/O workload requirements of unstructured data farms.
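To make the overhead difference concrete, here is a minimal back-of-the-envelope sketch comparing a hypothetical 10+6 erasure-coded layout against three-way replication. The data/parity split and capacity figures are illustrative assumptions, not any vendor's actual configuration.

```python
# Hypothetical comparison of capacity overhead: erasure coding vs. replication.
# The (data, parity) split below is an illustrative example, not a vendor's
# actual layout.

def overhead(raw_tb, usable_tb):
    """Raw capacity consumed per TB of usable data."""
    return raw_tb / usable_tb

usable_tb = 100

# Three-way replication (triple mirroring): every TB is stored three times.
replication_raw = usable_tb * 3

# 10+6 erasure coding: 10 data fragments plus 6 parity fragments per stripe,
# so raw consumption is (10 + 6) / 10 of the usable capacity, and the data
# survives the loss of any 6 fragments.
ec_data, ec_parity = 10, 6
erasure_raw = usable_tb * (ec_data + ec_parity) / ec_data

print(f"Replication:    {replication_raw:.0f} TB raw ({overhead(replication_raw, usable_tb):.1f}x)")
print(f"Erasure coding: {erasure_raw:.0f} TB raw ({overhead(erasure_raw, usable_tb):.1f}x)")
```

Under these assumptions the erasure-coded layout consumes 1.6x the usable capacity versus 3x for replication, which is where the "far less overhead" claim comes from.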

One of the most important advantages of object storage is its ability to be filled to maximum capacity without impacting performance. Traditional file systems get bogged down in their own metadata hierarchies and, as they cross roughly the 50% utilization mark, can begin to see performance degradation.

The challenge with object storage solutions, however, is that access to these resources is through a vendor's proprietary interface rather than a standard file protocol. Data centers either have to re-code their applications to use the vendor's application programming interface (API) to address the back-end object store, or deploy a cloud gateway appliance to translate file requests onto the object storage platform.
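As a rough illustration of what that re-coding entails, the sketch below contrasts a conventional file write with an object write through the widely used S3-style API (via the boto3 library) as a stand-in; a given vendor's proprietary API will differ in names and semantics, and the endpoint, bucket and path shown are hypothetical.

```python
# A minimal sketch of "re-coding to an object API", using the S3-style
# interface via boto3 as a stand-in for a vendor-specific API.
import boto3

report_bytes = b"...report contents..."   # placeholder payload

# Traditional file-system write: the application simply opens a path on a share.
with open("/mnt/nas/reports/q3.pdf", "wb") as f:
    f.write(report_bytes)

# Object storage write: the application must instead address a bucket and key
# through the store's API endpoint (all values here are hypothetical).
s3 = boto3.client("s3", endpoint_url="https://objects.example.internal")
s3.put_object(Bucket="reports", Key="q3.pdf", Body=report_bytes)
```

Every code path in an application that reads or writes files would need a similar change, which is why the alternative of a gateway appliance exists at all.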

For Tier-2 data centers, neither option is ideal. These companies don't typically control their application code base, and deploying a cloud gateway appliance introduces additional cost and complexity.

Cloud Storage

Large cloud service providers (CSPs) like Amazon and Google have created massive object storage infrastructures that offer relatively low entry points for storing data in their clouds. However, the same challenges described above for integrating existing business applications with object storage apply here as well.

Furthermore, most Tier-2 data centers have too much unstructured data to make cloud storage adoption an economical proposition over the long term. As their unstructured data grows, so will the recurring monthly charge for storing that information in a CSP’s environment.
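A simple recurring-cost model shows why that growth matters. The price per GB and growth rate below are illustrative assumptions only, not quotes from any provider.

```python
# Back-of-the-envelope recurring cost model for cloud capacity that grows
# over time. All figures (price per GB, growth rate) are illustrative.

price_per_gb_month = 0.03      # hypothetical $/GB/month
capacity_tb = 50               # starting unstructured data footprint
annual_growth = 0.40           # assumed 40% growth per year

total_cost = 0.0
for year in range(1, 6):
    yearly_cost = capacity_tb * 1000 * price_per_gb_month * 12
    total_cost += yearly_cost
    print(f"Year {year}: {capacity_tb:.0f} TB -> ${yearly_cost:,.0f}")
    capacity_tb *= 1 + annual_growth

print(f"5-year cumulative spend: ${total_cost:,.0f}")
```

The point is not the specific numbers but the shape of the curve: a monthly charge that compounds with data growth, as opposed to on-premises capacity that is purchased once and depreciated.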

Scale-Out Cloud Storage

Tier-2 data centers need the attributes of a scale-out storage infrastructure that also incorporates the cost-efficiency benefits of object storage. To fit into the budget of the Tier-2 data center, however, the scale-out platform should be flexible enough to start small and grow large as the business demands. For instance, the ideal scale-out platform could be configured as a single node with tens of TBs and then expanded into a logical ring of multiple nodes supporting hundreds of TBs.
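One common way such a "logical ring" can grow incrementally is consistent hashing: each object hashes to a point on a ring and lands on the next node clockwise, so adding a node simply takes over a slice of the ring. The sketch below is a textbook illustration of that idea, not Exablox's actual data placement scheme.

```python
# Generic consistent-hashing ring: a textbook illustration of growing from a
# single node to many, not any specific product's placement algorithm.
import bisect
import hashlib

class Ring:
    def __init__(self):
        self.points = []   # sorted hash points on the ring
        self.nodes = {}    # hash point -> node name

    def _hash(self, key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def add_node(self, name, vnodes=64):
        # Virtual nodes smooth out the distribution when nodes are added.
        for i in range(vnodes):
            h = self._hash(f"{name}#{i}")
            bisect.insort(self.points, h)
            self.nodes[h] = name

    def locate(self, key):
        # Place the object on the first node at or after its hash point.
        h = self._hash(key)
        idx = bisect.bisect(self.points, h) % len(self.points)
        return self.nodes[self.points[idx]]

ring = Ring()
ring.add_node("node-1")            # start small: a single node
print(ring.locate("projects/plan.docx"))
ring.add_node("node-2")            # grow by simply adding nodes to the ring
ring.add_node("node-3")
print(ring.locate("projects/plan.docx"))
```

Only the slice of objects whose hash points fall on the new nodes moves, which is what lets capacity expand without a wholesale data migration.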

Moreover, to simplify deployment and access to these storage resources, the scale-out system should also provide NAS access with a global file system, utilizing standard CIFS/SMB protocols. It should also be "discoverable" as a public share within minutes of being attached to the network, without reconfiguring file systems or keying in IP addresses, DNS entries or user names. In short, adding storage capacity should be as simple as plugging in a drive or a network cable to a new node and powering it up.

Feature Rich/Low Cost

From a data protection perspective, the system should incorporate all the high-end features normally found in enterprise storage technologies, like snapshots, continuous data protection (CDP), in-line data deduplication, encryption and replication. Furthermore, these data management capabilities should operate completely behind the scenes without requiring day-to-day management or care and feeding.
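To show what one of those features, in-line deduplication, is doing behind the scenes, here is a minimal sketch of content-addressed chunking: identical chunks are stored once and referenced thereafter. It illustrates the general principle only, with fixed-size chunks chosen for simplicity; it is not any specific product's deduplication engine.

```python
# Minimal in-line deduplication sketch using content-addressed chunks.
# Illustrative only; real engines use variable-size chunking and persistence.
import hashlib

CHUNK_SIZE = 4096          # fixed-size chunking, chosen for simplicity
chunk_store = {}           # fingerprint -> chunk bytes (each stored once)

def write_dedup(data: bytes) -> list:
    """Split data into chunks, store only unseen chunks, return fingerprints."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:          # new content: store it
            chunk_store[fp] = chunk
        recipe.append(fp)                  # duplicate content: just reference it
    return recipe

def read_dedup(recipe: list) -> bytes:
    """Reassemble a file from its chunk fingerprints."""
    return b"".join(chunk_store[fp] for fp in recipe)

original = b"hello world" * 10_000
recipe = write_dedup(original)
assert read_dedup(recipe) == original
stored = sum(len(c) for c in chunk_store.values())
print(f"logical: {len(original)} bytes, physically stored: {stored} bytes")
```

Because the same fingerprints recur, the physical footprint stays a small fraction of the logical data written, which is the capacity saving deduplication delivers on highly redundant unstructured data.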

To further simplify administration, the scale-out environment should be completely manageable from a cloud-based software application. Whether storage is located in a single site or dispersed across dozens of locations, it could be centrally managed from a web browser anywhere in the world. Furthermore, some lower-level administrative functions, like self-service data restores, should be incorporated into the system to relieve help desk staff and improve end-user satisfaction.

Companies like Exablox have built this type of scalable, NAS-based cloud storage architecture. These solutions enable mid-sized businesses to protect, manage and scale out large repositories of unstructured information without resorting to the usual (and expensive) NAS forklift upgrades. They incorporate all of the higher-level data protection features at no additional premium, and capacity can be flexibly added across a mix of drive types, including SAS, SATA and SSD.

Furthermore, since this architecture has been specifically engineered from the ground up to be simple to manage, a single IT administrator with minimal training could easily deploy and run hundreds of TBs of storage on a part-time basis, resulting in decreased IT labor costs.

Conclusion

Mid-sized data center environments face many of the same storage management challenges as their counterparts in the enterprise space but have far fewer people and capital resources to address their needs. With the continued growth of unstructured data, mid-sized data centers may very well be at a crossroads in their search for ways to effectively and affordably manage, store and protect this information now and well into the future.

New storage paradigms like Exablox's scale-out NAS object storage provide a viable option for resource-constrained data centers to contain the deluge of unstructured data hitting their environments without adversely impacting the bottom line.

Exablox is a client of Storage Switzerland

As a 22-year IT veteran, Colm has worked in a variety of capacities, ranging from technical support of critical OLTP environments to consultative sales and marketing for system integrators and manufacturers. His focus in the enterprise storage, backup and disaster recovery solutions space spans mainframe and distributed computing environments across a wide range of industries.
