Keeping unstructured data available as data sets continue to grow is a key challenge for today’s data center. The access requirements of unstructured data often preclude relying 100% on capacity-centric tape, so for many organizations higher capacity disk solutions become the only option. But making sure data is protected and available on very large disk systems built from multi-TB disk drives has become increasingly challenging and expensive.
The Legacy Array Availability Problem
The legacy disk array, whether scale-up or scale-out, can be configured to handle very high storage capacities at relatively inexpensive prices per GB. But, as Storage Switzerland discusses in this article, both of these topologies have their challenges when asked to support double-digit PB storage capacities. The other real concern in these scenarios is making sure data remains accessible.
These systems rely on traditional RAID architectures to maintain data availability, and unfortunately RAID is not particularly well suited to PB-scale environments. This is especially true when arrays are configured with the highest capacity hard drives available to keep cost per GB low. Even with these high capacity drives, there is still the requirement to support hundreds (or thousands) of spindles.
Unfortunately, hundreds of high capacity hard drives create the ‘perfect storm’ for RAID. With this many drives the cloud storage provider has to work under the assumption that there will always be a failed hard drive somewhere in the system (the laws of probability). When a hard disk in a RAID group fails, the RAID controller logic must rebuild the entire disk, sector by sector, even if the drive is mostly empty. The larger the drive, the more time this rebuild takes; with disk capacities of up to 4TB it can stretch into days.
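To see why a full-disk rebuild can stretch into days, here is a rough back-of-the-envelope sketch. The sustained rebuild rates are assumptions chosen for illustration, not measurements from any particular array; real rates depend on the controller and on how much production I/O competes with the rebuild.

```python
def rebuild_hours(capacity_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to rewrite every sector of a drive at a sustained rate."""
    capacity_bytes = capacity_tb * 1e12      # decimal terabytes, as drives are sold
    rate_bytes_per_s = rate_mb_per_s * 1e6   # assumed sustained rebuild throughput
    return capacity_bytes / rate_bytes_per_s / 3600


# Assumed rates: an idle array versus one throttled by production traffic.
for rate in (100, 50, 20):
    hours = rebuild_hours(4, rate)
    print(f"4 TB at {rate} MB/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Because the whole drive is rebuilt regardless of how much of it holds data, the estimate depends only on raw capacity and throughput.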
Long rebuild times, plus the likelihood of constant rebuild efforts, create additional challenges for these large scale data centers. First, the longer the rebuild, the longer the applications or users connected to that array have to endure the degraded performance caused by I/O intensive rebuild traffic. Second, and more importantly, the longer the rebuild, the longer the data center is vulnerable to complete data loss caused by a second or third drive failure (depending on whether RAID 5 or RAID 6 is deployed).
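The link between rebuild time and exposure can be illustrated with a simple probability sketch. It assumes drives fail independently at a constant rate derived from an annualized failure rate (AFR); both the model and the 3% AFR and 11-drive group size used below are illustrative assumptions, and this simple model ignores correlated failures and read errors encountered during the rebuild.

```python
import math

HOURS_PER_YEAR = 8760


def p_additional_failure(surviving_drives: int, afr: float, rebuild_hours: float) -> float:
    """Probability that at least one surviving drive fails during the rebuild.

    Assumes independent failures at a constant rate (a simplification).
    """
    failures_per_drive_hour = afr / HOURS_PER_YEAR
    expected_failures = surviving_drives * failures_per_drive_hour * rebuild_hours
    return 1 - math.exp(-expected_failures)


# Hypothetical RAID group with 11 surviving drives and a 3% AFR.
for hours in (12, 24, 72):
    p = p_additional_failure(11, 0.03, hours)
    print(f"{hours:>3} hour rebuild: {p:.4%} chance of another failure")
```

Under these assumptions the exposure grows almost linearly with the rebuild window, so tripling the rebuild time roughly triples the risk for that group.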
These factors encourage storage designers to create highly redundant RAID systems with powerful controllers. The problem is that doing so leads to reduced hard drive efficiency, increased costs and poor space utilization. Despite these extra steps the systems are still not completely trusted and storage managers will often augment them with disk backup systems, replication and tape. In the highly competitive cloud or service provider market the combination of all of these problems has driven storage managers to find new alternatives.
The Object Storage Solution
The first attraction of object storage is its ability to deal with millions upon millions of objects or files. Object based storage allows providers to overcome a key weakness of traditional file systems which can have file count restrictions due to their limitations with handling metadata. These metadata problems often force large infrastructures to add storage systems prior to using all the capacity available with current systems. Object storage solves this problem with its ability to support a nearly unlimited number of objects.
When it comes to data protection, object storage systems also have the advantage of being granular to the object or file level. This means that a user can control the number of copies of data that are made on a per-object (or group of objects) basis. This is typically done via a replication policy that simply copies objects to other storage nodes in the environment.
With a replication-based object storage system, if a disk fails, the system serves another copy of the object stored on a separate node or disk. It then replicates that copy to another device in the system to bring the object back to full redundancy. Performance is also improved because, unlike RAID, this approach requires no complex XOR calculations to reconstruct lost data.
The Object Storage Challenge
While a replication-based strategy enables simple protection with rapid recovery, it does present several challenges to the storage designer. First, this type of “1: X” protection scheme, with X being the desired number of redundant copies, is very capacity inefficient. Assuming that a data center would like to keep three copies so it can still access data even if two nodes have failed, its storage capacity requirements would triple. Further levels of protection only exacerbate the problem.
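The capacity arithmetic behind replication is simple enough to sketch. The function names and the 1,000 TB data set are hypothetical, chosen only to make the tripling concrete.

```python
def replication_raw_tb(usable_tb: float, copies: int) -> float:
    """Raw capacity needed when every object is stored as `copies` full copies."""
    return usable_tb * copies


def tolerated_node_failures(copies: int) -> int:
    """Data stays readable as long as one copy survives."""
    return copies - 1


usable = 1000  # hypothetical 1 PB of unique data, expressed in TB
for copies in (2, 3, 4):
    print(
        f"{copies} copies: {replication_raw_tb(usable, copies):.0f} TB raw, "
        f"survives {tolerated_node_failures(copies)} node failure(s)"
    )
```

Each additional failure that must be tolerated costs another full copy of the data set.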
The second challenge is performance. Since each copy of an object is contained entirely on a single storage node, performance is limited to the capabilities of that node. In other words, per-object performance will not scale as more nodes are added to the environment.
The Dispersed Storage Solution
To solve these problems companies like Cleversafe have leveraged data erasure coding and dispersal algorithms that parse data into multiple segments and then distribute these segments across multiple nodes in the storage cluster. In the case of Cleversafe these nodes can be self-contained, all in one building or data center, or can be geographically distributed.
At a high level erasure coding is similar to RAID except that the parity calculations are applied at the file (or object) level instead of at the disk level. When a rebuild is necessary, the segments that compose a file can be reproduced much more easily than having to rebuild an entire disk drive.
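As a minimal illustration of parity applied at the object level, the sketch below splits one object into segments and adds a single XOR parity segment, so any one lost segment can be rebuilt from the survivors. This is a deliberately simplified stand-in: it tolerates only one loss, and it is not the algorithm Cleversafe or any other vendor uses. All function names are hypothetical.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj: bytes, k: int) -> List[bytes]:
    """Split an object into k equal data segments plus one XOR parity segment."""
    seg_len = -(-len(obj) // k)                  # ceiling division
    padded = obj.ljust(seg_len * k, b"\0")       # zero-pad the final segment
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    parity = reduce(xor_bytes, segments)
    return segments + [parity]


def rebuild_segment(segments: List[bytes], missing: int) -> bytes:
    """Recreate the segment at index `missing` by XOR-ing all the others."""
    survivors = [s for i, s in enumerate(segments) if i != missing]
    return reduce(xor_bytes, survivors)


pieces = encode(b"object storage rebuilds objects, not disks", k=4)
lost = 2  # pretend the node holding segment 2 failed
assert rebuild_segment(pieces, lost) == pieces[lost]
```

Note that the rebuild touches only the segments of the affected object, which is the contrast with a sector-by-sector disk rebuild.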
This level of protection can be dialed up or down by changing the delta between the number of segments generated and the minimum number of segments required to reconstruct the data. The setting can be driven by policies based on attributes such as age, data type or popularity. The result is a level of protection, or effective redundancy, much higher than even the two-drive-failure protection provided by RAID 6, with far less capacity overhead than keeping multiple full copies.
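The trade-off between segment counts, protection and capacity can be sketched as follows. The 16-segment, 10-required configuration is a hypothetical example rather than a vendor default; three-way replication is modeled as three segments of which any one suffices.

```python
def dispersal_profile(total_segments: int, required_segments: int) -> dict:
    """Protection and overhead for a scheme needing `required` of `total` segments."""
    if not 1 <= required_segments <= total_segments:
        raise ValueError("required_segments must be between 1 and total_segments")
    return {
        "tolerated_losses": total_segments - required_segments,
        "raw_per_usable": total_segments / required_segments,
    }


schemes = {
    "3-way replication": (3, 1),        # any single copy can serve the object
    "dispersal 16 of 10": (16, 10),     # hypothetical erasure-coded policy
}
for name, (total, required) in schemes.items():
    p = dispersal_profile(total, required)
    print(f"{name}: survives {p['tolerated_losses']} losses, "
          f"{p['raw_per_usable']:.2f}x raw capacity")
```

Widening the delta raises the number of losses that can be tolerated, while the capacity cost grows only by the ratio of total to required segments.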
Multiple storage nodes can also deliver their data segments in parallel, which helps with read performance. In addition, the system has the intelligence to provide data from the closest or fastest set of nodes.
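One simple way to picture "closest or fastest" selection: since only a minimum number of segments is needed to reconstruct an object, a reader can request them from the nodes with the lowest observed latency. The sketch below is an assumption about how such a choice could be made, with hypothetical node names and latencies; it does not describe any vendor's actual selection logic.

```python
from typing import Dict, List


def pick_read_set(latency_ms: Dict[str, float], required_segments: int) -> List[str]:
    """Return the `required_segments` nodes with the lowest observed latency."""
    if required_segments > len(latency_ms):
        raise ValueError("not enough reachable nodes to reconstruct the object")
    return sorted(latency_ms, key=latency_ms.get)[:required_segments]


# Hypothetical latencies to six nodes spread across three sites.
observed = {
    "local-1": 2.0, "local-2": 2.5,
    "metro-1": 9.0, "metro-2": 11.0,
    "remote-1": 48.0, "remote-2": 52.0,
}
print(pick_read_set(observed, required_segments=4))
```

The chosen nodes can then be read in parallel, so the read completes roughly when the slowest node in the selected set responds.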
The Dispersed Storage Advantage
Another significant advantage with dispersed storage is that the storage manager is now managing just one copy of data that is fully protected. The algorithm automatically makes sure that enough segments are dispersed both locally and remotely to provide the level of protection it’s configured for.
Compare this to legacy storage where data has to be stored with especially inefficient RAID algorithms, like RAID 10, copied to a secondary local disk backup system, then replicated via a separate process to a remote facility and finally copied to tape for the “restore of last resort” copy.
Dispersed storage simplifies management and data protection all while reducing cost, space, power and cooling requirements. In the cost-competitive provider market, this is exactly what’s required to meet demands, while enabling providers to continue on-boarding new customers without escalating costs.
Cleversafe is a client of Storage Switzerland