The modern data center faces an ever-increasing torrent of data to store, manage and protect, much of which must be retained indefinitely to satisfy government rules, regulations and laws. File counts have soared from hundreds of thousands past hundreds of millions, and with machine-generated IoT (Internet of Things) data from sensors, surveillance cameras and the like, now reach into the billions. Organizations are not only struggling with the inability of legacy NAS systems to keep pace with this flood of unstructured data; they must also face the reality that, in spite of what storage vendors may claim, data loss in storage systems is unavoidable.
To meet the challenge of storing ever-increasing amounts of unstructured data, many organizations have turned to object storage, a technology that has matured over many years and can efficiently handle very large quantities of data. And while data loss may be unavoidable, selecting the right solutions and applying appropriate parameters can significantly reduce both the probability of data loss and the amount of data lost when it occurs.
Physical devices like hard disk drives (HDDs) and solid state drives (SSDs) can and do fail. Hardware failures can be caused by power loss to a disk while it is writing data, undetectable bit errors during writes, and unrecoverable read errors (UREs). Software can malfunction due to power failures during operation or flaws in the program, and there is also the human factor: human error or malicious intent. Any of these can result in data corruption or loss.
The First Line of Defense
Data is most vulnerable when the first instance is created and written to the disk storage system. At this point there is only one copy of the data, not yet duplicated anywhere else in the system. The data needs to be written to stable storage, such as a hard disk or SSD, which retains data without power once it has been written to the media. But because hard disks and SSDs can fail or malfunction, the data also needs to be written in a manner that allows it to be reconstructed and used again. The most common method of protecting data from corruption or loss is some type of parity-based scheme at the disk level: RAID in legacy file-based systems, or erasure coding (EC) in object-based storage systems. These methods allow the system to reconstruct data from a failed device.
RAID Reaches its Limit
For many years, RAID has been the method commonly used to protect data at the disk level in legacy NAS systems, with RAID 5 and RAID 6 being the most common configurations. In a RAID array, data blocks are striped across the drives, and a parity checksum of the block data is computed and written to the equivalent of one or two drives, depending on the RAID level. In the event of a drive failure, the system can rebuild the missing drive's data onto a replacement drive.
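The mechanics of that rebuild can be sketched in a few lines. The toy example below (hypothetical 4-byte blocks, single XOR parity as in RAID 5) shows how the parity block lets the array recompute any one missing data block from the survivors:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A stripe of data blocks as written across a RAID 5 array (toy 4-byte blocks).
data_blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xAA\xBB\xCC\xDD"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data_blocks)

# Simulate losing drive 1: XOR the surviving blocks with parity to rebuild it.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)

assert rebuilt == data_blocks[1]
```

A real controller repeats this for every stripe on the failed drive, which is why rebuild time scales with drive capacity.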
Unfortunately, as drives grow in capacity, rebuilding a failed drive takes longer and longer. With new drives holding 6 to 10 TB, the rebuild of a single drive can take 24 to 36 hours or more, and in a busy multi-terabyte system it can take days. During this time, data is at risk: should another drive fail during the rebuild, the data on the failed drive is lost. Array performance is also significantly degraded while the rebuild runs. Clearly, RAID has become problematic as a data protection method against hardware failure.
Erasure Coding as an Alternative
Object storage can efficiently store almost unlimited quantities of data by using a parity-based protection method, EC, applied at the object level. In EC, data is broken into fragments, which are then expanded and encoded with redundant data pieces and written across a set of different locations or storage media. EC is more resilient than RAID 6, and only the corrupt data needs to be recovered rather than an entire volume, so the chance of overlapping failures is significantly reduced. It also consumes less storage, typically requiring only 20-40% additional capacity. The downside to EC, however, is that it can be more CPU-intensive, which can result in increased latency.
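To make the fragment-and-parity idea concrete, here is a minimal sketch of a systematic Reed-Solomon-style code using Lagrange interpolation over a small prime field. (Production systems work over GF(2^8) with optimized arithmetic; the fragment counts and values below are illustrative only.) Data symbols sit at positions 1..k and parity symbols at k+1..k+m, so any k surviving fragments can rebuild any lost one:

```python
P = 257  # small prime field for the sketch; real EC uses GF(2^8)

def lagrange_eval(points, x0):
    """Evaluate the unique polynomial through `points` at x0, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x0 - xj) % P
                den = den * (xi - xj) % P
        # pow(den, P - 2, P) is the modular inverse of den (Fermat's little theorem)
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, m):
    """Systematic encode: data symbols at x=1..k, parity symbols at x=k+1..k+m."""
    k = len(data)
    pts = list(enumerate(data, start=1))
    return [lagrange_eval(pts, x) for x in range(k + 1, k + m + 1)]

# 4 data fragments + 2 parity fragments: the object survives any 2 losses.
data = [12, 7, 200, 33]
parity = encode(data, 2)

# Lose the data fragments at x=2 and x=4; rebuild each from any 4 survivors.
survivors = [(1, data[0]), (3, data[2]), (5, parity[0]), (6, parity[1])]
assert lagrange_eval(survivors, 2) == data[1]
assert lagrange_eval(survivors, 4) == data[3]
```

The per-symbol polynomial arithmetic is what makes EC more CPU-intensive than the simple XOR used by RAID 5.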
Replication Adds Data Redundancy
While erasure coding is an effective protection strategy, it is not necessarily suitable for all types of data. The simplest form of protection that ensures data redundancy is replication, which creates and maintains additional copies of the original object locally for fast retrieval in the event the original is somehow lost. An ideal system applies replication or EC automatically based on the data sets being protected.
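Replication itself is simple enough to sketch directly. The example below (the `replicate` helper and dict-backed stores are hypothetical stand-ins for disks, nodes or sites) writes three full copies keyed by a content checksum, so any single store failure still leaves intact, verifiable copies:

```python
import hashlib

def replicate(obj: bytes, stores: list, copies: int = 3):
    """Write `copies` full replicas into distinct stores; return the
    content checksum used as the object's key and read-time integrity check."""
    digest = hashlib.sha256(obj).hexdigest()
    for store in stores[:copies]:
        store[digest] = obj  # each store holds a complete copy
    return digest

# Three independent stores (dicts standing in for separate disks/nodes/sites).
stores = [{}, {}, {}]
key = replicate(b"customer-record-0042", stores)

# A single store failure still leaves intact copies elsewhere.
stores[0].clear()
assert any(s.get(key) == b"customer-record-0042" for s in stores[1:])
```

The trade-off versus EC is capacity: three full copies cost 200% overhead, against the 20-40% typical of erasure coding.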
The Second Line of Defense
A common data protection standard for many organizations is to keep at least three copies of the data in three different locations. One copy is stored off-site, and preferably off-line, to ensure that at least one copy of the data is secure from potential hacking, corruption or tampering. In the past this was accomplished with a strong backup server running enterprise-class backup software and writing the data to high-performance disk-based storage devices and/or large tape libraries. But today's massive quantities of data make this legacy backup method impractical.
First, it would take too long for legacy backup products to "walk" a massive file system with millions to billions of files and directories to identify the data that needs to be backed up, and then actually back it all up. With organizations needing access to all their data at any time, the old, traditional "backup windows" of eight to twelve hours or more are impractical. Second, even enterprise-class legacy backup products would "choke" when the file count hit a million or more files; they simply could not handle such large quantities of files. Backup products had to find a different way to handle so many files.
But a flexible object-based SDS (Software Defined Storage) system built on commodity hardware can provide a cost-effective way to store and protect massive amounts of data safely across various storage tiers, including cloud storage, while also managing the lifecycle of an organization's data.
Properly protecting very high file count systems is a key requirement for all organizations today, especially in light of the ever-increasing laws and regulations governing data protection and retention. Legacy NAS and the traditional file system are no longer able to keep pace with such massive quantities of data consisting of millions, or even billions of files. Storing this much data efficiently and reliably requires a new solution. A strong object storage system, with replication and erasure coding, is well suited for the task of storing and protecting such massive quantities of data in an efficient and safe manner.
Caringo was founded in 2005 to change the economics of storage by designing software from the ground up to solve the issues associated with data protection, management, organization and search at massive scale. Caringo’s flagship product, Swarm, eliminates the need to migrate data into disparate solutions for long-term preservation, delivery and analysis—radically reducing total cost of ownership. Today, Caringo software is the foundation for simple, bulletproof, limitless storage solutions for the Department of Defense, the Brazilian Federal Court System, City of Austin, Telefónica, British Telecom, Ask.com, Johns Hopkins University and hundreds more worldwide. Visit www.caringo.com to learn more.