The Myth of NoSQL Data Protection

Posted on September 20, 2018 by George Crump

There is a debate circling the topic of NoSQL data protection. A common myth is that because these environments automatically create replicas, data protection isn’t needed. Replicas protect the NoSQL environment from a hardware failure within a node or even a whole node failure. However, replicas do not protect the environment from data corruption caused by faulty code or by malware like ransomware. If one of these situations occurs, the replicas make it worse, because the cluster faithfully copies the corrupted or encrypted data to every replica.

While the replica concept is an ideal protection strategy against hardware failure, it does not provide point-in-time protection like a more traditional backup. The problem is that NoSQL environments typically work on an eventually consistent model, so backing them up in a consistent state is challenging. Legacy backup vendors didn’t design their solutions for massively scalable, eventually consistent workloads.

The Challenges of NoSQL Backup

There are three key challenges to protecting the typical NoSQL environment. The first is the volume of data in these environments. A single backup server can’t scale far enough to protect a large NoSQL environment adequately.

The amount of data also impacts the backup technique. The legacy method of performing a full backup once a week won’t work when that full backup may exceed 1 petabyte. Backup solutions need to back up these environments incrementally, but in an intelligent way.
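As a rough illustration of the incremental idea, the sketch below copies only items whose content has changed since the previous run. It is a minimal example under stated assumptions, not how any particular product works: the `incremental_backup` function, the in-memory `catalog` of fingerprints, and the use of SHA-256 to detect changes are all hypothetical choices made for this example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a content fingerprint used to detect changes between runs."""
    return hashlib.sha256(data).hexdigest()


def incremental_backup(current: dict[str, bytes], catalog: dict[str, str]) -> dict[str, bytes]:
    """Return only the items that are new or changed since the last run.

    `current` maps an item name to its bytes (a stand-in for data files).
    `catalog` maps an item name to the fingerprint recorded by the previous
    run; it is updated in place. Both structures are assumptions of this sketch.
    """
    changed = {}
    for name, data in current.items():
        digest = fingerprint(data)
        if catalog.get(name) != digest:
            changed[name] = data      # new or modified: include in this backup
            catalog[name] = digest    # remember the fingerprint for the next run
    return changed


catalog: dict[str, str] = {}
first = incremental_backup({"a": b"1", "b": b"2"}, catalog)               # everything is new
second = incremental_backup({"a": b"1", "b": b"3", "c": b"4"}, catalog)   # only "b" and "c"
```

The first run behaves like a full backup because the catalog is empty; later runs move only what changed. This sketch does not track deletions, which a real solution would also need to record.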

NoSQL environments are also distributed and may have hundreds if not thousands of nodes in their architecture. Placing a backup agent on each node is impractical. The distributed nature of the NoSQL cluster also means that administrators are adding and removing nodes all the time, so the legacy method of backing up a specific server won’t work.

The Requirements of NoSQL Backup

The number one requirement of a NoSQL backup solution is that it must scale to match the scalability of the environment it is protecting. Meeting the scale challenge requires an agentless scale-out architecture that utilizes techniques like deduplication to reduce storage requirements.
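To make the deduplication point concrete, here is a minimal sketch of a content-addressed chunk store, in which identical chunks are stored only once. The `DedupStore` class, the fixed chunk size, and the SHA-256 keys are assumptions chosen for illustration; production systems typically use much larger or variable-size chunks and persistent storage.

```python
import hashlib


class DedupStore:
    """Toy chunk store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size          # tiny on purpose, for the example
        self.chunks: dict[str, bytes] = {}    # hash -> chunk bytes

    def put(self, data: bytes) -> list[str]:
        """Store `data` and return its recipe: the ordered list of chunk hashes."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)   # store only if not seen before
            recipe.append(key)
        return recipe

    def get(self, recipe: list[str]) -> bytes:
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[key] for key in recipe)


store = DedupStore(chunk_size=4)
recipe = store.put(b"AAAABBBBAAAA")   # three chunks, two of them identical
restored = store.get(recipe)          # round-trips to the original bytes
```

In this example the store holds two unique chunks for a three-chunk input, which is where the storage savings come from.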

The second requirement is that the NoSQL backup application must be data aware. The backup application needs to understand the differences among Cassandra, MongoDB, Hive and others. Data awareness enables the NoSQL backup to run agentless, and it enables granular recoveries. It also allows the solution to determine database consistency. Finally, data awareness provides better deduplication rates. If it understands the actual data, the NoSQL backup application can better determine data redundancies.

The third requirement is for the solution itself to incorporate machine learning. Machine learning can detect backup anomalies like unprecedented data change rates caused by a ransomware attack or user deletions.
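The kind of check described here can be approximated, in very simplified form, by comparing the latest change rate against recent history. The sketch below uses a plain z-score rather than a trained machine learning model, so it is a statistical stand-in for the idea and not a description of any vendor's method. The `is_anomalous` function, the threshold of 3.0, and the sample change rates are all hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `threshold` standard deviations
    from the mean of `history` (for example, percent of data changed per backup)."""
    if len(history) < 2:
        return False                  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu           # any deviation from a perfectly flat history
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily change rates, in percent of the data set.
history = [2.0, 2.5, 1.8, 2.2, 2.1]
normal_day = is_anomalous(history, 2.3)     # close to the usual rate
suspect_day = is_anomalous(history, 40.0)   # sudden mass change, worth an alert
```

A sudden jump in changed data, such as mass encryption by ransomware, stands far outside the usual range and is flagged, while ordinary day-to-day variation is not.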

Beyond Backup

In addition to data protection, these environments need data orchestration to move data for cloud migration or test/dev purposes. Data awareness helps with data orchestration. For example, when making a copy for test purposes, the solution can mask private information out of the development copy.
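A minimal sketch of the masking step might look like the following. The field names in `SENSITIVE_FIELDS`, the `mask_record` function, and the record layout are hypothetical, chosen only to illustrate the idea. Replacing values with a truncated hash keeps masked values consistent across records, but an unsalted hash of a guessable value can be reversed by brute force, so a real tool would more likely use keyed hashing or tokenization.

```python
import hashlib

# Hypothetical set of fields to treat as private in this example.
SENSITIVE_FIELDS = frozenset({"email", "ssn"})


def mask_record(record: dict, sensitive: frozenset = SENSITIVE_FIELDS) -> dict:
    """Return a copy of `record` with sensitive fields replaced by a
    deterministic placeholder; the original record is left untouched."""
    masked = {}
    for key, value in record.items():
        if key in sensitive:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[key] = f"masked-{digest[:12]}"   # same input gives same placeholder
        else:
            masked[key] = value
    return masked


production_row = {"id": 1, "email": "a@example.com", "plan": "pro"}
dev_row = mask_record(production_row)   # "email" is replaced, other fields are kept
```

Because the placeholder is deterministic, joins and lookups on the masked field still line up in the development copy.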

Finally, these NoSQL environments also need automation. The data volumes common in NoSQL environments make them almost impossible for a human to manage. The software needs to determine the best way to achieve goals like RPO and RTO, and to help with scheduling decisions such as the best time to run backups.

StorageSwiss Take

NoSQL backup is a must. The replica protection strategy only protects against hardware failure. The risk to data from user error or malicious attack is much higher. The problem is that a tried-and-true legacy backup solution won’t work, since these systems are so distributed. It takes an application written from the ground up to protect NoSQL environments.

Watch our most recent Lightboard video “How to Backup NoSQL” to learn why NoSQL backups are so critical and what the requirements of these applications are.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
