Protecting MongoDB, Cassandra, Hadoop – Datos IO Briefing Note

Posted on December 13, 2017 by George Crump

Data center modernization usually includes a move to modern cloud applications like MongoDB, Cassandra and Hadoop. As with most new initiatives, data protection is often the forgotten element. These environments are particularly challenging to protect because they are designed to run within a cluster where the application's data is horizontally partitioned (sharded) across nodes in the cluster. While this design is ideal for leveraging the compute cluster, it creates unique challenges for the data protection process.

The Protection Challenge of Modern Applications

From a data perspective, modern applications leverage an “eventually consistent” data storage technique. The use of this method means that creating a known good copy is difficult, since no single node is guaranteed to have the latest copy of the data at any given moment. In theory, the only way to capture a clean copy is to stop all writes to the cluster, let all nodes synchronize and then copy the data to backup storage.
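To make the problem concrete, here is a minimal sketch of two replicas that have not yet converged. The versioned key/value layout and the `merge_latest` helper are assumptions made for illustration; real databases reconcile replicas with their own mechanisms, such as write timestamps.

```python
# Hypothetical model: each node maps key -> (version, value).
# A higher version number means a newer write.
node_a = {"cart:7": (3, "3 items"), "cart:9": (1, "1 item")}
node_b = {"cart:7": (2, "2 items"), "cart:9": (2, "empty")}


def merge_latest(*nodes):
    """Return the newest (version, value) seen for every key on any node."""
    merged = {}
    for node in nodes:
        for key, (version, value) in node.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    return merged


latest = merge_latest(node_a, node_b)

# Keys for which each node holds an out-of-date copy.
stale_in_a = [key for key in latest if node_a.get(key) != latest[key]]
stale_in_b = [key for key in latest if node_b.get(key) != latest[key]]
```

In this toy cluster each node is stale for a different key, so copying either node alone would capture an inconsistent backup.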

Another challenge in protecting modern applications is a lack of understanding of the need for protection. Modern applications are designed to be distributed across sites, and many IT professionals assume that they are “self-protecting.” The problem is that if one site has an outage and another site takes over, the database is left in an inconsistent state. Making the database consistent again can take days or even weeks. The database will eventually recover, but it may need to be offline while the recovery process takes place. Additionally, some organizations believe that their data is protected because they use replication. While replication is good for supporting availability requirements, it does not protect against basic issues like user error, ransomware, and data corruption, because the bad change is replicated along with everything else. For these you need a point-in-time backup of your data from which to recover.

Enter Datos IO

Datos IO RecoverX is a cloud data management platform designed specifically to provide data protection for these next-generation applications. It supports non-relational databases like MongoDB, DataStax, and Apache Cassandra, as well as big data file systems from Hadoop distributions like Cloudera and Hortonworks. RecoverX can back up these applications running in almost any cloud and then store the backups in that cloud, another cloud or even your data center.

Datos IO RecoverX is built upon the company’s Consistent Orchestrated Distributed Recovery Engine (CODR). CODR runs application listeners that monitor the semantic changes to the database. It captures those changes through publicly available interfaces such as the MongoDB oplog or Cassandra's cqlsh. The first backup is a full copy of the database.

RecoverX then continues to “listen” to database changes across the entire cluster and replicates those changes to storage it controls. It stores backup copies in a native format, enabling IT to restore portions of the database or the entire database without any referential integrity issues, which avoids the need for a lengthy repair process. Since RecoverX was designed from the outset to back up distributed, scale-out databases, Datos IO also designed it to be scale-out, so it can scale with the environments it protects.
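The general pattern described here, one full copy followed by a stream of captured changes, can be sketched as follows. This is a toy model under stated assumptions, not Datos IO's implementation: the `Change` record, the integer timestamps, and `restore_as_of` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One captured change (hypothetical format)."""
    ts: int                # logical timestamp of the change
    key: str               # primary key of the affected record
    value: Optional[Any]   # new value, or None when the record was deleted


def restore_as_of(full_backup: Dict[str, Any],
                  log: List[Change],
                  ts: int) -> Dict[str, Any]:
    """Rebuild the data set as it looked at time `ts`.

    Starts from the initial full copy and replays, in timestamp order,
    every captured change whose timestamp is not later than `ts`.
    """
    state = dict(full_backup)  # never mutate the stored full backup
    for change in sorted(log, key=lambda c: c.ts):
        if change.ts > ts:
            break
        if change.value is None:
            state.pop(change.key, None)
        else:
            state[change.key] = change.value
    return state


full = {"user:1": "alice", "user:2": "bob"}
log = [
    Change(ts=10, key="user:2", value="robert"),
    Change(ts=20, key="user:3", value="carol"),
    Change(ts=30, key="user:1", value=None),
]

at_20 = restore_as_of(full, log, 20)
at_30 = restore_as_of(full, log, 30)
```

Because every change carries a timestamp, any point between the full copy and the latest captured change can be reconstructed without quiescing the cluster again.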

Datos IO is quick to mention that a key element of its products is semantic deduplication. Most deduplication is block-based, and most backup dedupe is done by first unpacking the backup format and then applying a block-based deduplication algorithm. Semantic deduplication looks at the actual changes, such as a single element in a single table, which allows deduplication at a much finer granularity. As a result, Datos IO claims a 10X bidirectional move efficiency over other deduplication methods.
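As a rough illustration of record-level granularity, the sketch below deduplicates by hashing each row's logical content rather than fixed-size blocks of bytes. It is an assumption about what "semantic" deduplication could look like in the simplest case; the `SemanticStore` class is hypothetical and is not a description of Datos IO's algorithm.

```python
import hashlib
import json


class SemanticStore:
    """Toy record-level deduplicating store (hypothetical)."""

    def __init__(self):
        self.records = {}  # content hash -> row

    def ingest(self, rows):
        """Store rows not seen before; return how many were new."""
        new = 0
        for row in rows:
            # Canonical JSON makes the hash independent of field order.
            canonical = json.dumps(row, sort_keys=True).encode("utf-8")
            digest = hashlib.sha256(canonical).hexdigest()
            if digest not in self.records:
                self.records[digest] = row
                new += 1
        return new


monday = [
    {"id": 1, "name": "alice", "plan": "free"},
    {"id": 2, "name": "bob", "plan": "free"},
    {"id": 3, "name": "carol", "plan": "pro"},
]
# Same table a day later: one field changed, and rows arrive in a new order.
tuesday = [
    {"id": 3, "name": "carol", "plan": "pro"},
    {"id": 1, "name": "alice", "plan": "free"},
    {"id": 2, "name": "bob", "plan": "pro"},
]

store = SemanticStore()
first_pass = store.ingest(monday)
second_pass = store.ingest(tuesday)
```

Only the one changed row is stored on the second pass, even though the physical order of the rows differs between the two backups.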

New in RecoverX 2.5

The latest release of RecoverX, version 2.5, continues Datos IO’s pattern of twice-a-year updates. In this release, the company has advanced the recovery capabilities of RecoverX. Recoveries can now be driven by SELECT queries to recover specific columns and rows from a backup. Queryable recovery allows more accurate and specific recoveries, which reduces recovery time and space requirements. It has the added benefit of allowing sensitive columns to be excluded from the recovery process.
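The idea of a query-driven recovery can be sketched with Python's built-in `sqlite3` module standing in for the backup copy. The table, its columns, and the `queryable_recover` helper are assumptions for illustration only; RecoverX's actual query interface is not shown in this note.

```python
import sqlite3

# An in-memory SQLite database plays the role of the backup copy.
backup = sqlite3.connect(":memory:")
backup.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, region TEXT, ssn TEXT)"
)
backup.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "EU", "111-11-1111"),
        (2, "bob", "US", "222-22-2222"),
        (3, "carol", "EU", "333-33-3333"),
    ],
)


def queryable_recover(conn, query, params=()):
    """Run a SELECT against the backup and return the rows as dicts."""
    cursor = conn.execute(query, params)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Recover only EU users, and leave the sensitive ssn column behind.
recovered = queryable_recover(
    backup,
    "SELECT id, name FROM users WHERE region = ? ORDER BY id",
    ("EU",),
)
```

The WHERE clause limits which rows are restored and the column list limits which fields are restored, so less data moves and the sensitive column never leaves the backup.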

Recovery is also enhanced with incremental recovery. While there are many techniques to limit the amount of data being backed up, there are few to limit how much data is transferred in a restore. Incremental recovery allows the recovery of only the data that changed between two points in time, instead of everything as of a single point in time. Not only does incremental recovery restore less data, it also establishes an archive capability.

A database archive can be established by telling RecoverX to “restore” data to an archive storage area. Remember that RecoverX supports a wide variety of storage platforms and stores data in its native format. After the initial “restore” is complete, RecoverX can be directed to only “restore” new or modified data since the last archive job ran.
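A minimal sketch of that archive workflow, assuming a change log of `(timestamp, key, value)` tuples in which a value of `None` marks a delete. The `delta_between` and `apply_delta` helpers are hypothetical names, not RecoverX functions.

```python
from typing import Any, Dict, List, Optional, Tuple

# (timestamp, key, new value or None for a delete) -- hypothetical format
LogEntry = Tuple[int, str, Optional[Any]]


def delta_between(log: List[LogEntry], start: int, end: int) -> Dict[str, Any]:
    """Net change per key for entries with start < timestamp <= end."""
    delta: Dict[str, Any] = {}
    for ts, key, value in sorted(log, key=lambda entry: entry[0]):
        if start < ts <= end:
            delta[key] = value  # later entries overwrite earlier ones
    return delta


def apply_delta(target: Dict[str, Any], delta: Dict[str, Any]) -> None:
    """Apply a delta to the archive copy in place."""
    for key, value in delta.items():
        if value is None:
            target.pop(key, None)
        else:
            target[key] = value


log: List[LogEntry] = [
    (10, "a", 1),
    (20, "b", 2),
    (30, "a", 3),
    (40, "b", None),
]

archive: Dict[str, Any] = {}
apply_delta(archive, delta_between(log, 0, 20))  # initial archive "restore"
after_first_job = dict(archive)

second_delta = delta_between(log, 20, 40)        # only changes since last job
apply_delta(archive, second_delta)
```

The second job moves two net changes rather than the whole data set, which is the transfer saving the incremental approach is after.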

RecoverX 2.5 also adds the capability to back up from anywhere and to recover to anywhere. The customer benefits from local backup of geo-distributed clusters as well as local recovery for improved recovery time objectives. It also eliminates the need for shared storage across sites.

Lastly, RecoverX 2.5 improves its encryption, authorization and authentication, bringing them to enterprise grade. The security improvements include TLS/SSL encryption, LDAP authorization and Kerberos authentication.

StorageSwiss Take

Datos IO is in one of the most influential positions of any of the recent startups we’ve spoken with. The modern IT stack is distributed across nodes and sites, yet very few vendors, and at this point no one except Datos IO, are addressing its unique data management needs. Its uniqueness is reminiscent of Veeam’s rise as virtualization went from science experiment to production requirement, but unlike in Veeam's case, there appears to be a higher barrier to entry to the market for modern application data protection.

For companies rolling out modern applications, Datos IO is not only worthy of strong consideration; other than trusting the cluster, there may be no other choice.

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
