Backing up MongoDB and Apache Cassandra

Posted on November 2, 2016 by wcurtispreston

Datos IO is the first product designed specifically to meet the cloud-scale backup and recovery needs of modern, scalable, non-relational databases such as MongoDB and Apache Cassandra (DataStax), and cloud-native databases such as Amazon DynamoDB, Microsoft DocumentDB and others. These databases, and the cloud-native applications written on them (analytics, IoT, and eCommerce, for example), don’t play by the rules we are used to in traditional data protection. This product analysis will therefore first cover how and why these new-age databases are so different from traditional relational databases, and then analyze how Datos IO is addressing the problem.

The growth of modern databases

The verdict is in: mainstream SQL databases are out, and modern, non-relational databases (NoSQL, key-value, graph, cloud, and others) are here to stay. Major database companies experienced negative growth in the last year. Oracle’s revenue growth was -4.3% and IBM’s growth was -10.6%.

If people are not buying Oracle, SAP, or IBM, what are they buying? Many believe the answer starts with companies deploying non-relational databases such as MongoDB and Cassandra to support big data and web-scale applications. For example, DB-Engines stats show that new-age data stores are all in the top-10 category, which would explain both the decrease in revenue growth for traditional database vendors and the increase in revenue for cloud companies, since the majority of these next-generation databases will run on services-based infrastructure such as Amazon AWS, Google Cloud and Microsoft Azure.

Backup Status Quo

The challenge with these modern database systems is the simple matter of backup and recovery: protecting these databases from deletion or corruption seems to be a low-priority item in the development lifecycle. This is another example of a new type of product that is changing the world as we know it, much as VMware did with server virtualization many years ago. Unfortunately, just as VMware initially had no way to easily back up a virtual machine, there are no good backup and recovery options built into these next-generation databases either.

Consider a MongoDB sharded cluster, where the problem is summed up best by quoting from the MongoDB manual, “To capture a point-in-time backup from a sharded cluster you must stop all writes to the cluster. On a running production system, you can only capture an approximation of point-in-time snapshot.” File system snapshots and mongodump are two supported methods to back up a sharded cluster, but neither method offers the type of consistent backup that traditional database backups do. Since updates continue as a backup or snapshot occurs, each part of the backup will come from a different point in time, creating referential integrity issues. MongoDB says customers can use the journal to make the database consistent in a recovery. The reality is that this recovery process can take many days to complete, resulting in application downtime and ultimately loss of business.
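
To make the referential integrity problem concrete, here is a minimal Python sketch. It is a simulation only and does not talk to MongoDB: the two "shards" are plain dictionaries, and the names snapshot_shard, dangling_orders, and the customer/order records are hypothetical. It shows how copying one shard before a write and another shard after it produces a backup that never matched any real state of the cluster.

```python
import copy

# Two hypothetical shards, modelled as plain dictionaries (not real MongoDB).
customers_shard = {}   # customer_id -> name
orders_shard = {}      # order_id -> customer_id


def snapshot_shard(shard):
    """Return a point-in-time copy of a single shard."""
    return copy.deepcopy(shard)


def dangling_orders(customers, orders):
    """List orders whose customer is missing from the customer data."""
    return sorted(oid for oid, cid in orders.items() if cid not in customers)


# t=1: the backup copies the customers shard first.
customers_backup = snapshot_shard(customers_shard)

# t=2: the application keeps writing while the backup is still running.
customers_shard["c1"] = "Alice"
orders_shard["o1"] = "c1"

# t=3: the backup reaches the orders shard.
orders_backup = snapshot_shard(orders_shard)

# The live cluster is consistent, but the backup is not.
live_problems = dangling_orders(customers_shard, orders_shard)
backup_problems = dangling_orders(customers_backup, orders_backup)
print(live_problems)    # []
print(backup_problems)  # ['o1']
```

The restored backup would contain an order for a customer that does not exist in it, which is exactly the kind of inconsistency a post-restore repair has to find and fix.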

The only way to get a completely consistent backup is to use the db.fsyncLock call, which halts all writes to the database. Besides the fact that this stops any updates while the backup is running, it also “may block reads, including those necessary to verify authentication.” The suggested solution is to lock only a secondary member of each replica set and then back up that member. The problem is that replicas are typically updated via asynchronous replication that only supports eventual consistency. In other words, there is no supported way to get a consistent backup of a running MongoDB database with the included tools.

There does appear to be an option for those running their database on the MongoDB Cloud Manager or Ops Manager. It works by creating an initial replica of the database and then updating it by watching the oplog, a log of semantic changes to the database. One issue with this is that not everyone wants to host their database in MongoDB’s Cloud Manager or pay for the Enterprise Advanced subscription so they can use Ops Manager. (It costs three times as much as the basic license.) Another issue is that each option only supports backing up locally: the cloud version backs up to the cloud, and the on-premises version backs up to local storage and can’t back up to the cloud.
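
The "initial replica plus oplog" idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the Cloud Manager or Ops Manager implementation: the log entry format (ts, op, key, value) is hypothetical and much simpler than real MongoDB oplog entries, the database is a plain dictionary, and the log is assumed to be sorted by ts.

```python
def apply_op(data, op):
    """Apply one logged change to a dictionary-based copy of the database."""
    kind, key = op["op"], op["key"]
    if kind in ("insert", "update"):
        data[key] = op["value"]
    elif kind == "delete":
        data.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {kind}")


def sync_backup(backup, oplog, last_applied):
    """Replay log entries newer than last_applied; return the new position."""
    for op in oplog:
        if op["ts"] > last_applied:
            apply_op(backup, op)
            last_applied = op["ts"]
    return last_applied


# Initial copy of the database, taken at log position 2.
backup = {"a": 1, "b": 2}
position = 2

# The change log; ts is the position of each entry in the log.
oplog = [
    {"ts": 1, "op": "insert", "key": "a", "value": 1},
    {"ts": 2, "op": "insert", "key": "b", "value": 2},
    {"ts": 3, "op": "update", "key": "a", "value": 10},
    {"ts": 4, "op": "delete", "key": "b"},
    {"ts": 5, "op": "insert", "key": "c", "value": 3},
]

position = sync_backup(backup, oplog, position)
print(backup, position)  # {'a': 10, 'c': 3} 5
```

Because the backup remembers the last log position it applied, each later sync only has to replay the entries written since then.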

Things are similar or worse for Apache Cassandra (DataStax) users. The Cassandra manual admits that its backup method provides only an eventually consistent backup, which then relies on Cassandra’s built-in consistency mechanisms to repair itself. Again, the problem is that this repair process could take many days or even weeks to return the database to a consistent state, resulting in application downtime and ultimately loss of business.

Datos IO

Datos IO is the first multi-database platform backup option for such non-relational databases that addresses all of these concerns. Using published APIs, it creates a backup copy of all your key/value pairs, stored in native format, that you can use to restore portions of your database or the entire thing without the referential integrity issues that create the need for a long consistency repair process. The system can recover the most recent version or any previous version of the database stored in the system. Knowing the system would back up some of the world’s largest databases, Datos IO built it from the beginning using a scale-out design that can run in the cloud or on-premises, and back up to the cloud or on-premises. Datos IO claims that its purpose-built and patent-pending semantic deduplication functionality saves ~70 percent on storage costs. Another reason for the cost advantage is that it charges only for protected terabytes, not for the number of nodes you protect. Datos IO understood that customer node count could go up or down, but data is one variable that can be reliably quantified over time.
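
As a rough illustration of why deduplication matters for replicated databases, consider the Python sketch below. Everything in it is an assumption made for illustration: the article does not describe how Datos IO’s semantic deduplication works, so the sketch simply fingerprints each logical record (key, value, write timestamp) and keeps one copy across replicas. The replication factor of three and the sample records are made up, and the resulting ratio should not be read as an explanation of the vendor’s claimed savings.

```python
import hashlib


def record_id(key, value, write_ts):
    """Fingerprint a logical record so identical replica copies collide."""
    raw = f"{key}|{value}|{write_ts}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def dedupe(replica_dumps):
    """Keep one copy of each logical record found across all replicas."""
    unique = {}
    total = 0
    for dump in replica_dumps:
        for key, value, write_ts in dump:
            total += 1
            fingerprint = record_id(key, value, write_ts)
            unique.setdefault(fingerprint, (key, value, write_ts))
    return unique, total


# Three replicas holding the same four records (replication factor 3, assumed).
records = [
    ("user:1", "alice", 100),
    ("user:2", "bob", 101),
    ("user:3", "carol", 102),
    ("user:4", "dave", 103),
]
replica_dumps = [list(records), list(records), list(records)]

unique, total = dedupe(replica_dumps)
savings = 1 - len(unique) / total
print(len(unique), total, round(savings, 2))  # 4 12 0.67
```

A backup that copies every node’s files stores all twelve records; a backup that understands the records themselves needs to store only four.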

Datos IO’s Consistent Orchestrated Distributed Recovery Engine (CODR) runs application listeners that monitor the semantic changes to the database. It captures the changes through publicly available APIs that talk to things like the MongoDB oplog or Cassandra SSTables. The first backup is done by specifying how far back you would like to examine the logs, typically something like two weeks. Datos IO informs us that this is consistent with how people use such databases: they specify very short time-to-live (TTL) values for the data, and the system automatically purges key/value pairs that are older than a few weeks.
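
The interaction between the lookback window and TTL values can be shown with a small Python sketch. It is a simplified model, not Datos IO code: the function initial_backup, the log entry fields (key, ts, ttl), and the sample data are all hypothetical. The point is that when TTLs are no longer than the lookback window, every record that is still live falls inside the window, so examining only the recent logs is enough.

```python
DAY = 24 * 60 * 60  # seconds in a day


def initial_backup(change_log, now, lookback_days):
    """Select log entries inside the lookback window that have not expired."""
    window_start = now - lookback_days * DAY
    selected = []
    for entry in change_log:
        in_window = entry["ts"] >= window_start
        expired = entry["ts"] + entry["ttl"] <= now
        if in_window and not expired:
            selected.append(entry)
    return selected


now = 100 * DAY
change_log = [
    {"key": "k1", "ts": 70 * DAY, "ttl": 14 * DAY},  # outside window, expired
    {"key": "k2", "ts": 90 * DAY, "ttl": 5 * DAY},   # in window, but expired
    {"key": "k3", "ts": 92 * DAY, "ttl": 14 * DAY},  # in window, still live
    {"key": "k4", "ts": 99 * DAY, "ttl": 14 * DAY},  # in window, still live
]

kept = [e["key"] for e in initial_backup(change_log, now, lookback_days=14)]
print(kept)  # ['k3', 'k4']
```

Note that k1, the only entry older than the two-week window, had already expired, so skipping it loses nothing.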

Since it is storing the changes to the database instead of trying to capture the storage holding the changes, it does not have the referential integrity issues mentioned above. Datos IO advertises this solution for enterprise use cases of operational recovery, automated refresh of test/dev, and migration and updates.
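
Here is a minimal Python sketch of why a backup built from recorded changes can avoid the referential integrity problem. It is not Datos IO’s implementation. It assumes each shard’s change log is sorted and carries timestamps that are comparable across shards, which is a strong assumption in a real distributed system. The function restore_as_of and the sample logs are hypothetical.

```python
import heapq


def restore_as_of(shard_logs, cutoff_ts):
    """Rebuild every shard's state from its change log up to one cutoff."""
    state = {name: {} for name in shard_logs}
    # Tag each change with its shard, then merge the sorted logs by timestamp.
    tagged = [
        [(ts, name, key, value) for ts, key, value in log]
        for name, log in shard_logs.items()
    ]
    merged = heapq.merge(*tagged, key=lambda change: change[0])
    for ts, name, key, value in merged:
        if ts > cutoff_ts:
            break
        if value is None:  # None marks a delete in this sketch
            state[name].pop(key, None)
        else:
            state[name][key] = value
    return state


# Per-shard change logs as (timestamp, key, value) tuples.
shard_logs = {
    "customers": [(2, "c1", "Alice"), (4, "c2", "Bob")],
    "orders": [(3, "o1", "c1"), (5, "o2", "c2")],
}

restored = restore_as_of(shard_logs, cutoff_ts=3)
print(restored)  # {'customers': {'c1': 'Alice'}, 'orders': {'o1': 'c1'}}
```

Because every shard is rebuilt to the same cutoff, the restored orders only reference customers that exist in the restored data, whichever cutoff is chosen.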

StorageSwiss Take

It’s amazing, though neither the first time nor the last, that a product would come to market without a good way to back itself up, and yet that appears to be the case for many of these ultra-scalable non-relational databases. All native methods built into the database products have major limitations, the biggest of which is a very lengthy recovery process that could take days or even weeks for a large cluster. It’s good to see a company addressing this need with a novel data protection solution designed from the ground up for these databases.

Datos IO RecoverX addresses the scale, referential integrity, recovery speed, and storage cost issues of the built-in options. We look forward to seeing the company expand from its initial Cassandra and MongoDB offerings to other platforms such as Apache HDFS, Apache Hive, BigTable, DocumentDB, Apache HBase, and Amazon DynamoDB. While Datos IO has yet to announce when it will release support for these additional platforms, they are all listed as products Datos IO hopes to add as soon as it can.

Sponsored by Datos IO

About wcurtispreston

W. Curtis Preston (aka Mr. Backup) is an expert in backup & recovery systems, a space he has been working in since 1993. He has written three books on the subject: Backup & Recovery, Using SANs and NAS, and Unix Backup & Recovery. Mr. Preston is a writer and has spoken at hundreds of seminars and conferences around the world. Preston’s mission is to arm today’s IT managers with truly unbiased information about today’s storage industry and its products.
