While growth in traditional structured databases is slowing, massively scalable databases that often run in the cloud are taking off. But like so many advances in technology, scale-out databases like MongoDB and Cassandra solve one group of problems and create another. Specifically, they are very hard to back up, for several reasons. That forces customers running these databases to choose from a number of substandard backup methodologies.
The most common method for protecting such databases is multi-site replication. Because all data is replicated to at least two locations, some believe it doesn’t need backup. Unfortunately, this method protects only against hardware failure, not against the most common problem: human error. If someone deletes a table, replication will simply replicate that deletion. This is why replication by itself is not considered a valid backup method.
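To make the replication problem concrete, here is a toy sketch in Python. The `Replica` class and `replicate` function are hypothetical names invented for illustration, not any real database's API. The only point is that a faithful replica copies a deletion just as faithfully as it copies a write.

```python
class Replica:
    """Toy in-memory stand-in for one site's copy of the data (hypothetical)."""

    def __init__(self):
        self.tables = {}


def replicate(op, replicas):
    """Apply the same operation to every site, as multi-site replication would."""
    kind, table, rows = op
    for replica in replicas:
        if kind == "put":
            replica.tables[table] = list(rows)
        elif kind == "drop":
            replica.tables.pop(table, None)


site_a, site_b = Replica(), Replica()
replicate(("put", "orders", [1, 2, 3]), [site_a, site_b])

# Human error: someone drops the table, and the drop is replicated too.
replicate(("drop", "orders", None), [site_a, site_b])

# No site still holds the table, so there is nothing to restore from.
surviving_copies = [r for r in (site_a, site_b) if "orders" in r.tables]
```

After the drop, `surviving_copies` is empty: both sites are healthy, and both have lost the data.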
The most common method for actually backing up such databases is to run a node-level backup of the data on one of the nodes of these multi-node systems. The problem with this method is that it provides only the ability to restore a node, not the ability to restore the database if something were to happen to multiple nodes or to the database as a whole, such as someone deleting a portion of it accidentally or maliciously. In addition, backing up the database this way requires a two-stage process in which data is first backed up to a staging device and then backed up via the backup system.
Another typical method for backing up any database is to use a snapshot. First you tell the database that you are performing such a snapshot (e.g., ALTER DATABASE BEGIN BACKUP in Oracle), and then you take the snapshot. The challenges with this in the scale-out database world are twofold. The first is that the multiple nodes making up the database probably do not share a single storage device, so a single snapshot is not possible. The bigger challenge is that these databases operate on the eventually consistent model, also known as optimistic replication. Changed data is replicated between nodes, and eventually all accesses to a given item will return the last updated value. However, for some period of time some parts of the system will return the previous value for an item. If the different nodes don’t use the same storage and the storage is never fully consistent, how does one obtain a snapshot of it? The answer is that you don’t.
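The snapshot problem can be illustrated with a small simulation. This is a sketch under stated assumptions: `Node`, `write`, and `apply_pending` are hypothetical names, and the "replication" is just a queue that a peer drains later. It is not how MongoDB or Cassandra actually move data, only a model of eventual consistency.

```python
from collections import deque


class Node:
    """Toy cluster node with its own storage and a queue of pending changes."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.inbox = deque()  # replicated writes received but not yet applied

    def apply_pending(self):
        """Catch up by applying every queued change in arrival order."""
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value


def write(coordinator, peers, key, value):
    """Apply the write locally at once; peers only get it queued for later."""
    coordinator.data[key] = value
    for peer in peers:
        peer.inbox.append((key, value))


a, b = Node("a"), Node("b")
write(a, [b], "x", 5)
b.apply_pending()        # both nodes now agree that x is 5
write(a, [b], "x", 9)    # node b has not applied this change yet

# Per-node snapshots taken at the same instant capture different values.
snap_a = dict(a.data)
snap_b = dict(b.data)
snapshots_agree = snap_a == snap_b

# Only after replication catches up do the live nodes converge.
b.apply_pending()
nodes_converged = a.data == b.data
```

Here `snap_a` holds the new value and `snap_b` the old one, so restoring from the pair would bring back a cluster that disagrees with itself, even though the live nodes converge a moment later.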
If either of these backup methods were used for a recovery, the database would have to go through a long recovery/repair process to address all of the referential integrity issues the restore would create. Estimates are that such a repair process could take weeks for a typical-size database. This is the state of the art for scale-out database backup.
Datos IO aims to fix that. It uses database APIs to track all of the changes to a given item. In simplest terms, this means it tracks that the value of a given item was changed from 5 to 9, not how or why it was changed to 9. But because it watches the changes that the database makes across the entire cluster, it is able to reproduce those changes during a restore. It stores the changes in a deduplicated backup format. If you restore from one of its backups, no repair process is necessary.
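The general idea of backing up by recording value changes can be sketched as follows. This is not Datos IO's implementation, whose internals are not described here; `Change`, `ChangeLog`, `record`, and `restore` are hypothetical names, the sequence number stands in for whatever cluster-wide ordering a real product would use, and deduplication is not modelled at all.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Change:
    seq: int   # order in which the change was observed across the cluster
    key: str
    old: Any   # value before the change (None if the item was new)
    new: Any   # value after the change (None marks a deletion in this toy)


class ChangeLog:
    """Hypothetical backup store that keeps only what changed, not how or why."""

    def __init__(self):
        self.changes: List[Change] = []

    def record(self, key: str, old: Any, new: Any) -> None:
        self.changes.append(Change(len(self.changes) + 1, key, old, new))

    def restore(self, up_to_seq: Optional[int] = None) -> Dict[str, Any]:
        """Rebuild state by replaying changes in order, optionally stopping early."""
        state: Dict[str, Any] = {}
        for change in self.changes:
            if up_to_seq is not None and change.seq > up_to_seq:
                break
            if change.new is None:
                state.pop(change.key, None)
            else:
                state[change.key] = change.new
        return state


log = ChangeLog()
log.record("x", None, 5)
log.record("x", 5, 9)
log.record("y", None, "hello")
log.record("x", 9, None)          # accidental deletion

latest = log.restore()            # state including the deletion
before_delete = log.restore(3)    # state just before the deletion
```

Because every change is replayed in one agreed order, the restored state is consistent by construction, and stopping the replay at change 3 recovers the item that was deleted at change 4.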
Datos IO is starting with MongoDB and Cassandra, but has already announced its intention to support most of the databases and applications that run in the cloud, including MySQL, Hadoop, and even Salesforce and Google Apps. Backing up a database by simply watching its changes is an interesting method, and Datos IO believes it is both scalable and adaptable to all of these applications.
MongoDB and Cassandra were too busy being awesome to bother with trivial things like backup, creating the need for something like Datos IO. If Datos IO is able to execute on its claims and goals, it will definitely be the first to fill this particular need.