One of the hardest things to protect is a server with millions of files on it. In the early days of my career, we specifically asked customers if they had such a server so we could develop special procedures to handle it as part of the data protection design. Eventually technology advanced, and thanks to image backups and block-level incremental backups the problem subsided. But now a new type of data, generated by machines, is appearing in data centers, and that data promises to kill backup.
In this StorageShort, Douglas Soltesz and I discuss some of the challenges of protecting machine-generated data.
While machines can create all kinds of data, the type of most concern is data from sensors, the Internet of Things (IoT), and the log files from the various systems an organization has in the data center. Most of the data these systems create is text. The data is compressible, but it is unique to a given time and location, so it can't be deduplicated. It also can't be recreated. If you lose my heart rate data for yesterday, you can't get yesterday back.
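The compressible-but-not-dedupable property is easy to see in practice. The sketch below simulates a day of per-second sensor readings in a hypothetical text format (the field names and record layout are my own invention, not any vendor's), then compares gzip compression against fixed-size chunk fingerprinting of the kind a dedupe appliance performs:

```python
import gzip
import hashlib
import random

random.seed(42)

# Simulate one day of per-second heart-rate log lines (hypothetical format).
lines = []
for t in range(86400):
    hr = 60 + random.randint(0, 40)
    lines.append(
        f"2024-01-01T{t//3600:02d}:{(t//60)%60:02d}:{t%60:02d} "
        f"hr={hr} lat=40.7128 lon=-74.0060"
    )
data = "\n".join(lines).encode()

# Compression works well: the records share a highly repetitive structure.
compressed = gzip.compress(data)
ratio = len(data) / len(compressed)

# Fixed-size chunk deduplication finds almost nothing: every chunk
# contains unique timestamps, so chunk fingerprints rarely repeat.
chunk_size = 4096
hashes = [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
          for i in range(0, len(data), chunk_size)]
duplicates = len(hashes) - len(set(hashes))

print(f"compression ratio: {ratio:.1f}x, duplicate chunks: {duplicates}/{len(hashes)}")
```

On this synthetic data, gzip achieves a healthy ratio while chunk-level dedupe finds zero duplicate chunks, which is exactly the behavior described above.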
Let’s use a simple real-world example. I’m training for another triathlon, and my fitness watch captures all sorts of data while I’m running, biking or swimming. Data points like distance, steps, heart rate (every second), GPS data, and a lot more are all captured. On a long run or ride the size of this text file can be 10MB or more. Yes, I actually export all the data from my watch and send it to another program so I have a backup of it. I have highly available and redundant fitness data in two clouds. But imagine the companies that store this data and provide me an interface to see it. They each have millions of users, and while not all are training at my level, they are tracking heart rates throughout the day. How are these companies protecting all of that data?
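To get a feel for the scale, here is a back-of-envelope estimate. The per-record size and user count are my assumptions for illustration, not any fitness vendor's actual numbers:

```python
# Back-of-envelope estimate of fleet-wide machine-data ingest.
SECONDS_PER_DAY = 86_400
BYTES_PER_SAMPLE = 50        # assumed size of one text record (timestamp, hr, GPS)
USERS = 10_000_000           # assumed active user count

daily_per_user = SECONDS_PER_DAY * BYTES_PER_SAMPLE   # bytes per user per day
daily_total = daily_per_user * USERS                  # fleet-wide daily ingest
yearly_total = daily_total * 365

print(f"per user/day:  {daily_per_user / 1e6:.1f} MB")
print(f"all users/day: {daily_total / 1e12:.1f} TB")
print(f"all users/year: {yearly_total / 1e15:.1f} PB")
```

Even with these modest assumptions, one sensor stream at one sample per second works out to roughly 4.3MB per user per day, tens of terabytes per day across the fleet, and petabytes per year, and that data is never deleted.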
Not only is there a lot of this data, it also has value over time. Again, using my training use case: I am constantly comparing runs, rides, and swims from this year to those from several years ago. I want the company storing the data to keep it forever (or at least a really long time). And thanks to applications like Hadoop, Splunk, and Spark, we have the ability to process and reprocess this data. Each of these companies is continuously improving its software to provide me with better training data. That is a competitive requirement; if they don’t, I’ll switch.
How Do You Protect Machine Data?
Given that keeping this data is a competitive requirement, protecting it is critical. The problem is that most backup software and hardware are ill-suited to protecting machine-generated data. The thought of having a backup application walk a file system, determine which of trillions of files are new, and then copy them to a backup appliance is a recipe for disaster, or at least for very long backup and recovery windows, especially in a world that is becoming more real-time by the moment.
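To see why the walk itself is the bottleneck, consider a minimal sketch of how an incremental backup discovers changed files (a simplified stand-in for what real backup software does, not any particular product's implementation):

```python
import os

def incremental_scan(root, last_backup_time):
    """Walk the tree and collect files modified since the last backup.

    The key cost: even when almost nothing changed, every directory must
    be read and every file stat()'d, so scan time grows with the TOTAL
    file count, not with the number of changes.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_backup_time:
                    changed.append(path)
            except OSError:
                continue  # file vanished between listing and stat
    return changed
```

With trillions of files, even at a very optimistic million stat() calls per second, just enumerating the namespace takes days, before a single byte is copied.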
This is a situation where an object storage system makes a lot of sense. Store the data directly on the object store and have that data either replicated or erasure coded, both on-premises and across sites. Object storage can handle the file count and make sure the data is protected and preserved for decades. To learn all about the limitations of backup hardware, watch our webinar “Four Reasons Why Your Backup & Recovery Hardware will Break by 2020,” now available on-demand.
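For readers unfamiliar with erasure coding, here is a toy illustration of the idea: split an object into k data shards, add parity, and rebuild any one lost shard from the survivors. Real object stores use Reed-Solomon codes that tolerate multiple failures; this sketch uses single XOR parity only to show the principle:

```python
from functools import reduce

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Split data into k equal shards plus one XOR parity shard.

    With single parity, any ONE lost shard (data or parity) can be
    rebuilt from the remaining k shards.
    """
    shard_len = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")       # pad to an even split
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))        # parity = XOR of data shards
    return shards

def rebuild(shards):
    """Replace the single missing shard (None) by XOR-ing the survivors."""
    missing = shards.index(None)
    survivors = [s for s in shards if s is not None]
    shards[missing] = reduce(xor_bytes, survivors)
    return shards
```

The storage overhead here is 1/k extra capacity per object, versus 100%+ for full replication, which is why erasure coding is attractive for machine data that must be kept for decades.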