In a recent blog my colleague W. Curtis Preston used the movie “Rogue One: A Star Wars Story” as an excellent example of why organizations should take great care in how they manage data archives. In that blog he suggests, “if you don’t have a reason to store data for long periods of time, don’t do it.” Sage advice. The problem is knowing which data you can delete and which data you want to delete.
The Classification Conundrum
To make sure you delete the right data at the right time, you need to classify ALL of your data. If you can master data classification then you can and should follow Curtis’ advice. Some companies believe it is easier to just store all data forever, or at least for a very long time.
An “all data forever” strategy is much easier than continually classifying data. But if you decide to store all data forever (or even for a really long time), then you need to fulfill certain responsibilities. Curtis’ blog gives an excellent rundown of those responsibilities: put it in an archive, not a backup; keep a copy of your archive offsite; keep it secure; and protect it.
Can We Keep Data Forever?
Architecturally, it is possible to design storage systems that can keep data forever. Modern-day object storage systems already scale to virtually limitless capacities. They can also continuously verify data to make sure that it is still readable and has not changed. Most systems also have the flexibility to integrate new hardware types as they come to market, as well as gracefully retire old hardware. The cost to store data for a long time is increasingly affordable, but it is not free. You should always weigh the true costs of storing data for a long time against classifying data and storing the appropriate data types for the appropriate lengths of time.
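To make the "continuously verify" idea a little more concrete, here is a minimal sketch of a checksum-based integrity check in Python. It is only an illustration, not how any particular object store does it: the function names (`build_manifest`, `verify`) and the in-memory dictionary standing in for the archive are assumptions made up for this example.

```python
import hashlib
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of a blob as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(objects: Dict[str, bytes]) -> Dict[str, str]:
    """Record a checksum for every object at the time it is archived."""
    return {name: sha256_of(blob) for name, blob in objects.items()}


def verify(objects: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Return the names of objects that are missing or no longer match."""
    damaged = []
    for name, expected in manifest.items():
        blob = objects.get(name)
        if blob is None or sha256_of(blob) != expected:
            damaged.append(name)
    return sorted(damaged)


# Hypothetical archive contents, held in memory for the sake of the example.
archive = {"plans.txt": b"exhaust port specs", "memo.pdf": b"quarterly memo"}
manifest = build_manifest(archive)

archive["plans.txt"] = b"exhaust port spec5"  # simulate silent corruption
print(verify(archive, manifest))
```

Run on a schedule, a check like this flags a damaged object while a good copy still exists elsewhere to repair it from.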
While a forever strategy will need some sort of data refresh mechanism to make sure old data is readable and can be converted, most data formats are becoming far more durable. Old versions of formats like TXT, PDF, JPG, PNG and others can be read for decades after creation. Obviously, proprietary formats need to be converted to something more standard or be refreshed periodically.
One of the key points of Curtis’ blog is that data you retain must be in an archive and not in a backup. I agree 100 percent! Ideally you should use backups ONLY to recover the most recent version of a data set; everything else falls into the archive process. The archive should also make sure that there are only two copies (one on-site and one at the DR site) of the data.
An archive enables the other aspects of a forever strategy that are critical for success, most importantly “finding data”. To Curtis’ point, you don’t want a discovery request to bury your organization. You want to dispense with the request quickly and you want to provide only the information you need to provide. Most email archive systems do an excellent job of this. Unstructured data archives are becoming much better at creating their own metadata as well as supporting user- or machine-generated tags. The amount of data in that archive should not impact the ability to respond to a request for data. That is the purpose of indexes and tags – to find data quickly.
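As a rough illustration of why tags and indexes keep search time independent of archive size, here is a small Python sketch of a tag index. Everything in it is hypothetical (the `ArchiveIndex` class, the document ids, the tags); real archive products build far richer indexes, but the principle is the same: look up the tag, not the data.

```python
from collections import defaultdict


class ArchiveIndex:
    """Toy inverted index: maps each tag to the ids of documents carrying it."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def add(self, doc_id, tags):
        """Register a document under each of its tags (case-insensitive)."""
        for tag in tags:
            self._by_tag[tag.lower()].add(doc_id)

    def find(self, *tags):
        """Return sorted ids of documents that carry every requested tag."""
        if not tags:
            return []
        matches = [self._by_tag.get(tag.lower(), set()) for tag in tags]
        return sorted(set.intersection(*matches))


# Hypothetical documents and tags, purely for illustration.
index = ArchiveIndex()
index.add("mail-001", ["email", "hr", "2016"])
index.add("mail-002", ["email", "finance", "2016"])
index.add("doc-103", ["contract", "finance"])

print(index.find("email", "2016"))
print(index.find("Finance"))
```

A discovery request then becomes a handful of set lookups, so only the matching documents are ever pulled from storage.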
Searchability is the reason you should not use backups for this purpose. Performing a single electronic discovery request against years of backups will cost your company far more than whatever an archive system would have cost you. Do not store backups for multiple years – period.
Should We Keep Data Forever?
If we have the ability to keep data forever, the bigger question is, “Should we?” Curtis gives some excellent reasons why you might want to delete data as soon as you responsibly can. The problem is that “delete” is a one-way street. Once you hit delete, the data is gone. Forever. First, what if there is a weakness in something you designed and you don’t want that weakness found out by someone else?
Spoiler Alert – If you haven’t seen “Rogue One: A Star Wars Story” you may want to stop reading here and go see the movie.
Curtis’ example was a Death Star. What if, instead of deleting that data, the Empire had kept it and kept running simulations on it to see where the weaknesses were? They could have fixed the problem, and Luke Skywalker might have been rendered an irrelevant footnote in the Empire’s march to domination of the galaxy.
A more earthly example may be a harassing email. In every case I have come across, the organization facing this situation is presented with the email either as a printed copy or as a copy that the offended party forwarded to a personal email account. Deleting the email does not help the organization because there are other copies of that email in other locations. In fact, deleting the email from the organization’s server makes the organization look bad if news of that email gets out from other sources.
I’ve also seen situations where a good email archive policy revealed that an employee who claimed to be offended had actually forwarded that email to friends with the header “ha ha isn’t this funny.” If the organization had an aggressive deletion policy, then it might have deleted the very email that would have saved it in a discovery effort.
There is risk with this approach, of course. That single request for an email may shed light on a much larger investigation. Anytime you are storing data, there is risk.
The reality is that the world in which we live has changed. If you do something wrong, personally or organizationally, and there is digital documentation of that “wrong”, it will more than likely come back to haunt you. And my opinion is that deleting the offending document only gives the organization a false sense of security. My witnesses for this include the NSA, the IRS, Sony and several presidential candidates. The better strategy, in my opinion, is for an organization to keep what it has, understand what it has done and address mistakes quickly and transparently.
Finally, the space limits of a blog entry don’t allow me to get into all the aspects of saving data for more positive reasons like Big Data Analytics. These initiatives count on using old data to make new and smarter decisions.
Keeping all data forever is a risky strategy, but so is deleting data. In the end it is up to the organization to decide if it wants to create a defensible deletion strategy or a “keep it forever” strategy.
What Curtis and I both agree on is that you need to take the management and preservation of your data very seriously, more so than ever. Also, data should move into an archive very soon after it becomes inactive. That archive needs to be scalable, protected and flexible for the future. Most importantly, it needs to be searchable for inexpensive identification and delivery of that data. If you do choose deletion over long-term preservation of your data, do remember this: it’s not gone just because you delete it.