In most data centers, unstructured data now consumes more storage capacity than all of the organization’s structured data combined. Yet organizations still often treat unstructured data like a second-class citizen when it comes to data protection. Because of its size and the sheer number of files, organizations tend to protect their unstructured data store with legacy backup solutions and outdated best practices. As a result, most unstructured data protection strategies fail to meet the current requirements for compliance, retention and multi-cloud support.
Unstructured Data Protection Needs Built-In Compliance
Other than email, no data set is the target (or source) of as many regulatory and legal requirements as unstructured data. Organizations must identify, segregate and in some cases set aside unstructured data to comply with a standing regulation or with a new legal hold triggered by a discovery request.
In theory, an organization should have a separate archiving process to support these compliance demands, but in reality most do not. A separate archive has proven expensive to establish and complicated to maintain over the long term. Organizations have been slow to adopt archiving because it requires a silo separate from backup, as well as the specialized storage and software needed to achieve the archive system’s goals.
In practice, most organizations count on the backup solution to be the archive. Trying to extend backup to be the organization’s archive, especially with legacy software, creates even more challenges. Most backup systems are not designed to meet compliance standards. They have no way to classify data by category or to make an on-demand backup of a particular data set to meet a legal hold. They also struggle at scale. A backup solution may need to store metadata about millions if not billions of files, and about each version of those files. The result is a backup database of massive proportions, which is susceptible to corruption and presents a backup challenge of its own.
Closely related to compliance is retention, but the retention use case is broader. Organizations may decide to retain information for reasons other than compliance. There is, of course, the “keep it just in case” use case that leads to files rarely being deleted from a file server or NAS. But there is also the legitimate need to retain information for possible future data mining. Retained data also needs to be verified periodically to confirm it has not degraded over time.
The problem is that most organizations have no idea exactly what information needs to be retained and for how long. Many retain all data, both on primary storage and on backups. The challenge is particularly acute for protected data sets because, again, the backup system stores multiple copies of the files and multiple copies of each version of those files. As a result, protection storage is often 5 to 10 times the size of primary storage and consumes large amounts of data center floor space, as well as organizational budgets.
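To see how multiple copies and versions add up, here is a rough back-of-the-envelope sketch. The function name and every input (100 TB of primary data, four retained full copies, a 2% daily change rate, 90 days of incrementals) are illustrative assumptions, not figures from any particular environment, and the estimate ignores deduplication and compression.

```python
def protection_storage_tb(primary_tb, full_copies, daily_change_rate, incremental_days):
    """Estimate backup capacity in TB, ignoring deduplication and compression."""
    # Each retained full backup is another complete copy of primary storage.
    fulls = primary_tb * full_copies
    # Each retained incremental holds roughly one day's worth of changed data.
    incrementals = primary_tb * daily_change_rate * incremental_days
    return fulls + incrementals


# Hypothetical inputs chosen only to illustrate the arithmetic.
primary = 100.0
total = protection_storage_tb(
    primary, full_copies=4, daily_change_rate=0.02, incremental_days=90
)
ratio = total / primary
print(f"{total:.0f} TB of protection storage, {ratio:.1f}x primary")
```

With these assumed inputs the estimate comes to 580 TB, or 5.8 times primary storage, which falls inside the range described above. Longer retention or more full copies pushes the multiple higher.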
Filling the Compliance and Retention Gap
There is a need to fill unstructured data’s compliance and retention gap. Governments and regulatory bodies are passing specific laws and regulations around data governance. Counting on organizations to adopt and implement a separate archive strategy is too optimistic. Archiving solutions have been available for decades and their adoption rate, especially compared to data protection, is too small to measure.
Data-Driven Backup
The mantra has always been “backup is not archive,” but given organizations’ willingness to invest in data protection rather than archive solutions, it may be more pragmatic to improve backup so it can fulfill two of archive’s most important responsibilities: retention and compliance. Most backup solutions are “job-driven” in that they back up a given mount point without regard to the type of data within that mount point.
Since most file servers and NAS systems hold a wide variety of data types, each with its own compliance and retention needs, it makes sense to change backup from “job-driven” to “data-driven”. While a data-driven backup could just protect a file server or NAS as a whole, it can also be designed to back up data by type. A data-driven backup requires a classification capability so it can organize data by type and/or location. The backup can create the classes automatically based on file type or directory location, or administrators can tag items manually to reflect organizational requirements.
With the tagging in place, the backup process can protect the file server or NAS and each protection pass is organized and performed according to these tags. The tags can have specific retention and compliance requirements associated with them, allowing the organization to meet both internal and external standards quickly and easily.
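The classification and tagging described above can be sketched in a few lines. This is a minimal illustration, not any product’s implementation: the `Policy` class, the rule tables, the directory names and the retention periods are all hypothetical. It assumes directory rules take precedence over file-type rules, and that the most specific matching directory wins.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Policy:
    """A tag plus the retention and compliance settings attached to it."""
    tag: str
    retention_days: int
    legal_hold: bool = False


# Hypothetical rules; a real deployment would load these from configuration.
DIRECTORY_RULES = {
    "/finance": Policy("financial-records", 7 * 365),
    "/legal/case-1234": Policy("litigation", 10 * 365, legal_hold=True),
}
EXTENSION_RULES = {
    ".pdf": Policy("documents", 3 * 365),
    ".log": Policy("logs", 90),
}
DEFAULT_POLICY = Policy("general", 365)


def classify(path: str) -> Policy:
    """Return the policy for a file: directory rule first, then file type."""
    p = PurePosixPath(path)
    # Collect every rule directory that is an ancestor of this file.
    dirs = [d for d in DIRECTORY_RULES if PurePosixPath(d) in p.parents]
    if dirs:
        # The longest matching directory is the most specific rule.
        return DIRECTORY_RULES[max(dirs, key=len)]
    return EXTENSION_RULES.get(p.suffix.lower(), DEFAULT_POLICY)


def plan_backup(paths):
    """Group files by policy so each protection pass is organized by tag."""
    groups = {}
    for path in paths:
        groups.setdefault(classify(path), []).append(path)
    return groups


if __name__ == "__main__":
    files = [
        "/finance/q1/report.pdf",
        "/legal/case-1234/memo.docx",
        "/home/alice/notes.PDF",
        "/var/app/server.log",
    ]
    for policy, members in plan_backup(files).items():
        print(policy.tag, policy.retention_days, policy.legal_hold, members)
```

Grouping by policy rather than by mount point is what makes the pass “data-driven”: a legal hold or a retention change applies to the tag, and every file carrying that tag follows.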
StorageSwiss Take
It’s undeniable that unstructured data is growing in almost every data center. Since this data is now useful for many purposes, it needs to be stored for the long term. It is ironic that the way organizations protect and store this data has not changed. Unstructured data protection may require a fresh approach, one built around protecting the data itself rather than the servers on which that data resides.
Sponsored by Aparavi