It comes as no surprise to any IT professional that unstructured user data is growing. What may surprise them is the rate at which that growth is occurring and the length of time that data needs to be retained. The good news is that, at least so far, the storage industry has been able to keep up with the capacity required to store this increase – almost to a fault. That “good news” is also bad news: potentially petabytes of information now sit in data centers with virtually no understanding of the underlying value, importance, or risk of storing it all.
Traditionally, an IT team may reach out to a storage vendor or storage integrator for a storage assessment engagement. The problem, other than the expense, is that storage assessments are often a one-time event the vendor uses to sell more capacity, and the data captured becomes meaningless a few weeks after it is created. Instead, a simple but ongoing process needs to be developed that provides better insight into the information stored across the enterprise, so that the cost to store data can be reduced, the value of that data can be realized and the risk associated with storing it can be mitigated. When a storage assessment becomes an integrated process, we call it “Data Profiling”.
What Is Data Profiling?
Data profiling is the process by which unstructured user data residing in a variety of sources is efficiently scanned, indexed and classified so that the appropriate actions can be taken against that data. It is not a part of backup or archive processes, although it can certainly augment and improve those processes, since it provides organizations a way to streamline how they ultimately manage and retain this information. It also has significant value over a one-time storage assessment: ongoing data profiling produces a living database of information that allows data to be better managed, better mined for future value, and data center costs to be trimmed and controlled throughout the year.
Data profiling is different from a storage assessment, which is typically performed by a storage vendor or integrator. The goal of most assessments is, essentially, to sell more storage to the end user. Data profiling, by contrast, becomes an integrated part of the data center’s storage management processes and is designed to reduce storage investment costs while increasing data intelligence.
How Data Profiling Works
Data profiling, available from vendors like Index Engines with their Catalyst Data Management platform, is typically implemented in the data center as a dedicated appliance or run as a virtual machine within the virtual infrastructure. It then scans a variety of sources, including network shares, email stores and document collaboration databases (like SharePoint), to categorize data by various metadata attributes such as creation and modification dates and individual and group ownership, as well as by a detailed content index, if desired.
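To make the scanning step concrete, here is a minimal sketch of metadata collection against a mounted file share, using standard Python and Unix-style ownership lookups. It is only an illustration of the kinds of attributes a profiling tool gathers, not Index Engines’ actual scan engine, and the mount point in the example is hypothetical.

```python
import grp
import pwd
from datetime import datetime, timezone
from pathlib import Path

def profile_share(root):
    """Walk a mounted share and collect basic metadata for each file."""
    records = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        try:
            owner = pwd.getpwuid(st.st_uid).pw_name
            group = grp.getgrgid(st.st_gid).gr_name
        except KeyError:                      # UID/GID no longer maps to a user
            owner, group = str(st.st_uid), str(st.st_gid)
        records.append({
            "path": str(path),
            "size_bytes": st.st_size,
            "owner": owner,
            "group": group,
            "modified": datetime.fromtimestamp(st.st_mtime, timezone.utc).isoformat(),
            "accessed": datetime.fromtimestamp(st.st_atime, timezone.utc).isoformat(),
        })
    return records

# Example (hypothetical mount point):
# inventory = profile_share("/mnt/shares/finance")
```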
A data profiling process is usually implemented long after TBs of information have been created and stored across a variety of storage repositories. It is important, then, that the data profiling solution be able to rapidly scan and ingest data stored on NAS shares, as well as scan and index offline copies of data, like those stored on legacy tape media created by backup and archive applications. If scanning, ingesting and indexing can be done quickly, continually updated scans of the environment can also be performed, so decisions about data contents are always based on the latest information.
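A continual scan only stays practical if repeat passes skip unchanged files. The sketch below, which assumes a simple map of path to last-recorded modification time, shows the idea of an incremental re-scan; a commercial solution would do this far more efficiently and would also handle tape and other offline sources.

```python
from pathlib import Path

def incremental_scan(root, previous_index):
    """Re-scan a share but only record files that changed since the last pass.

    previous_index maps file path -> last recorded modification time (epoch
    seconds). Returning only new or changed entries keeps repeat scans fast
    enough to run continuously.
    """
    updates = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        key = str(path)
        if previous_index.get(key) != mtime:
            updates[key] = mtime          # new or modified since the last scan
    return updates
```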
Data profiling uses current indexing and metadata extraction technology, which allows networks to be scanned with extreme speed and efficiency. The resulting metadata index is small and compressed, so it does not require significant storage capacity.
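As a rough illustration of how small such an index can be, the following sketch stores one compressed JSON line of metadata per file. The format is purely hypothetical; it simply shows that a few hundred bytes of metadata per file compress down to a tiny fraction of the capacity of the data being described.

```python
import gzip
import json

def write_index(records, index_path):
    """Persist metadata records as gzip-compressed JSON lines."""
    with gzip.open(index_path, "wt", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec) + "\n")

def read_index(index_path):
    """Stream records back out of the compressed index."""
    with gzip.open(index_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)
```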
A data profiling solution should also provide defensible data movement and deletion. Many of the tools used in today’s environments will copy and migrate data while corrupting its integrity, changing the owner to “admin” and the last-accessed date to “today”. Reliable data profiling produces an audit trail of each data removal or archive step that records who altered data in the environment and why. With this information recorded, organizations can explain their actions when needed, even years after the fact.
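The essential ingredients of that audit trail are who acted, when, on what, and why. The minimal sketch below only shows the information worth keeping; a real product would make the log tamper-evident (signed or written to WORM media), which this example does not attempt, and the log file name is an assumption.

```python
import getpass
import json
from datetime import datetime, timezone

AUDIT_LOG = "profiling_audit.jsonl"   # hypothetical append-only audit file

def audit_action(action, path, reason):
    """Append a defensible record of who did what to which file, and why."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operator": getpass.getuser(),
        "action": action,            # e.g. "delete", "archive", "migrate"
        "path": path,
        "reason": reason,            # e.g. "aged out per 7-year retention policy"
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")

# audit_action("archive", "/mnt/shares/finance/2009_budget.xls",
#              "not accessed in 5 years; owner left the company")
```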
The Data Profiling Payoff
Armed with continuously updated information about the enterprise’s online and offline data, organizations can make safe and accurate decisions about the data they are storing. The most obvious goal is using data profiling to reduce the cost of storing the organization’s information assets. This can be done in two ways. First, redundant copies of data can be identified and safely removed in a defensible fashion. Second, data that has long outlived its business value can be removed from the environment outright, again in a defensible fashion. This may be data that is old and never accessed, or data that is abandoned, having been created by users who are no longer with the organization. Redundant data can account for up to 30% of utilized storage capacity, with abandoned and aged data consuming an additional 40%. What organization wouldn’t want to migrate this content off of expensive primary storage systems and skip a year of capacity upgrades?
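As an illustration of how such candidates could be identified from the collected metadata, the sketch below hashes file contents to find exact duplicates and flags data that is aged or owned by former employees. The field names and thresholds are assumptions carried over from the earlier scanning sketch, not any vendor’s logic.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def find_redundant_and_aged(records, max_age_days=3 * 365, former_employees=()):
    """Group exact-duplicate files by content hash and flag aged or abandoned data.

    `records` are metadata dicts like those collected in the earlier sketch,
    carrying 'path', 'owner' and an ISO-8601 'accessed' timestamp.
    """
    duplicates = defaultdict(list)
    aged, abandoned = [], []
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for rec in records:
        with open(rec["path"], "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        duplicates[digest].append(rec["path"])
        if datetime.fromisoformat(rec["accessed"]) < cutoff:
            aged.append(rec["path"])
        if rec["owner"] in former_employees:
            abandoned.append(rec["path"])
    redundant = {d: paths for d, paths in duplicates.items() if len(paths) > 1}
    return redundant, aged, abandoned
```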
The second benefit is that the organization is now better prepared to handle litigation data requests. It can respond rapidly to eDiscovery and regulatory requests and more capably manage retention requirements. When a discovery request comes in or a new data compliance policy is set, a compliance manager, or even IT, can search across data assets simply and rapidly to identify the data impacted by that change or request. Data profiling, in effect, brings a “Google”-like search interface to the large majority of the unstructured and semi-structured data stored in the environment. Finding “John Doe’s” mailbox or a four-year-old contract becomes a quick and painless process instead of a disruptive and expensive proposition.
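A metadata search of this kind can be as simple as filtering the index by owner and a path keyword, as in the hypothetical sketch below; true content-level search would require the optional full-text index mentioned earlier.

```python
import gzip
import json

def search_index(index_path, owner=None, keyword=None):
    """Answer an eDiscovery-style question against the compressed metadata index."""
    with gzip.open(index_path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            if owner and rec["owner"] != owner:
                continue
            if keyword and keyword.lower() not in rec["path"].lower():
                continue
            yield rec

# Example (hypothetical index file and owner):
# hits = list(search_index("profile_index.jsonl.gz", owner="jdoe", keyword="contract"))
```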
The third benefit of data profiling is how it improves other processes in the data center. Primary storage is enhanced because only the data being routinely accessed is kept on that storage resource, which means the amount of premium-priced storage that needs to be purchased should be reduced significantly. The archive process should also benefit, as data can be moved to the archive based on value and not just the rudimentary “last modified date” parameter. Cost-effective cloud storage can be leveraged and populated with data that is aged but should be kept for longer-term retention. Policies can be applied by data owner, individual or group, as well as by specific content within the document.
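A tiering decision driven by value rather than just age might look like the following sketch, where owner and last access together pick the target tier. The policy shown (legal data to archive, anything untouched for two years to cloud) is purely an example, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

def choose_tier(rec, legal_owners=("legal",)):
    """Route a metadata record to a storage tier based on owner and access age.

    `rec` is a metadata dict with 'owner' and an ISO-8601 'accessed' timestamp,
    as in the earlier sketches; owner groups and thresholds are hypothetical.
    """
    age = datetime.now(timezone.utc) - datetime.fromisoformat(rec["accessed"])
    if rec["owner"] in legal_owners:
        return "archive"        # retain for compliance regardless of age
    if age > timedelta(days=2 * 365):
        return "cloud"          # aged but retained long term on inexpensive storage
    return "primary"            # active data stays on fast storage
```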
The backup process, however, may see the most benefit from a data profiling implementation. For example, the knowledge provided by data profiling instills greater confidence in the overall archival decision-making process: not only is the right information being archived, but the archive action is fully documented. With that data removed from primary stores, far less data, both in terms of size and number of files, needs to be protected, and the backup window can shrink significantly. Additionally, redundant data stored on disk and tape backup targets can be identified and either removed or consolidated.
Conclusion
The time for data profiling is now. The amount of unstructured data that organizations store is at an all-time high, as is the importance of retaining and re-using that data. In fact, Index Engines claims that, on average, 40 to 60% of data is misclassified, wasting valuable data center budget. Once data is properly classified, disposition strategies can be implemented and content can be tiered, migrated or even defensibly purged in order to optimize the data center. The challenge is that this data may still be needed at some point in the future. Data profiling allows data to be moved to less expensive media AND to actually be found when it is needed.
At the same time, the data requirements for litigation holds remain an important issue. The good news is that data profiling software can now take advantage of more powerful processors and faster networks, allowing for more rapid, yet more detailed, content ingestion.
Since the payback on data profiling can be very significant, it is no longer necessary for the organization to wait for a lawsuit to justify an investment in the technology. Data profiling provides an immediate ROI through improvements in primary storage efficiency, the confident use of archives and the reduction in backup workloads and storage capacity. These improvements will allow the data profiling investment to pay for itself long before it is needed to respond to a lawsuit.
Index Engines is a client of Storage Switzerland

I am not sure if I understand where Data Profiling is applied in the life cycle of data. Is continuous Data Profiling working as a gateway that monitors ALL data access? Is Data Profiling a periodic scan that updates its indexes?
I think that Data Profiling will provide a ‘sharp tool’ that someone will need to wield. Specifically, who creates the workflow and disposition rules based on the metadata collected by Data Profiling? It is one thing to add something to the data storage system that allows more control, but I believe we will need a much better understanding of best practices before the value of Data Profiling can be realized. Otherwise, we have more information about our data, but we do not make good use of it.
I am a big advocate of what you here call the profiling of data and agree with the potential benefits. This is part of such an intricate IT ecosystem, though, that it requires a set of rules around what to do with data once it has been profiled.
Collecting metadata on information is not very valuable if the information cannot be acted upon as part of an integrated process. Once profiled, the data in the catalog needs to be acted upon by a set of rules built on the customer’s internal data taxonomy. These rules are probably complex and numerous, e.g., what do I do with data that is owned by Kevin, has not been accessed in 2 years, is a legal document of type PDF and is over 10MB in size? I recently wrote a blog on this topic for another vendor in this space called Catalogic Software.
http://www.catalogicsoftware.com/en/Home/Blog/Catalogic-Software-Blog/February-2014/1-26-2014-Lee-Johns-Catalog-and-Manage
Folks like Index Engines and Catalogic Software are in a developing market space and it will be interesting to see the levels of automation and linkages into backup and recovery processes these vendors deliver over time.
Data profiling is a top priority where I work. It seems as though too many new employees are confused about the difference between data profiling and data discovery.