Data Profiling Grows Up – Index Engines 5.1

Data profiling allows organizations to report on and analyze file and email content in order to streamline their data center. By non-disruptively scanning NAS and server storage repositories, backup tapes and archives such as SharePoint, IT planners can gain deep insight into the relative business value of all of this stored content. Armed with this information, businesses can make better decisions about which data sets to delete to reclaim storage assets, when to migrate aged data to more cost-effective storage tiers, which sensitive data should be encrypted and which data sets to preserve for compliance-related purposes.

Automated Data Insight

Index Engines has spent the last decade helping organizations organize data center assets to make them more accessible, searchable and easier to manage. Their Catalyst Data Management Platform has taken these capabilities to the next level by building a full data profile of all the scanned information. Deployed as either a hardware appliance on the network or as a virtual appliance within a hypervisor, the Catalyst platform scans target storage resources and provides IT managers with a variety of reports on the disposition of data. For example, these reports break down which user owns the data, when it was last modified, the size of the file and how many versions of that file exist across the various storage resources in the environment. The platform can also optionally provide content-level indexing of the information within the file.
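To make the idea of a metadata profile concrete, here is a minimal Python sketch of the kind of record such a report is built from: owner, last-modified time, size, and a content hash used to spot multiple copies of the same file. This is our own illustration, not Catalyst's implementation; the function names `profile_tree` and `duplicate_groups` are hypothetical. Note also that reading files to hash them can update access times on some filesystems, which a non-disruptive scanner would have to avoid.

```python
import hashlib
import os
from collections import defaultdict
from datetime import datetime, timezone


def profile_tree(root):
    """Walk `root` and return one metadata record per regular file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            info = os.stat(path)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Hash in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            records.append({
                "path": path,
                "owner_uid": info.st_uid,  # numeric owner id (0 on Windows)
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                "sha256": digest.hexdigest(),
            })
    return records


def duplicate_groups(records):
    """Group paths whose content hashes match; keep groups of two or more."""
    by_hash = defaultdict(list)
    for record in records:
        by_hash[record["sha256"]].append(record["path"])
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `duplicate_groups(profile_tree("/some/share"))` on a directory of your choosing returns lists of paths with identical content, which is a rough stand-in for the "how many versions exist" figure described above.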

As we discussed in our recent article “Can Data Profiling Solve The Data Epidemic”, data profiling empowers IT managers to, in effect, conduct their own storage assessments: scanning data sources, extracting unwanted information and automating the movement of data across storage tiers to improve efficiency. It is also an enabling tool for legal preservation, as these scans do not modify metadata attributes such as last access times.

Enhanced ACL Scanning

Index Engines recently announced some important new features and capabilities with the release of their 5.1 software version. The first of these features is much deeper integration with Windows Active Directory. It is now possible to profile and report on content according to specific groups within the organization. For example, data profiling scans will now not only sort by active and inactive (ex-employee) user groups but also categorize file ownership at the departmental level – manufacturing, R&D, Human Resources, C-Level suite, etc.

In addition, it is now possible to view Access Control Lists (ACLs) to determine which users have permissions to read, write or browse data at the file level. This is important for validating whether certain employees should have or should not have access to sensitive business documents. By producing this information, data security personnel can take the appropriate action to grant or deny file access based on corporate policies.

Equally important, these scans provide organizations with better insight into sources of “abandoned data”, or data that was left behind by an employee who no longer works for the company. This data can then be purged or preserved based on the attributes of the file content.
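The department-level and abandoned-data reports described in this section boil down to joining file ownership against a directory of accounts. The Python sketch below shows that join in its simplest form. It is an illustration only: the input shapes, the `summarize_ownership` name, and the choice to treat owners missing from the directory as inactive are all our assumptions, and a real deployment would pull this information from Active Directory rather than a hand-built dictionary.

```python
from collections import defaultdict


def summarize_ownership(files, directory):
    """Total bytes per department and list files whose owner is not active.

    files:     iterable of (path, owner, size_bytes) tuples
    directory: dict mapping owner -> (department, is_active)
    """
    bytes_by_department = defaultdict(int)
    abandoned = []
    for path, owner, size_bytes in files:
        # Assumption: an owner missing from the directory counts as inactive.
        department, is_active = directory.get(owner, ("unknown", False))
        bytes_by_department[department] += size_bytes
        if not is_active:
            abandoned.append(path)
    return dict(bytes_by_department), abandoned


# Hypothetical sample data for illustration.
directory = {"alice": ("R&D", True), "bob": ("Manufacturing", False)}
files = [
    ("/share/design.doc", "alice", 100),
    ("/share/plan.xls", "bob", 250),
    ("/share/old.pst", "carol", 50),
]
by_department, abandoned = summarize_ownership(files, directory)
```

The `abandoned` list is then the candidate set for the purge-or-preserve decision, which would still depend on the content of each file.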

Exchange and SharePoint Data Intelligence

Another enhancement to version 5.1 is its support for Microsoft Exchange 2013. The ability to scan email is a fundamental component of any eDiscovery legal search. Finding and preserving specific user mailboxes can be performed via the web based interface, with retention policies set so data is released when the legal hold expires.

Any data that is profiled and deemed to have no business value can easily be purged from the network, including content in repositories such as legacy email PSTs that the scan locates. The technology also tracks file deletions and enables organizations to defend those deletions based on corporate data governance policies and statutory mandates.

One of the major upgrades to Catalyst 5.1 is its ability to index SharePoint data. As a platform that is used in part to facilitate document collaboration amongst users, SharePoint file repositories have quickly become bloated. This not only consumes primary storage resources but it can also impact SharePoint application performance. Version 5.1 can crawl through SharePoint file directories and index and profile all the content that exists, including all the various versions of the same file.

Efficient SharePoint Indexing

A key benefit of using Catalyst to report, profile and extract content in SharePoint repositories is that it doesn’t require an external copy of the data to be created first. Other tools require a full dump of the data to a secondary storage resource before any action can be taken on the data contents. This additional administrative step takes time and requires additional disk resources. With Catalyst 5.1, however, SharePoint content can be searched and indexed directly from the production copy. This saves organizations time and money when managing this archive.

The other challenge with creating a separate copy of SharePoint content for indexing is that when data is moved in this manner, it typically changes the file modification and access times. This is a major issue when data needs to be preserved for compliance and legal holds. Since SharePoint data can be scanned and indexed in place by Catalyst, without a physical move of the information, IT planners can manage, prune and clean SharePoint data while maintaining defensibility of the process.
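To see why timestamps matter here, consider what a naive scanner has to do to leave them untouched. The Python sketch below hashes a file and then puts the original access and modification times back with `os.utime`. This is a simplified illustration under our own assumptions (the `hash_preserving_times` name is hypothetical) and is not a description of how Catalyst scans in place. It also has limits: the caller needs permission to set times on the file, and on POSIX systems restoring the times still updates the inode change time.

```python
import hashlib
import os


def hash_preserving_times(path):
    """Return the SHA-256 of `path`, restoring its atime and mtime afterwards."""
    before = os.stat(path)
    digest = hashlib.sha256()
    try:
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    finally:
        # Put back the timestamps captured before the read, even on error.
        os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return digest.hexdigest()
```

Copying the same file to a staging area would typically give the copy fresh timestamps, which is the defensibility problem described above.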

Breaking Backup Vendor Lock-In

Perhaps one of the most popular Catalyst business use cases is the ingestion of backup catalogue data to allow data center managers to manage legacy data residing on extensive tape backup archives. Organizations often want to migrate off a legacy backup application and over to a newer, more feature-rich enterprise backup application. In the case of a legal request, the challenge is that there is no practical way to access all the data that may have been accumulating for years without maintaining an instance of the legacy backup software. This tends to severely limit the ability of organizations to make a change.

Catalyst 5.1 now supports the ingestion of Symantec NetBackup catalogue data so that all backup data written to tape in NetBackup’s proprietary format can be discovered, indexed and extracted. Storage administrators can search the NetBackup catalogue, find specific content and extract it off of tape without having the original software in place. This allows IT decision makers to retire NetBackup and move to alternate backup applications. Index Engines supports the same capabilities with IBM Tivoli Storage Manager (TSM) today with additional support for popular backup software vendors to follow.

Dashboard Driven Data Insight

The last enhancement to Catalyst 5.1 is the ability for users to create customized dashboards. Users can now create a custom view of reports that will refresh automatically. Some examples include aged data reports, a report on large files, personally identifiable information (PII) reports and PST location reports. In addition, it is possible to create federated reports based on aggregated information collected over several geographic locations. For example, a single, global report on PII can be created for data scanned and indexed across offices in New York, London and Tokyo.
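Conceptually, a federated report is an aggregation of per-site results into one view. The short Python sketch below merges per-site PII counts into a global total. The site data, the PII category names and the `federate` function are hypothetical, made up purely to illustrate the roll-up; they do not reflect Catalyst's report format.

```python
from collections import Counter


def federate(site_reports):
    """Merge per-site category counts into a single global count."""
    total = Counter()
    for counts in site_reports.values():
        total.update(counts)  # adds counts category by category
    return dict(total)


# Hypothetical per-site PII findings.
site_reports = {
    "New York": {"ssn": 3, "credit_card": 1},
    "London": {"ssn": 2},
    "Tokyo": {"credit_card": 4},
}
global_report = federate(site_reports)
```

Keeping the per-site dictionaries alongside the merged total lets a dashboard show both the global figure and the office that contributes most to it.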


Index Engines has broadened their automated content indexing and profiling support across a wide range of applications and IT infrastructure. Large sources of data content like SharePoint and Exchange can now be efficiently scanned, cleaned and preserved to support data remediation and legal hold initiatives. This is in addition to the native ability of the Catalyst platform to scan NAS, file server, user devices and tape backup storage repositories.

For SharePoint environments, Catalyst’s ability to scan and index SharePoint data directly on the storage repository where it resides, without requiring a data dump to a separate storage area, is a big advantage over other solutions which require data to be copied out first before it can be scanned. Furthermore, Catalyst 5.1’s deep integration with Windows ACLs provides more fine-grained control of data access management to help ensure that corporate data is secure from unauthorized users.

In short, the Index Engines Catalyst Data Management Platform is more than just a tool for enabling compliance and legal hold requests; it is a system that IT data managers can use day to day to drive improved storage efficiencies throughout the enterprise.

Index Engines is a client of Storage Switzerland

As a 22 year IT veteran, Colm has worked in a variety of capacities ranging from technical support of critical OLTP environments to consultative sales and marketing for system integrators and manufacturers. His focus in the enterprise storage, backup and disaster recovery solutions space extends from mainframe and distributed computing environments across a wide range of industries.

Posted in Article, Product Analysis
