Most file system management tools – whether they are for reporting, governance, backup, archiving, migration, cloud bursting, or content classification – start developing performance problems when the total number of files and directories exceeds a hundred million. The performance challenges these products face are understandable: until recently, few enterprise NAS systems or HPC file systems could reliably handle more than 50 million files, so there was little to no demand for software tools that could scale to manage billions of files. As a result, organizations and data management software vendors now face the billion file problem (BFP).
One accelerator of the sudden increase in total file count per file system is NVMe flash. Vendors and file system architects can now store file system metadata on flash, enabling directory trees to grow into the billions of files without compromising performance or reliability.
What kinds of sites actually have billions of files and want to store them in one file system? This scale can be found in large enterprises, industrial R&D, scientific research, most AI and machine learning workloads, media and entertainment, and financial services, among others. In short, almost any organization of any scale can suffer from BFP, and with the introduction of the Internet of Things (IoT), 5G, and edge computing, the problem is only going to get worse and more widespread.
The lack of a full-featured data management tool often sends organizations with BFP down the dangerous path of designing something themselves. Many organizations in high-performance computing, cloud, and academia build their own data management tools for their specific use cases. While this approach often solves the immediate need, it creates a long-term problem when the developer of the system leaves the company or is pulled onto other projects.
Organizations with BFP need a data management tool that can not only move old data from one storage system to another but can also incorporate all of the different storage platforms used to store the billions of files. Foundational to the process is a global file system that simplifies access. The solution also needs to provide a dashboard for managing file system usage and file data. It needs to extract information from metadata, support custom tags to help find data, and surface information about the files in the environment, such as whether any of the data contains personally identifiable information.
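To make those requirements concrete, the sketch below shows, in a few lines of Python, what extracting inherent POSIX metadata and running a naive check for personally identifiable information might look like. The paths, tag names, and regex patterns are illustrative assumptions, not any product's implementation.

```python
# A minimal, hypothetical sketch of metadata extraction plus a naive PII
# check. The tag names and regex patterns are illustrative assumptions,
# not any vendor's implementation.
import re
import stat
from pathlib import Path

# Naive patterns for common PII; a real classifier would be far more thorough.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def file_metadata(path: Path) -> dict:
    """Collect the inherent (POSIX) metadata a catalog would index."""
    st = path.stat()
    return {
        "path": str(path),
        "size": st.st_size,
        "mtime": st.st_mtime,
        "owner_uid": st.st_uid,
        "mode": stat.filemode(st.st_mode),
    }

def scan_for_pii(path: Path, max_bytes: int = 1 << 20) -> list:
    """Return the PII categories whose patterns match the head of the file."""
    try:
        with path.open("rb") as f:
            text = f.read(max_bytes).decode("utf-8", errors="ignore")
    except OSError:
        return []
    return [name for name, rx in PII_PATTERNS.items() if rx.search(text)]

if __name__ == "__main__":
    for p in Path(".").rglob("*"):
        if p.is_file():
            record = file_metadata(p)
            record["pii_tags"] = scan_for_pii(p)  # custom tags layered on top
            print(record)
```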
Introducing Starfish Storage Software
Starfish Storage is a software company founded in 2013 to solve the Billion File Management problem. Unlike many startups that pick a market and then build a product for it, Starfish's founders were already in the trenches, creating architectures designed to handle millions or billions of files. The Starfish Storage solution has more data management and protection capabilities than can be covered in a single briefing note. At its core, though, is an architecture built for scale and flexibility. Its *FS is a Universal File System Namespace that federates disparate storage hardware into a single namespace. Thanks to that single namespace, data can move easily between the various storage solutions in place.
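As a conceptual illustration of federation (not a description of *FS internals, which are not detailed here), a single namespace amounts to a mapping from logical path prefixes to physical backends. The hypothetical sketch below resolves a logical path to the system that actually holds the data; every backend name and mount point is an assumption.

```python
# A conceptual sketch of a federated namespace: one logical tree whose
# prefixes map to different physical backends. All mappings below are
# hypothetical examples, not *FS internals.
from pathlib import PurePosixPath

# Logical prefix -> (backend, physical root). Illustrative values only.
MOUNT_MAP = {
    "/projects": ("nas-primary", "/mnt/netapp/projects"),
    "/archive":  ("object-store", "s3://corp-archive"),
    "/scratch":  ("hpc-scratch", "/lustre/scratch"),
}

def resolve(logical_path: str) -> tuple:
    """Translate a logical path into (backend, physical path)."""
    p = PurePosixPath(logical_path)
    for prefix, (backend, root) in MOUNT_MAP.items():
        try:
            rel = p.relative_to(prefix)
        except ValueError:
            continue  # this prefix does not contain the path
        return backend, f"{root}/{rel}"
    raise KeyError(f"no backend mounted for {logical_path}")

print(resolve("/projects/genomics/run42.bam"))
# -> ('nas-primary', '/mnt/netapp/projects/genomics/run42.bam')
```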
Simple use cases of the Starfish Storage solution include using its policy engine to move older data from primary storage to less expensive tiers. The solution can also identify potentially corrupted files and even attempt to repair them, and it provides content classification and custom tagging to make files easier to find in the future.
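For a feel of what an age-based tiering rule does under the hood, here is a minimal sketch assuming a simple "move files untouched for a year to the archive tier" policy. The threshold, paths, and logic are hypothetical rather than Starfish's policy syntax.

```python
# A minimal sketch of an age-based tiering policy: files untouched for a
# year move from primary storage to an archive tier. The threshold and
# both mount points are hypothetical assumptions.
import shutil
import time
from pathlib import Path

AGE_DAYS = 365                      # files untouched this long are "cold"
PRIMARY = Path("/mnt/primary")      # assumed primary-tier mount point
ARCHIVE = Path("/mnt/archive")      # assumed cheaper archive tier

def tier_cold_files(dry_run: bool = True) -> None:
    """Move (or just report, when dry_run) files older than the cutoff."""
    cutoff = time.time() - AGE_DAYS * 86400
    for f in PRIMARY.rglob("*"):
        if f.is_file() and f.stat().st_mtime < cutoff:
            dest = ARCHIVE / f.relative_to(PRIMARY)
            print(f"{'would move' if dry_run else 'moving'} {f} -> {dest}")
            if not dry_run:
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(f), str(dest))

if __name__ == "__main__":
    tier_cold_files()  # dry run by default; pass dry_run=False to act
```

Defaulting to a dry run, as the sketch does, is a common safeguard before letting any policy engine move data at scale.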
The solution is made up of three primary components. First, there is a database that synchronizes with file system metadata from the organization's POSIX file systems; administrators or users can add custom metadata as needed. The second component is a jobs engine, which leverages the information in the database and takes action based on policy. The jobs engine can copy, move, archive, and delete data, and it can calculate hashes to verify data integrity and uniqueness. It is designed for large scale: any number of agents can divvy up the working set and complete a job in parallel. The third component is the user interface, an HTML5 file system browser for administrators; a user portal is also in beta.
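The hashing and parallel-agent ideas can be illustrated with a short sketch: a process pool divides the file list, and each worker streams files through SHA-256 to flag duplicate content. The pool size and scan root are assumptions for illustration, and this stands in for Starfish's agents only conceptually.

```python
# A minimal sketch of the "parallel agents" idea: a process pool divides
# the working set, and each worker computes a content hash that a catalog
# could use for integrity checks and duplicate detection. Pool size and
# scan root are illustrative assumptions.
import hashlib
from multiprocessing import Pool
from pathlib import Path

def sha256_of(path: str) -> tuple:
    """Stream a file through SHA-256 so large files stay memory-safe."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

if __name__ == "__main__":
    files = [str(p) for p in Path(".").rglob("*") if p.is_file()]
    seen = {}                               # digest -> first path seen
    with Pool(processes=8) as pool:         # eight "agents" share the job
        for path, digest in pool.imap_unordered(sha256_of, files):
            if digest in seen:
                print(f"duplicate content: {path} == {seen[digest]}")
            else:
                seen[digest] = path
```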
Starfish Storage claims that users can be up and deriving value from their file systems within 15 minutes of installation. The software addresses extreme-scale requirements; Starfish claims installations ranging from one million files to tens of billions. It allows organizations to find, report on, and track their files to automate workflows and reduce costs, and it leverages inherent metadata, extended metadata, and the customer's custom metadata to fine-tune search and automation. The solution can migrate, sync, and archive data across dozens of storage systems and public cloud storage services.
StorageSwiss Take
Organizations with multiple millions to billions of files have a serious problem. It is not enough for IT administrators to store this data. They need to make sure it is stored in the most cost-effective way possible by moving old data to less expensive storage tiers and by eliminating the duplicate file sprawl that is bound to occur. They also need to make sure the data is protected, and most importantly, they need to make sure users can find data when they need it in the future. If data can't be found, there is no point in storing it. Starfish gives IT administrators the ability to deliver all of these capabilities to their organizations.