Managing petabytes of data is fundamentally harder than managing terabytes. Getting management wrong could cost the company thousands – or, worse, lead to data loss that could cost it millions. Problems that you can ignore or confront with brute force at terabyte scale become insurmountable obstacles at petabyte scale. The key to managing this level of data is to understand what your obstacles are and prepare for them with a focused management approach.
The biggest challenge facing many large data centers is the extreme growth of unstructured data. While structured data has grown over the last 10 to 20 years, the sources of unstructured data – as well as our ability to use that data – have increased significantly in just the last few years. One of the challenges of unstructured data management is that it is not as easy as it used to be to decide when to delete it.
For example, it is very easy to select all financial records from a previous accounting year, archive them, and delete them from primary storage. It is not so simple to do this with unstructured data. One of the challenges is that unstructured data is often owned by myriad people rather than a unified application such as accounting.
Your environment may contain thousands of users and hundreds of applications that create unstructured data, and since it is often unclear when a piece of unstructured data will become useful, no one wants to delete anything. The lack of specific accountability results in a significant portion of the data being inactive – taking up space for no purpose. As long as we rely on data creators to identify and migrate older, unused data, inactive data will always be a problem.
The problems with primary storage bloat get even bigger once we look at the backup system. Since most backup systems treat all unstructured data alike, they back up very important data and inactive data with the same policies. They are simply unable to do anything other than treat all data equally, because the data is co-mingled. Many organizations do a weekly full backup followed by daily incremental backups, with a retention of at least six months – possibly a year or more.
If you retain weekly full backups for at least 90 days, you are looking at roughly 12 copies of data that no one cares about. If you store an on-site and an off-site copy of that data, you are looking at 24 copies of that data – most of which is inactive. For a 2 TB organization, that’s only 48 TB of data, and while that is a lot, it is certainly manageable. Again, brute force will work; 48 TB is about a dozen pieces of tape media or hard disk drives. But for a 1 PB customer, that’s 24,000 TB of data, or about 6,000 pieces of media – per year.
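To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of those figures, assuming weekly fulls, roughly 90 days of retention (about 12 retained full backups), one on-site plus one off-site copy, and about 4 TB per piece of tape or disk media – illustrative assumptions, not vendor specifications.

```python
# Back-of-the-envelope math for the copy counts above. The retention,
# site count, and 4 TB-per-media figures are illustrative assumptions.

def backup_copies(primary_tb, fulls_retained=12, sites=2, media_tb=4):
    total_tb = primary_tb * fulls_retained * sites
    media_count = total_tb / media_tb
    return total_tb, media_count

for size_tb in (2, 1000):  # a 2 TB shop versus a 1 PB shop
    total, media = backup_copies(size_tb)
    print(f"{size_tb:>5} TB primary -> {total:,.0f} TB of full-backup copies "
          f"(~{media:,.0f} pieces of media)")
```

Run against the article's two examples, this yields 48 TB (about 12 pieces of media) for the 2 TB shop and 24,000 TB (about 6,000 pieces of media) for the 1 PB shop.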
Extra copies of inactive data stored on backup systems create a number of unwanted consequences. If you are using disk storage, storing and replicating this data has a cost. If your storage does not use deduplication, the cost of storing these extra copies can be astronomical. If you are using deduplication, the costs are a bit obfuscated. While your deduplication storage system may be able to store 20 copies in the space it takes to store one, the vendor will definitely charge you for that ability. You’re still paying for that extra storage; you’re just paying for it in a different way. Some have put it this way: a deduplication system makes 1 TB of storage look like 20 TB, but only charges you for 10 TB – which means the vendor has figured out how to charge you for 10 TB of storage while giving you only 1 TB.
Most organizations that use deduplication in their backup system do it on the target side of the equation, which means that repeated full backups of inactive data still create issues on the backup client side. A full backup has a performance impact on the system being backed up and on the network over which the backups are sent; repeated full backups of inactive data therefore cost your company money by requiring you to buy beefier servers and faster networks.
The challenges of backing up inactive data also appear during a restore. Consider the scenario of a data center that has a petabyte of data, 900 TB of which is inactive. Restoring a petabyte of data is a significant undertaking and will take quite a bit of time. Imagine how much faster that restore could go if that environment only had to restore the 100 TB of data the organization is actually using!
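A rough calculation shows how much restore time is at stake. The 1 GB/s sustained restore rate below is purely an assumption for illustration; actual throughput depends on the backup target, the network, and the file system.

```python
# Rough restore-time comparison for the scenario above, assuming a
# sustained restore rate of 1 GB/s (an illustrative figure only).

def restore_hours(data_tb, throughput_gb_per_s=1.0):
    return (data_tb * 1000) / throughput_gb_per_s / 3600  # 1 TB = 1,000 GB

for label, tb in (("Full 1 PB restore", 1000), ("Active 100 TB only", 100)):
    print(f"{label}: ~{restore_hours(tb):,.0f} hours")
```

At that assumed rate, the full petabyte takes roughly 278 hours to restore, while the 100 TB of active data takes closer to 28 hours.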
Another challenge users create by never deleting anything is that it becomes very hard to find anything when you need it. It turns all storage into the junk drawer in your house. You’re never able to find the doodad you’re looking for if it happens to be in the junk drawer. You will find phone chargers for phones you no longer own, paper clips of various forms, old batteries, several dozen barrettes – but no doodad. Primary storage is the same way: when you fill it with data that is mostly inactive, it becomes very hard to find the files that are active.
Once again, this problem exists even for a single laptop user trying to find files on their own computer. Imagine how much bigger the problem is when we are talking about thousands of users and petabytes of data. It can render important files effectively lost, making them essentially worthless. The result is that users duplicate their efforts and re-create the file – making the unstructured data growth problem even worse.
Bigger companies with petabytes of data typically also have multiple locations, with different users creating and using different files. They might want to be able to share some of this data, but when you’re talking petabytes of data, that becomes quite difficult. It also exacerbates the “junk drawer” problem. It’s hard enough to find something in the junk drawer; it’s even harder when you’re not sure which junk drawer to look in.
Acknowledge and Address the Unstructured Data Problem
Addressing the unstructured data problem begins with acknowledging that it exists. Acknowledge that files are hard to find in a large environment, and even harder to share. Acknowledge that a significant portion of compute, network, and storage resources is consumed storing, copying, and backing up inactive data.
One way to address these problems is to create a global, unified file system that takes all of the above problems into account. This does not solve the problem of users creating millions of files and leaving them there forever, but it does at least put the problem under one umbrella where you can centrally manage and address it. Confront the problems head on and solve them once, rather than solving them multiple times around the enterprise.
A file system designed to be this large should have integrated search via advanced metadata. Users could easily search across many different kinds of metadata to find the file they are working on. They would, of course, continue to have the usual file system semantics they are used to, allowing them to create directories and subdirectories to help organize their files. A single file system with federated search would also allow them to find files that others are working on that match the metadata they are interested in.
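As a sketch of what metadata-driven search might look like, consider a toy catalog that records an owner, a set of tags, and an active flag for each file. The field names and the in-memory list are hypothetical simplifications; a real global file system would maintain such an index itself, at vastly larger scale.

```python
# A toy illustration of searching a file catalog by metadata.
# Field names (owner, tags, active) are hypothetical, not a product API.
from dataclasses import dataclass, field

@dataclass
class FileRecord:
    path: str
    owner: str
    tags: set = field(default_factory=set)
    active: bool = True

catalog = [
    FileRecord("/projects/q3/forecast.xlsx", "alice", {"finance", "2024"}),
    FileRecord("/archive/2019/old-model.xlsx", "bob", {"finance"}, active=False),
]

def search(records, owner=None, tag=None, include_inactive=False):
    """Return records matching the given metadata filters."""
    return [r for r in records
            if (owner is None or r.owner == owner)
            and (tag is None or tag in r.tags)
            and (include_inactive or r.active)]

print([r.path for r in search(catalog, tag="finance")])                         # active data only
print([r.path for r in search(catalog, tag="finance", include_inactive=True)])  # includes inactive data
```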
Most importantly, a file system designed to solve this problem must understand that there is active and inactive data, and it must treat them differently. The most obvious thing to do is automatically identify and migrate inactive data to less expensive, self-protecting object storage. This solves several of the problems mentioned above, including the money wasted on primary and backup storage. A file system that understands the difference between active and inactive data also makes it easier to search for files, since activity status becomes one more piece of metadata you can use in a search.
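One simple way to picture the migration step is an age-based tiering pass: walk the file system, treat anything not accessed in, say, 180 days as inactive, and copy it to object storage before freeing the primary copy. The sketch below assumes POSIX access times are trustworthy and uses an S3-compatible bucket via boto3 with a made-up bucket name; it is an illustration of the idea, not any particular product's implementation, which would include its own policies and safeguards.

```python
# A minimal age-based tiering sketch. The 180-day threshold, the bucket
# name, and the use of boto3/S3 are illustrative assumptions only.
import os
import time
import boto3

INACTIVE_AFTER_DAYS = 180
BUCKET = "archive-tier"  # hypothetical bucket name

s3 = boto3.client("s3")

def migrate_inactive(root):
    cutoff = time.time() - INACTIVE_AFTER_DAYS * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:    # not accessed recently
                key = os.path.relpath(path, root)
                s3.upload_file(path, BUCKET, key)  # copy to object storage
                os.remove(path)                    # free primary capacity
                print(f"migrated {path} -> s3://{BUCKET}/{key}")

# migrate_inactive("/mnt/primary")  # example invocation
```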
A single, global file system also helps users around the world share data with each other. Users in multiple offices can search the same global file system to find the types of data that they’re looking for and immediately access it – assuming they have the appropriate permissions. Because the global file system understands the concept of inactive data, a search could (if the user wanted it to) include inactive data as well.
The simple act of migrating inactive data to less expensive object storage also frees up the backup system. It makes backups and restores faster because they no longer have to deal with inactive data. And since 100 TB of primary data turns into 1 PB or 2 PB of backup data, it also saves a significant amount of storage. Some feel that data stored in a self-protecting object store doesn’t need to be backed up at all. If you decide to back it up anyway, you can do so in a way that recognizes its nature and stores far fewer copies of inactive data in the backup system.
StorageSwiss Take
This problem has been looming for years. Enterprises seem to have a never-ending thirst for unstructured data, and application developers keep finding new ways to leverage it, making holding onto such data even more attractive. The growth of unstructured data is unlikely to subside anytime soon, so the best you can do is address the problem head on. A solid approach is a global file system built to handle the problem – one that understands metadata and automates the migration of inactive data to inexpensive object storage.
Sponsored by HGST