Unstructured Data Meets Tape Archiving Efficiency

Dealing With the Unstructured Data Deluge in Higher Education

Colleges and universities have many of the same issues with efficiently protecting critical data archives as private corporations, but arguably their challenges are even greater. While IT budgets have typically remained flat for private enterprises, some industry sources estimate that state funding for higher education has been reduced by as much as 28% in the past five years. This dearth of state funding, combined with the ongoing explosion in data growth, is forcing IT planners in academia to find solutions that can both cost-effectively manage long-term data archives and provide users with fast access to information.

Big Data Conundrum

It has been well documented that unstructured data now accounts for the majority of all new data creation; this is certainly also true for higher education. Sources of unstructured data include machine-generated data, user files, PDF images, PowerPoint documents and audio/video files.

For higher education, the data deluge is especially complex because there is a broad range of data-intensive applications to support. From digital learning courses and research projects to video surveillance systems and high performance computing applications that consume big data, unstructured data repositories throughout academia are vast and continuing to grow. The vast majority of this data is infrequently referenced after it is initially created, but end users still need the ability to access it quickly on demand.

As data grows on primary storage systems, additional pressure is placed on other resources downstream. Backup windows, for example, start to lengthen, and network links begin to saturate as backup data volumes increase, potentially disrupting other sources of data traffic in the environment. With higher education IT budgets becoming increasingly constrained, adding capacity or upgrading existing primary storage and backup systems to accommodate this growth is difficult and at times not an option.

Compounding this problem is the fact that the various academic departments within college and university environments are all vying for the same IT budget dollars. Consequently, infrastructure planners in academia need to find creative ways to stretch their funds across a broad range of end user storage and data protection requirements.

Cloud Relief?

Some universities may be considering public cloud solutions, which offer "cheap and deep" storage capacity, as a way to efficiently archive their inactive information. There are significant challenges, however. The first is that applications need to support cloud protocols to interface with the provider's storage, introducing cost, complexity and delays in implementing a solution.

Secondly, while some cloud storage providers offer attractive "teaser" rates for archiving data in their cloud, these costs quickly increase as data grows, making this option less attractive as a long-term archive. Essentially, with the cloud the institution pays for the same storage all over again every month. While purchasing physical storage outright involves a large upfront expenditure, over time it is still less expensive than the cloud model. Retrieval could also be painfully slow, as the data would more than likely traverse network links with limited bandwidth.
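
To make that recurring-cost dynamic concrete, here is a back-of-the-envelope sketch. Every figure in it (the $0.01/GB/month rate, the $500,000 hardware price, the static 1PB archive size) is an illustrative assumption, not a quote from any provider:

```python
# All figures below are illustrative assumptions, not provider quotes.
ARCHIVE_GB = 1_000_000        # a static 1PB archive, expressed in GB
RATE_PER_GB_MONTH = 0.01      # assumed "cheap and deep" cloud rate, $/GB/month
UPFRONT_HW_COST = 500_000     # assumed one-time cost of owning the storage

monthly = ARCHIVE_GB * RATE_PER_GB_MONTH          # $10,000 per month
for years in (1, 3, 5):
    print(f"{years} yr cloud spend: ${monthly * 12 * years:,.0f} "
          f"vs one-time hardware: ${UPFRONT_HW_COST:,.0f}")
```

And since archives rarely stay static, real cloud spend would climb faster than this flat-rate model suggests.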

Disk-on-the-cheap?

Another option for storing archive data is low-cost disk, such as SATA. Multiple suppliers offer multi-PB, direct-attached or SAN-attachable storage enclosures that can be very cost effective on a cost-per-GB basis. Moreover, these enclosures can be quickly racked, powered up and provisioned to satisfy growing application data stores. The challenge with this approach, however, is that rotational disk media requires a significant amount of power and cooling. The year-over-year cost of powering and cooling potentially thousands of disk drives would be prohibitive and consequently not an ideal approach, given the budgetary challenges in academia today.
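
A rough sketch of that operational overhead, with the drive count, per-drive wattage, cooling multiplier and utility rate all chosen as illustrative assumptions:

```python
# Every figure here is an assumption chosen to make the arithmetic concrete.
DRIVES = 2_500               # e.g. a multi-PB archive of 4TB SATA drives
WATTS_PER_DRIVE = 8          # rough average draw for nearline SATA
COOLING_MULTIPLIER = 2.0     # ~1W of cooling for every 1W of IT load
COST_PER_KWH = 0.12          # assumed utility rate, $/kWh

kwh_per_year = DRIVES * WATTS_PER_DRIVE * COOLING_MULTIPLIER * 24 * 365 / 1_000
print(f"Estimated annual power + cooling bill: ${kwh_per_year * COST_PER_KWH:,.0f}")
```

Tape cartridges sitting in library slots, by contrast, draw no power at all.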

Cache Powered Tape

What is needed is a solution that can leverage the economies of scale of a low-cost storage medium, like tape, to efficiently archive petabytes (PB) of information. To satisfy application service level agreements (SLAs), however, the solution should also be capable of providing rapid access to data retrieved from the tape archive through an efficiently sized front-end solid state storage cache.

In this manner, as data transitions from an inactive to an active state, it is retrieved from the tape archive and immediately promoted into the cache so that users have rapid access to the information. The first access is at tape speed, but subsequent accesses of the same file are at the speed of SSD.
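
A minimal sketch of that promotion logic, assuming the tape tier and the SSD cache are both visible as ordinary directories (the /cache and /ltfs paths are hypothetical; a real appliance implements this internally behind a single share):

```python
import shutil
from pathlib import Path

# Hypothetical mount points; a real tNAS hides both tiers behind one share.
SSD_CACHE = Path("/cache")   # fast front-end tier
TAPE_TIER = Path("/ltfs")    # LTFS-backed tape archive tier

def read_file(relative_path: str) -> bytes:
    """Serve a file from cache, promoting it from tape on first access."""
    cached = SSD_CACHE / relative_path
    if not cached.exists():
        # Cache miss: the first read pays the tape recall penalty.
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(TAPE_TIER / relative_path, cached)  # promote to SSD
    # Cache hit (or freshly promoted): this read runs at SSD speed.
    return cached.read_bytes()
```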

NAS Converged Tape

An ideal architecture for this solution would consist of a Network Attached Storage (NAS) appliance combined with a fully integrated, intelligent back-end LTFS (Linear Tape File System) tape library. The NAS and LTFS combination makes for an ideal archiving platform because unstructured data is typically stored as files, and both technologies natively support file system storage access. NAS is also an easy storage technology to deploy, since it can plug into existing Ethernet network switches and present a CIFS or NFS network share that allows connecting servers to access unstructured files.

The open nature of an LTFS tape storage repository is also beneficial, since data is stored in its native format. Prior to the introduction of LTFS, tape archives could only be read by the proprietary backup or archiving application that originally wrote the data to tape. With LTFS, any application running the LTFS driver can access the information in the tape archive. This is a critically important capability, since data archives need to be accessible far into the future; keeping the archive in an open format helps make this possible.
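
Because a mounted LTFS volume behaves like any other file system, plain file-level tooling can inventory it. A sketch, assuming a hypothetical /mnt/ltfs mount point:

```python
import os

MOUNT_POINT = "/mnt/ltfs"   # hypothetical mount point for an LTFS volume

# No proprietary backup application is needed: the mounted cartridge
# answers ordinary POSIX calls like any other file system.
for dirpath, _dirnames, filenames in os.walk(MOUNT_POINT):
    for name in filenames:
        full = os.path.join(dirpath, name)
        print(full, os.path.getsize(full), "bytes")
```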

From a user perspective, the process of accessing data would be completely transparent. If the file is on primary storage, it is served from that repository. If it is on tape, there is an initial delay while the data is restored to the NAS disk cache, but no separate process is required to retrieve the information. And as stated above, subsequent access to the restored data happens at cache speeds. These solutions could also play a role in cloud storage strategies, providing a hybrid approach in which data is not totally dependent on the cloud.

tNAS Fueled ROI

This type of approach, also referred to as tNAS (tape NAS), enables IT planners to implement a tiered storage strategy that reduces primary storage costs. By migrating or copying inactive data from primary storage systems to the tNAS via simple file system operations, infrastructure managers can free up primary storage space and potentially defer additional disk storage purchases. The savings from freeing up primary storage assets alone could make for a quick return on investment (ROI) on the tNAS.
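
A hedged sketch of what that migration could look like at the file system level, using last-access time as the inactivity test. The 180-day threshold and both share paths are assumptions, and most shops would drive this from an archiving tool rather than a script:

```python
import shutil
import time
from pathlib import Path

PRIMARY = Path("/primary/share")   # hypothetical primary NAS export
TNAS = Path("/tnas/archive")       # hypothetical tNAS CIFS/NFS share
INACTIVE_DAYS = 180                # assumed inactivity threshold

cutoff = time.time() - INACTIVE_DAYS * 24 * 3600
for path in PRIMARY.rglob("*"):
    # Treat any file not read in the last 180 days as archive-eligible.
    if path.is_file() and path.stat().st_atime < cutoff:
        dest = TNAS / path.relative_to(PRIMARY)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest))  # frees primary capacity
```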

In addition to deferring costly storage expenditures, there could also be savings in disk power and cooling, as well as a reduction in the data center floor space required to house additional disk storage systems. In fact, when these operational overhead reductions are taken into account over a multi-year timeframe, the collective savings could be quite substantial, particularly for environments supporting PBs of information.

Lastly, with LTO-6 tape technology's ability to store 2.5TB of data natively on a single cartridge, academic institutions could build a highly cost-effective tape archive to satisfy their multi-PB unstructured data archive needs far into the future. What's more, this archive would obviate the need to run separate backup processes for unstructured data once it has migrated to the archive, helping to reduce backup-induced network latency and delivering additional savings in backup software licensing costs. All of this combines into a significant reduction in total cost of ownership while improving the standard of protection.
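
The cartridge arithmetic is easy to check. A sketch using LTO-6's 2.5TB native capacity from above and an assumed 2:1 compression ratio:

```python
import math

LTO6_NATIVE_TB = 2.5     # native capacity per LTO-6 cartridge
COMPRESSION = 2.0        # assumed ratio; actual results vary by data type

for archive_pb in (1, 5, 10):
    native = math.ceil(archive_pb * 1_000 / LTO6_NATIVE_TB)
    compressed = math.ceil(archive_pb * 1_000 / (LTO6_NATIVE_TB * COMPRESSION))
    print(f"{archive_pb}PB archive: {native} cartridges native, "
          f"~{compressed} at 2:1 compression")
```

A 1PB archive, for example, fits on 400 cartridges native, or roughly 200 at 2:1 compression.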

Conclusion

Storage archiving technologies, like the StrongBox offering from Crossroads Systems, are making it financially feasible for budget-strapped higher education institutions to cost-effectively manage their growing unstructured data repositories. By combining the efficiencies of a highly dense tape storage archive with a front-end NAS disk cache, academic IT organizations can implement a tiered storage infrastructure that plugs seamlessly into their existing primary storage environment to reduce ongoing storage capital and operational expenditures.

Crossroads Systems is a client of Storage Switzerland


As a 22-year IT veteran, Colm has worked in a variety of capacities, ranging from technical support of critical OLTP environments to consultative sales and marketing for system integrators and manufacturers. His focus in the enterprise storage, backup and disaster recovery solutions space extends from mainframe and distributed computing environments across a wide range of industries.

9 comments on "Unstructured Data Meets Tape Archiving Efficiency"
  1. SteveO says:

    I think that organizations need to carefully consider large tape deployments. Your article outlines some of the wins very well, but doesn’t address some potentially complex and expensive considerations.

    From an opex perspective, tape will eventually become more expensive to manage. If an organization builds a turnkey infrastructure, where media expires in-frame and constantly feeds a "scratch" pool, the labor costs of managing tape shouldn't be much more than managing disk systems. However, if organizations need to get into the business of media management, the labor costs of doing so could become massive, not to mention the issue of tape storage external to the library.

    LTFS also has the limitation that a file cannot exceed the capacity of a single piece of media. This shouldn't be a common problem, but in our organization we have several customers who create large files which are continuously appended to.

    The real kicker in terms of LTFS complexity is the LTO lifecycle. The LTO standards say that drives will be backwards compatible for reads two generations back and write compatible one generation back. At some point, organizations will need to invest in media migration, which could get very, very expensive in capex and opex.

    Data protection also poses challenges, but not every data set requires it.

  2. jas says:

    It is no longer correct to say that LTFS cannot span more than one piece of media (and hasn’t been for some time).

    Secondly, media migration can be automated, in which case the overhead is not that great.

    • SteveO says:

      As far as I am aware (and I am not an LTFS user), the maximum file size must be less than the size of a single piece of media. According to a SNIA document created in May 2013:

      http://snia.org/sites/default/files/CloudTapeUseCases_v1.0.pdf

      “Spanning of large files across tapes (LTFS work in progress)”

      You’re right about automation, but there’s still a significant capital expense involved:

      Let’s say I have a tape library, 1000 slots, 10 drives. In order to migrate:

      – I’ll need 10 “new version” drives to continue writing data at my current ingest rate
      – I’ll need additional frame space & “new version” media to write to
      – I’ll need to purchase and license automation software, such as FileTek. Not sure if there are open source options available, which could reduce cost – if “it just works”

      In the above, I’ve probably doubled my initial investment in my 1000-tape library. If my organization did something like dual in-line copying of tapes for redundancy, my level of effort just increased, again.

      • Colm Keegan says:

        Good comments SteveO.

        I'll have to agree with JAS on the spanning issue; there are some archive software packages that address that.
        Also, how many files are really going to be bigger than an LTO tape, with current LTO-6 at 2.5TB native and 2x or more compressed?

        The LTO lifecycle is something to think about, but with each drive reading back two generations, it's probably 10 years before you won't be able to buy a drive that can read the tape you record today. In my mind, you'd be hard pressed to find a disk-based system with a longer lifespan than that.

        In the end, for large, long term storage nothing comes close to the cost and density of LTO tape. And that differential is going to get more dramatic with future LTO generations.

      • SteveO says:

        So, if I understand this correctly, the LTFS spec itself doesn’t support spanning media, but there are volume managers that do. Thanks to both you and JAS for providing that insight. It might actually cause me to look at LTFS again …..

        That said, while maintaining backwards read compatibility over 10 years makes sense, in a long-term archive we're still talking about replacing gear every 5 years to maintain write compatibility. That's about how long we keep gear on the books here. I still maintain that the notion that tape will always be cheaper than disk is not a slam dunk in all situations.

        Consider an object store running geodispersal technology. In certain configurations, 1.3PB of raw disk can yield storage for 1PB of data. Considering the use case where an organization must pay operators to swap tape drives in and out of a library, coupled with the cost of software to handle data integrity checking on tape and automatic migration of media to new versions, the costs of disk and tape get pretty close.

        Plus, the lifecycle model with object storage becomes quite easy to manage. I might buy an object store with trays of 4TB drives today. Tomorrow, as those drives fail, I'm able to put in 6, 8, or 10TB drives, whatever capacity is most cost-effective and available at the time. I'm not forced to lifecycle entire arrays, as in the scale-down model. As long as throughput is acceptable, I'm managing disk trays, which could last 10+ years. Point being, my disk system can live as long as your 10-year model for tape 🙂 Granted, there will be many component upgrades, but the system itself *can* live on.

        As far as capacity-per-unit, you’re right – LTO has specs for density far beyond where disk capacity is projected to be.

        Thanks for your thought-provoking articles. I do enjoy reading your pieces.

        –S

  3. Colm Keegan says:

    Thanks for all the feedback, Steve. We like to have a two-way conversation with our readers, so please feel free to chime in whenever you like. I can't disagree with your comments about object storage. The bottom line is that you need to factor in all your unique requirements and determine what trade-offs you can live with to attain efficiencies without sacrificing the long-term viability of the solution. An interesting use case for the StrongBox technology, by the way, comes from a not-for-profit organization called Cyark. I met with them back in the spring and wrote the following briefing note:

    http://www.storage-switzerland.com/blog/entries/2013/4/12_Archiving_the_Wonders_of_the_World_-_Cyark_and_Iron_Mountain.html

    This post is actually on our older site so be sure to come back to storageswiss.com for all the latest posts, videos, etc.

    Thanks again!

  4. […] we discussed in our article "Unstructured Data Meets Tape Archiving Efficiency", this problem has largely been solved thanks to network mountable archives. Both disk and tape […]

  5. […] retrieving data from tape is not as fast as disk, some vendors have introduced data archiving solutions which utilize a hybrid of disk and tape to enable organizations to get the best of both worlds—a […]

  6. […] is still accessed the same way no matter where it is stored. As we discussed in our article, “Unstructured Data Meets Tape Archiving Efficiency”, neither the user nor the administrator needs to know the exact location of the file. They just […]

