Designing an All-Flash Data Management Strategy

Thanks to decreasing flash media pricing and data efficiency technologies like deduplication, all-flash arrays are now commonplace in many data centers. In most cases, though, organizations purchased their first-generation all-flash arrays to solve a specific performance problem. Now, with flash prices continuing to fall and flash density increasing, organizations are considering flash for all their storage needs. Is a flash-only strategy the right way to go, or should IT consider an alternative that delivers similar performance while being more affordable and better suited to long-term data retention?

The Truth about Data

Study after study indicates that organizations have not accessed more than 80% of their data in the last three years. The problem is that in most data centers, 100% of this data remains on the primary storage system. The only time the data moves is during a migration to a new storage system that replaces the old one.

Another problem is that most organizations still count on their backup product to provide data archiving and retention, maintaining years of data within the backup application. The challenge is that while most data is inactive for years, when it does become active, the organization needs to find it and make it available quickly. This dichotomy is part of the motivation for leaving data on primary storage.

Most organizations are in dire need of a data management strategy. Data, for the most part, exists in one of two extremes: it is active, or it is dormant; there is seldom an in-between. IT needs an architecture that delivers the high performance that active data requires while also providing long-term storage and retention to keep costs down and meet compliance requirements. The design also needs to address the concern of data suddenly becoming active after years of dormancy.

Designing a Data Management Strategy for the Real World

If over 80% of an organization's data is dormant, that data gains no performance benefit from being on flash. Moreover, flash, despite decreases in media cost, remains more expensive than hard disk drives, especially when factoring in total system cost: it takes more processing power and network IO to ensure the storage system keeps up with the flash media. Most organizations, though, keep data on their primary storage systems, which are now increasingly all-flash, because the concern over the sudden reactivation of data outweighs concerns over the cost of storage.

The longer data is dormant, the more likely it is to remain dormant. In other words, an access of data that is 200 days old is more likely than an access of data that is 365 days old. Additionally, the older the data is, the less time pressure there is on recovering it. Accesses of data this old are event-driven; responding to a discovery request or preparing data for analysis are two examples. It is still necessary to deliver the data quickly, but instant delivery is not required.

Primary storage should address most of the concern over data access, but instead of placing all data that is less than a year old on the fastest flash array possible, it should automatically provide data management. The primary storage system could be a two-tier all-flash array, where the most active 5% of data is on high-performance NVMe flash and the less active data is on SAS-based high-density flash. Alternatively, many organizations will do fine with a small high-performance flash tier and a larger medium-performance hard disk tier. The determining factor is how concerned the organization is over the responsiveness of the second tier. If users frequently request data that is more than 90 days old, a dual-tier flash system may make the most sense.
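The age-based tiering decision described above can be sketched as a simple policy function. This is a minimal illustration, not any vendor's actual algorithm; the 90-day and one-year thresholds, the tier names, and the `assign_tier` function are all assumptions chosen to match the age bands discussed in this article:

```python
from datetime import datetime, timedelta

# Hypothetical age thresholds -- tune these per organization.
HOT_DAYS = 90      # data accessed within ~90 days stays on NVMe flash
WARM_DAYS = 365    # data accessed within a year sits on the second tier

def assign_tier(last_access, now=None):
    """Map a file's last-access time to a storage tier name."""
    now = now or datetime.now()
    age = now - last_access
    if age <= timedelta(days=HOT_DAYS):
        return "nvme-flash"        # the most active ~5% of data
    if age <= timedelta(days=WARM_DAYS):
        return "high-density"      # SAS flash or medium-performance disk
    return "object-store"          # dormant data, an archive candidate
```

In a real array this decision runs continuously and internally at the block level; the sketch only shows how last-access age maps onto the tiers described above.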

An essential element of these hybrid systems is that data movement is automatic, transparent and internal. Data does not need to move across the network for it to be available. Additionally, the hybrid systems operate at a block level and require no modifications to the applications or operating systems.

The next component of the architecture is a long-term storage device, typically an object storage system. The objective is to move infrequently accessed data to the object store for long-term retention. The object store is designed not only to protect against media failure but also to provide data durability, which guards against data degradation. Moving data to the second tier on a file-by-file basis makes it easier to meet compliance regulations like GDPR.
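One common building block of the durability protection mentioned above is fingerprinting: the object store records a cryptographic checksum at ingest and periodically re-verifies it to detect silent data degradation. The sketch below is a generic illustration of that idea, not the mechanism of any specific product; the `fingerprint` and `verify` helpers are hypothetical names:

```python
import hashlib
import pathlib

def fingerprint(path):
    """SHA-256 digest of a file, computed in chunks and stored at ingest time."""
    h = hashlib.sha256()
    with pathlib.Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_digest):
    """Re-read the object and compare against the digest recorded at ingest.

    A mismatch means the stored copy has degraded and should be repaired
    from a replica or erasure-coded fragments.
    """
    return fingerprint(path) == expected_digest
```

Object stores typically pair this kind of check with replication or erasure coding so that a failed verification can be repaired automatically rather than merely detected.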

An object storage system is far less expensive than an all-flash array. It uses the highest-capacity hard disk drives available, which are less expensive than flash media. It also doesn't need high-end CPUs and networking ports since performance is not its primary concern. Despite the more modest performance, the system can move data back to the primary storage system very quickly when necessary.

The organization uses an archiving software solution to automatically identify data to move to the object storage system. To make it easy to find the data in the future, the software indexes each file's metadata. Some solutions provide a transparent recall capability that automatically copies a file back to primary storage on access.
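The archiving workflow above has three parts: identify cold files by age, index their metadata, and recall them on demand. A minimal sketch of that loop follows; the one-year cutoff, the SQLite index, and the function names are illustrative assumptions, and a real archiving product would also leave a stub or symlink behind so recall is truly transparent to users:

```python
import os
import pathlib
import shutil
import sqlite3
import time

ARCHIVE_AGE_DAYS = 365  # hypothetical cutoff for "dormant" data

def build_index(db_path):
    """Create (or open) a metadata index so archived files remain findable."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS files
                   (path TEXT PRIMARY KEY, size INTEGER,
                    mtime REAL, atime REAL, archived INTEGER)""")
    return con

def archive_cold_files(src_dir, archive_dir, con, now=None):
    """Move files unread for ARCHIVE_AGE_DAYS to archive_dir and index them."""
    now = now or time.time()
    cutoff = now - ARCHIVE_AGE_DAYS * 86400
    for p in pathlib.Path(src_dir).rglob("*"):
        if not p.is_file():
            continue
        st = p.stat()
        if st.st_atime < cutoff:
            dest = pathlib.Path(archive_dir) / p.name
            shutil.move(str(p), str(dest))
            con.execute("INSERT OR REPLACE INTO files VALUES (?,?,?,?,1)",
                        (str(p), st.st_size, st.st_mtime, st.st_atime))
    con.commit()

def recall(original_path, archive_dir):
    """Transparent recall: copy the archived file back to its original location."""
    src = pathlib.Path(archive_dir) / pathlib.Path(original_path).name
    shutil.copy2(str(src), original_path)
```

The metadata index is what makes later discovery requests practical: years from now, the organization can query it by name, size, or date without touching the archived objects themselves.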


Instead of creating an all-flash data center, a more realistic approach is to create a hybrid storage architecture. The object storage system keeps the cost of the primary storage system to a minimum, and it eliminates the need for an additional backup storage device since backup software solutions can send primary storage backups to the object storage system. The result is a data center with two storage systems that meet both its performance needs and its budget realities.

Sponsored by Tegile, a Western Digital Brand

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
