How To Create an Always Predictable Data Center

Posted on June 12, 2017 by George Crump

The modern data center must be predictable not only during normal production operations but also in failed-state situations. An “Always Predictable” storage architecture is key to enabling IT to meet the various service levels the organization requires.

The Predictable Primary Storage Challenge

The first type of predictability is how consistent performance is under normal operating conditions. The problem is that “normal” is never a static state. The number of application and user requests will ebb and flow throughout the day. Additionally, the types of workloads making these requests may vary considerably. Consequently, the primary storage system needs continuous monitoring and management so the system itself or IT admins can respond to those fluctuations as they occur.

All-flash storage systems seem to be the easy answer, but not all of these systems are created equal. Many still don’t have the robust feature sets that enterprise environments count on to monitor, manage and protect the data they store. They also don’t support the full range of storage protocols that a mixed workload requires.

The lack of comprehensive data and management features in many all-flash offerings often results in IT buyers acquiring multiple all-flash point solutions just to meet the variable performance requirements of their application workloads. While point all-flash arrays may help remedy specific application performance problems, they can create predictability challenges because IT has to use a separate interface to manage and monitor each system.

The Expectations of Predictability of Availability

While ensuring consistent primary storage performance is critical, the organization must also understand how it can provide consistent performance in a failed state caused by a software problem, a hardware failure, a natural disaster or a cyber-attack. Most organizations may accept a minor drop-off in performance, but the failed-state performance needs to be consistent.

For example, if on a scale of one to 10, production performance is always an eight, it may be acceptable that failed state performance is a six. But it must always be a six and not drop to a four under duress.

The organization also will need a consistent cutover expectation. It may take two hours to restart operations at the DR site, but it should always take two hours no matter what triggered the disaster declaration.
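One way to make that expectation concrete is to record the results of each DR test and check whether they stay within a narrow band around the agreed target. The sketch below is only an illustration: the function name `is_predictable`, the tolerance values and the sample measurements are all hypothetical, not taken from any product.

```python
def is_predictable(samples, target, tolerance):
    """Return True when every sample lies within `tolerance` of `target`."""
    return all(abs(sample - target) <= tolerance for sample in samples)


# Hypothetical failed-state performance scores (1 to 10 scale) from four DR tests.
failed_state_scores = [6, 6, 6, 4]

# Hypothetical cutover times, in hours, from the same four tests.
cutover_hours = [2.0, 2.1, 1.9, 2.0]

# The score of four breaks the "always a six" expectation.
print(is_predictable(failed_state_scores, target=6, tolerance=0.5))

# Cutover stays close to the two-hour expectation in every test.
print(is_predictable(cutover_hours, target=2.0, tolerance=0.25))
```

The point of the check is that a single outlier fails it, even when the average looks fine. Predictability is about the worst test, not the typical one.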

Creating Predictable Availability

The first step towards ensuring predictable availability is to protect data frequently to meet recovery time and recovery point objectives (RTO/RPO). The primary storage system can help meet these objectives since most systems have snapshot capabilities.

Snapshots provide for rapid data capture, with minimal storage space consumption. Snapshots, in most cases, will provide protection from application corruption or user error. However, they are vulnerable to hardware failure, site failure and cyber-attack. To protect from these threats, the snapshot needs to be copied to a secondary storage system as quickly as possible. In addition, the copy of the snapshot should be intelligent, only copying the data that changed between protection events.
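To illustrate what “only copying the data that changed” means, here is a minimal sketch that compares two snapshots block by block and returns just the blocks that differ. It is an assumption-laden toy: real storage systems typically track changed blocks in their own metadata rather than hashing every block, and the function names, the tiny block size and the sample data are invented for the example.

```python
import hashlib


def block_digests(data, block_size):
    """Hash each fixed-size block of a snapshot image."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(previous, current, block_size=4):
    """Return {block_index: block_bytes} for blocks that are new or differ
    from the previous snapshot. Shrinking volumes are not handled here."""
    old_digests = block_digests(previous, block_size)
    changed = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_digests) or old_digests[index] != digest:
            changed[index] = block
    return changed


# Hypothetical snapshot contents: block 1 was modified and block 3 was added.
previous_snapshot = b"AAAABBBBCCCC"
current_snapshot = b"AAAAXXXXCCCCDDDD"

print(changed_blocks(previous_snapshot, current_snapshot))
```

Only two of the four blocks would travel to the secondary storage system in this example, which is what keeps frequent protection events practical.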

Copying data to a secondary storage system protects against hardware failures and most cyber-attacks, but not against site failure or every cyber-attack. Data should also be replicated to a secondary site, either one the organization owns or one it “rents” in the cloud. This off-site copy provides protection from a site-wide disaster. One final step that is returning to popularity is making an additional copy to tape media. The offline copy of data that tape provides is the ultimate protection from cyber-attack.

The Predictability of Recovery

The next step is for IT to provide predictable time frames for recovery, also known as the recovery time objective (RTO). Generally, a specific time should be set for each of the potential failures: software or user error, hardware failure (server or storage), site failure, and cyber-attack. As long as the organization is capturing data frequently enough, the only variable to eliminate is the time to re-position data to make it accessible.

A software or end-user error should be resolved with a snapshot rollback. Hardware failure requires the use of the copy on secondary storage. The key is for the data protection solution to provide a recovery-in-place capability so that data can be accessed directly from the secondary storage system, thereby eliminating or minimizing the need for network data transfers.

Recovery from a site-wide disaster obviously has its own unique logistics to factor in, including networking and potentially relocating users. Mission-critical applications likely need a near real-time copy of data in the remote location, created by the storage system’s replication capabilities. Mission-critical applications may also need standby compute for rapid recovery. Other applications could recover from the remote secondary storage system, but once again, recovery-in-place functionality will help speed up recovery efforts.
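The idea of a fixed recovery time per failure type can be written down as a simple table that DR tests are checked against. The sketch below is hypothetical throughout: the names `RecoveryPlan`, `PLANS` and `meets_rto` are invented, and the RTO figures are placeholders (apart from reusing the two-hour DR cutover example from earlier). The recovery source listed for a cyber-attack is also an assumption, based on the earlier point about offline tape copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    source: str       # where the data is recovered from
    rto_minutes: int  # hypothetical target, set by the organization


# One plan per failure type named in the text. All values are placeholders.
PLANS = {
    "software_or_user_error": RecoveryPlan("snapshot rollback on primary storage", 15),
    "hardware_failure": RecoveryPlan("recovery-in-place from secondary storage", 60),
    "site_failure": RecoveryPlan("replicated copy at the remote site", 120),
    "cyber_attack": RecoveryPlan("secondary or offline (tape) copy", 480),
}


def meets_rto(failure_type, observed_minutes):
    """Return True when an observed recovery finished within its target."""
    return observed_minutes <= PLANS[failure_type].rto_minutes


# Hypothetical DR test result: a hardware failure recovered in 45 minutes.
plan = PLANS["hardware_failure"]
print(plan.source, meets_rto("hardware_failure", 45))
```

Keeping the targets in one place makes it easy to see which failure types have been tested and which recoveries are drifting past their committed time.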

In a disaster, it is also important not to try to boil the ocean by attempting to recover everything at once. Only the most active data and applications need priority. The data protection application should provide the ability to prioritize this type of recovery.
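Prioritized recovery can be pictured as sorting applications into waves and recovering one wave at a time. The sketch below is a minimal illustration under assumed inputs: the `App` fields, the priority scale, the application names and the wave size are all hypothetical, and a real data protection product would expose its own way of setting this order.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int  # 1 = most critical (hypothetical scale)
    size_gb: int


def recovery_waves(apps, wave_size):
    """Order apps by priority, then by size, and split them into waves."""
    ordered = sorted(apps, key=lambda app: (app.priority, app.size_gb))
    return [
        [app.name for app in ordered[start:start + wave_size]]
        for start in range(0, len(ordered), wave_size)
    ]


# Hypothetical application inventory.
apps = [
    App("file-archive", priority=3, size_gb=900),
    App("order-db", priority=1, size_gb=200),
    App("email", priority=2, size_gb=150),
    App("erp", priority=1, size_gb=400),
]

for number, wave in enumerate(recovery_waves(apps, wave_size=2), start=1):
    print(f"Wave {number}: {wave}")
```

Within the same priority, the smaller application goes first here so that something critical is back online as early as possible. That tie-breaker is a design choice for the example, not a rule.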

The Predictability of Recovered Performance

The final step is predicting performance during the failed state. This means the hardware used in recovery, while not necessarily performing to the level of normal production systems, should offer reasonable performance so the organization can get back to work. The motivation to cut recovery and DR costs is always high. IT needs to be careful, however, not to slash costs to the point of jeopardizing predictable performance in the event of a disaster.

Evolving from Protection to Availability

Designing an Always Predictable data center requires multiple moving parts, often from more than one vendor. The data protection solution needs to evolve from something that is bolted onto the back of the data center into something that sits at its center: an availability solution. An availability solution integrates with primary storage and its snapshots to fulfill the secondary and off-site copy requirements, and it re-positions data as needed based on the threat at hand.

To learn more, register for the webinar from Storage Switzerland, NetApp and Veeam, “Private Cloud transformation with All-Flash FAS performance and Always-On Availability”. In the webinar we will cover:

How to reliably meet application performance service levels in private cloud environments
How to enhance recovery point and recovery time objectives for all applications and data
Why integrated solutions deliver simplified IT operations, lower costs and improved agility

About George Crump

George Crump is the Chief Marketing Officer at VergeIO, the leader in Ultraconverged Infrastructure. Prior to VergeIO he was Chief Product Strategist at StorONE. Before assuming roles with innovative technology vendors, George spent almost 14 years as the founder and lead analyst at Storage Switzerland. In his spare time, he continues to write blogs on Storage Switzerland to educate IT professionals on all aspects of data center storage. He is the primary contributor to Storage Switzerland and is a heavily sought-after public speaker. With over 30 years of experience designing storage solutions for data centers across the US, he has seen the birth of such technologies as RAID, NAS, SAN, Virtualization, Cloud, and Enterprise Flash. Before founding Storage Switzerland, he was CTO at one of the nation's largest storage integrators, where he was in charge of technology testing, integration, and product selection.
