When the subject of application or data availability comes up in IT meetings, the discussion usually focuses on what to do when an application fails or when the storage system that holds that application’s data fails. Occasionally, with a savvy IT team, the scope expands enough to include the rest of the application stack, such as web servers and database servers. Poor performance, however, is often left out of the conversation. The problem is that in today’s data center, poor performance is as bad as, if not worse than, actual system downtime.
The impact of poor performance
Poor performance can come from a variety of sources, and the tools to diagnose and correct performance problems are almost endless. The problem is that very few availability products on the market monitor for poor application performance and carry predefined procedures that can be executed when performance does not meet a service level agreement (SLA). Even application monitoring tools that can identify the problem typically lack a failover mechanism, or those predefined procedures, that can be executed during a failure. Users and customers are expected to limp through the problem until IT has the time to identify that one exists, pinpoint its cause and then take corrective measures. Worse, those corrective measures, especially since they are often ad hoc, may actually cause further performance degradation or even downtime while they are enacted. Whether the system is down or merely slow to respond, customers and prospects quickly move on to the next option.
Poor performance is worse than downtime
When the performance of an application suffers, the user reaction can be worse than when the application is down outright. In neither case are users happy, but at least when the application is down the user knows what the problem is; they typically get an immediate message that something is wrong. It is a classic on vs. off situation. Users at least know the application is “off”, and while they can’t do anything about it, they know there is a problem.
Most IT reporting tools will alert when an application is totally down, so IT likely knows about the problem as well. When users call the help desk, there is some comfort in knowing that IT is aware of the problem and working to fix it. Even if this is an external application used by the public, users at least get an immediate error message in their browser. Again, it is an “off” situation. The external or public user will not waste time troubleshooting their own environment and network; they will know the problem is not theirs. The public consumer may still leave due to satisfaction issues, so it is still very important to protect against these failures, but they are less infuriating than a slow-response issue. Finally, in most cases IT and the users know the source of the problem: a server crashed, the application halted or a storage system failed. IT is well on its way to fixing it at this point, and the team looks informed.
Performance is a different situation: the application is responding, just much more slowly. Nobody knows why, and IT may not even be aware of it. When users call the help desk to complain, this may be the first IT has heard of a problem, which adds a lack of confidence to the already full list of user concerns. Worse, if the user is external, they get no immediate notification; systems are simply slow to respond. They may waste time troubleshooting their own systems and network connections only to learn that the problem was elsewhere, which of course adds to the frustration.
Performance aware availability solutions
To fix these situations, IT managers need to begin implementing availability solutions that protect them not only from a site disaster or catastrophic failure but also from a slow-responding server, network connection or storage system. Accomplishing this goal requires application-aware availability solutions, like those from The Neverfail Group, that can protect against both a site disaster and the impact of an underperforming application.
Solutions like Neverfail’s constantly monitor the applications they already provide high availability (HA) for, watching for a lack of application response. If one of those situations occurs, a series of predefined steps and processes can be triggered, automatically or manually, to help resolve the issue. Some issues can be solved in place through Neverfail’s ability to detect and correct common system performance problems.
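To make the pattern concrete, here is a minimal sketch of that monitor-then-remediate loop. This is a hypothetical illustration only, not Neverfail’s implementation; the SLA value, the probe, and the action names are all assumptions. The idea is that a synthetic transaction is timed against the SLA, and an ordered list of predefined steps is walked, gentlest remedy first, until the probe passes again.

```python
import time

# Hypothetical sketch of performance-aware remediation. None of these names
# come from any vendor's product; they only illustrate the pattern described
# above. A "probe" acts as a synthetic user of the application and reports
# whether a transaction succeeded.

SLA_SECONDS = 2.0  # assumed service-level target for a single transaction

def within_sla(probe, sla=SLA_SECONDS):
    """Run one synthetic transaction and check it against the SLA."""
    start = time.monotonic()
    ok = probe()                      # e.g. log in, run a canned query
    elapsed = time.monotonic() - start
    return ok and elapsed <= sla

def remediate(probe, actions, sla=SLA_SECONDS):
    """Walk an ordered list of predefined remediation steps.

    Each action is a (name, callable) pair: restart a service, clear a
    queue, fail over to a standby. Stop as soon as the probe meets the
    SLA again; return the names of the steps that were executed.
    """
    executed = []
    for name, action in actions:
        if within_sla(probe, sla):
            break                     # performance restored, stop escalating
        executed.append(name)
        action()                      # in-place fixes first, failover last
    return executed
```

Ordering the list from least to most disruptive mirrors the article’s point: try to solve the issue in place, and fall back to moving the application only if the gentler steps fail.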
Should the problem not be resolved by these remedial actions, Neverfail can move the application to a new server, a new storage system or a new virtual host. This allows the application, and the service it provides, to return to normal performance levels while the original system and its environment are analyzed to determine exactly what is wrong. It also provides a safe zone for the application while remedies for the original environment are identified and implemented. Once the corrections have been safely made, the corrected environment can be updated from the failed-over state with the latest data set. The failed-over state can remain intact until the corrected original has been verified under a full performance load.
Some virtualized environments can monitor virtual machines and perform automatic load balancing based on host processor utilization or memory contention. This is not the same as performance-aware availability. Applications running on VMs can respond poorly to users without ever stressing the host-level thresholds for processor, memory and I/O. The availability application needs to communicate with and understand the application itself to know its performance state. It almost has to be a “user” of the application.
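The gap between the two kinds of checks can be sketched in a few lines. This is an illustrative assumption, not any hypervisor’s real API: the threshold values and function names are made up. An application stalled on a remote lock or a slow backend burns almost no CPU, so the host-level check passes while the timed synthetic transaction fails.

```python
import time

# Hypothetical contrast between host-level and application-level health
# checks. Host metrics (the kind a VM load balancer watches) can look fine
# while the application itself is slow; only timing a real transaction,
# acting as a "user" of the application, reveals the performance state.

def host_metrics_ok(cpu_pct, mem_pct, cpu_limit=80.0, mem_limit=80.0):
    """Host-level check: utilization thresholds only."""
    return cpu_pct < cpu_limit and mem_pct < mem_limit

def application_ok(transaction, sla=2.0):
    """Application-level check: run a synthetic transaction and time it."""
    start = time.monotonic()
    transaction()
    return time.monotonic() - start <= sla
```

A VM idling at 10% CPU while every user request waits on a hung database would pass `host_metrics_ok` yet fail `application_ok`, which is exactly why load balancing on host metrics alone is not performance-aware availability.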
The next important step, after gaining the ability to detect these application failure situations, is to have those predefined steps in place. Disasters typically provide one of two time windows in which IT can react. The first is a few hours or a few days, as when a hurricane is heading toward the data center; in this situation, systems can be carefully analyzed as they fail. The other is the instant window in which no warning is given; an earthquake or fire is a good example. In the instant window, failover has to just happen or, at worst, be triggered by a single button click. Poor performance is one of those disasters that strikes without warning, and it is well served by a predefined series of steps.
Availability solutions are often thought of as something used only in case of disaster. Instead, they should be looked to whenever IT must respond to an unplanned event, and impacted system performance is one of the most common unplanned events. With predefined steps in place, performance problems can be navigated without impacting users while diagnosis improves. Availability solutions like those from Neverfail protect your users from downtime and other impacts to availability by protecting against a wide range of failures, including those that might otherwise be an overlooked inconvenience. Ensuring availability through all these events helps maximize customer satisfaction.
Neverfail Group is a client of Storage Switzerland