In a recent article, Storage Switzerland discussed methods for protecting an application from a storage system failure. Storage hardware, like server hardware, is becoming increasingly reliable. The growing culprit of application downtime is the application itself, whether through application lock-up, data corruption or poor performance. The problem is that most availability solutions focus on the hardware, the operating system and the network connections, not the application. Any attempt to achieve true application availability should now include intelligence about, and an understanding of, the application itself. In short, true availability should include application awareness.
Why is Application Failure on the Rise?
Applications are no longer single entities that run on one physical server. There is often an interdependent stack of services behind what a user sees as an application. For example, the user's interface may be a web front end, but they are not aware that this front end connects to an application server, which ultimately connects to a database server. There are now more moving parts in the presentation of an application. These 'stacks' of services were created so that applications could be delivered more quickly and tuned for performance more easily. With this modularity comes complexity, and with complexity comes an increased chance of failure.
What is Application Aware?
“Application awareness” means that the software providing high availability can monitor specific processes within the application to make sure they are active and responsive. Simply monitoring hardware can produce false positives. For example, it is not uncommon for a blue-screened server to respond to a ping, and an application can certainly lock up or suffer data corruption without halting the operating system or server. The availability solution needs to know when these application stoppages occur and be able to take corrective action. Solutions also need to be more sensitive to the application than a simple on/off check. From the user's perspective, a lack of acceptable performance can be just as bad as, or worse than, the application being down.
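The distinction between hardware monitoring and application awareness can be sketched in a few lines of code. The following is a simplified, hypothetical health check (the function and threshold names are illustrative, not from any vendor's product): instead of pinging the host, it runs an application-level probe, such as a test query, and classifies the result by responsiveness, so that a slow-but-alive application is flagged as degraded rather than reported healthy.

```python
import time
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"   # up, but too slow for users
    DOWN = "down"


def check_health(probe, timeout=5.0, slow_threshold=1.0):
    """Run an application-level probe (e.g. a test query or HTTP
    request passed in as a callable) and classify the result.

    A plain ping cannot make these distinctions: a hung or corrupted
    application may still answer at the network level.
    """
    start = time.monotonic()
    try:
        probe()                      # raises an exception on failure
    except Exception:
        return Status.DOWN
    elapsed = time.monotonic() - start
    if elapsed > timeout:
        return Status.DOWN           # treat extreme latency as an outage
    if elapsed > slow_threshold:
        return Status.DEGRADED       # responsive, but unacceptably slow
    return Status.HEALTHY
```

A monitoring loop would call `check_health` per service and trigger corrective action on anything other than `HEALTHY`, which is how "a lack of acceptable performance" can be treated as seriously as an outright outage.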
The Clustering Problem
Clustering is often thought of as the 'go-to' solution for application availability, and it comes in two variants: operating system clustering and application clustering. Operating system clustering was designed to protect against a hardware failure and often does not protect against an application outage. Application-level clustering was designed to enhance performance, and while it does provide application redundancy, it is an expensive and complex way to get there.
Beyond being complex and expensive, most operating system clustering solutions lack this intrinsic knowledge of the application. Their “application” is the operating system: if the operating system is up and running, the clustering software assumes that all is well. Operating system clustering solutions also tend to be very OS-version specific; each new release brings a new version of the clustering component, requiring new training and additional upgrades.
As an alternative to operating system-level clustering, there are application-level clustering solutions from the application vendors themselves. These typically do provide application-specific monitoring, but they are often too specific: in environments with a variety of applications, each application will require its own clustering solution. Once again this adds to the expense and complexity of the availability solution, as the IT staff must be trained on each of the various products.
In addition, application clustering systems are typically unaware of the application stack described above. Most applications depend on a database back end running on its own server, as well as potentially a web front end through which users access the application. From the user's perspective, if any one of these components fails, the entire application is unavailable. But from the database's perspective, if the database is up and operational, its job is done.
Whether implemented at the operating system or the application level, many clustering solutions depend on shared storage. Again, while storage systems are reliable, the chance of a complete RAID failure or storage controller failure remains. As discussed in the recent article, even if the risk of a storage system failure is low, the ramifications of such a failure and the effort required to return to operations are overwhelming. Considering that clustering is an investment in uptime, shared storage may be too great a risk to accept.
Both of these solutions also tend to require largely identical hardware for the stand-by server, and often the stand-by server can perform only one task: waiting for the primary server to fail. This means an expensive allocation of IT budget just to maintain availability. The second node cannot serve as a standby for anything beyond that one application and, as a result, essentially sits idle waiting for a failure to occur.
Application clusters are better suited to extreme performance situations. In these cases the cluster's nodes are “active-active”, meaning the workload can be shared across multiple nodes. The reality, though, is that most data centers deploy these clusters for availability, not for performance enhancement. The overwhelming majority of applications perform well on a single server; the second server is really only needed for availability.
Finally, most of these clustered environments cannot extend the cluster over long distances. While they will protect against an internal failure, if that failure impacts the whole data center or the entire region, the application will fail as well.
Achieving Application Aware Availability
Application aware availability differs from clustering in a number of ways. First, as the name implies, it is “application aware”. This means it monitors for all the situations that could go wrong in hardware, such as storage, network or server failure, and also monitors the application itself for problems. Examples include an application freeze, data corruption, or a failure in one layer of the dependent stack, such as the web front end.
Solutions like these from the Neverfail Group can even detect performance failures, which can be just as important as a hard system failure. Lack of responsiveness can drive a user, or worse, a potential customer, elsewhere, costing the business money. Yet most clustering applications will never raise an alert that anything is wrong.
The application awareness also extends to recoverability. If there is a failure in a multi-stacked application environment, systems need to be brought up in a specific order. For example, a web front end may need the database back end to be fully up and operational before it starts. Restarting both applications at the same time may cause the web front end to come up first and then crash because the back end is not yet operational. Application aware solutions can make sure the servers start in the correct order, and are operational rather than merely powered on, before dependent servers begin their startups.
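The ordered-restart problem described above is essentially a dependency ordering problem. As a minimal sketch, assuming a hypothetical three-tier stack (the service names are illustrative, not from any particular product), a topological sort over the dependency map yields a startup order that always brings the database up before the application server, and the application server before the web front end:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stack: each service maps to the services it depends on.
dependencies = {
    "web_frontend": {"app_server"},
    "app_server": {"database"},
    "database": set(),
}


def startup_order(deps):
    """Return the services in an order that starts every dependency
    before the services that rely on it."""
    return list(TopologicalSorter(deps).static_order())
```

In a real availability product, each startup step would also wait on an application-level readiness check (not just "process started") before moving to the next service, which is the "operational, not just turned on" behavior the article describes.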
Application aware solutions cannot typically resolve performance problems the way application clusters can. As discussed earlier, application clusters, in their active-active configuration, do potentially offer a viable performance solution. However, the major culprit in application performance problems is typically not processing power but hard disk performance. If this is the case, Storage Switzerland has found that many performance issues can be resolved better and less expensively by leveraging solid state storage while keeping a single-node solution. This means an application-aware solution can still be used and performance demands met, with significant cost and complexity savings over application clustering.
Application aware solutions are also not dependent on shared storage. Instead, a private link is established between the servers. That link mirrors data, which can be stored on two separate disk storage devices. This means a customer can invest in two quality mid-range storage systems plus application aware availability software instead of one high-end, very highly available storage system. The net result is better availability and more redundancy for substantially less money.
This lack of reliance on shared storage is part of the reason the target, or standby, server can run on dissimilar hardware. While in standby mode, this server is merely a target for the primary application's data. This abstraction of the secondary server also means the standby system can actually be a virtual machine in a virtualized server infrastructure. Not only does this have obvious cost benefits, it also provides a safety net on the path to advanced virtualization. As will be discussed in the next article in this series, many virtualization projects stall when it comes time to virtualize the high-risk applications.
The cost savings from reduced complexity mentioned above cannot be overstated. Many clustering solutions end up as 'shelfware' because the complexity of maintaining the cluster is not worth the high availability they theoretically deliver. This is the worst-case scenario: like a lottery ticket thrown out before the drawing, the money has already been spent, but because of the added complexity the payoff is never realized. Application-aware availability brings an ease of use that clustering solutions simply do not have. It is cost effective, deployable and maintainable.
Neverfail Group is a client of Storage Switzerland