Articles on the Storage Switzerland web site that focus on SSD reliability are consistently among the most-read articles on our site. Clearly there is concern about using the technology in the enterprise, and IT planners want to know more about what causes the problems. The payoff for understanding and effectively working around these issues is increased performance, something that every data center, regardless of size, ultimately needs.
SSD reliability is a bit of a misnomer. The focus should instead be on SSD durability. A reliability problem implies that an SSD will fail shortly after being installed in a storage system. While SSDs can certainly fail out of the box, the chances of that are no greater than the chances of a hard disk drive failing out of the box. A durability problem means that somewhere down the road, say 2 to 3 years, the SSD will wear out and fail.
As we discussed in our report “Why Flash Wears Out and How To Make it Last Longer”, SSDs are most commonly made from flash memory modules, which can only be written to a finite number of times. The number of writes they can sustain depends partly on the type of flash used and increasingly on the software that the flash vendor applies to its technology.
Know Your Controller Vendor
The first step is to understand how much of the SSD's capacity has been over-provisioned. Most enterprise flash vendors set aside a certain amount of capacity that is hidden from the server or storage system in which the SSD is installed. Referred to as over-provisioning, this extra memory is used to increase the life expectancy of the flash module. As we discuss in our article “The Cost of Over-Provisioning Flash Arrays”, there is a trade-off, though: each percent of capacity allocated to over-provisioning raises the price per usable GB of what is already high-cost, premium storage.
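To make that trade-off concrete, here is a minimal sketch of the arithmetic. It assumes over-provisioning is expressed as the fraction of raw capacity hidden from the host (vendors may define the percentage differently), and the function name and prices are hypothetical, chosen only for illustration.

```python
def cost_per_usable_gb(raw_cost_per_gb: float, reserved_fraction: float) -> float:
    """Price per GB the host can actually use.

    raw_cost_per_gb   -- price per GB of raw flash (hypothetical figure)
    reserved_fraction -- share of raw capacity hidden for over-provisioning,
                         e.g. 0.20 for 20% (assumed definition)
    """
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    # Only (1 - reserved_fraction) of each raw GB is visible to the host.
    return raw_cost_per_gb / (1.0 - reserved_fraction)


if __name__ == "__main__":
    # Illustrative numbers only: $2.00 per raw GB at several reserve levels.
    for reserved in (0.07, 0.20, 0.28):
        print(f"{reserved:.0%} reserved -> ${cost_per_usable_gb(2.00, reserved):.2f} per usable GB")
```

Under these assumptions, reserving 20% of the raw flash turns a $2.00 raw GB into a $2.50 usable GB, which is why the amount of over-provisioning is worth asking about.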
The next step in overcoming the flash durability challenge, and in making sure that not too much memory is allocated to over-provisioning, is analyzing the controller technology. This is the intelligence in the flash module that performs a variety of functions, including managing writes. In our article “Improving SSD Performance Through Better Flash Management”, we discuss how controllers are becoming increasingly sophisticated and, as a result, can now improve flash write endurance beyond the flash manufacturer's specifications.
For example, some controllers can adjust how they write depending on I/O load so that they, in effect, write “softer” to the flash and lower each write’s impact on the life of the flash.
The third step is to apply the same data protection best practices used for managing an HDD-based storage system. In a server, this likely means mirroring SSDs or, at the very least, using the SSD only as a read cache. In a storage system, this means using some form of RAID to protect against SSD failure. Like hard drives, SSDs will eventually fail. The over-provisioning and advanced controller technologies described above help ensure that all the modules don’t fail at the same time. A traditional data protection technology makes sure that a single failure does not put data at risk.
The final step is to monitor the amount of life that the SSD or flash module has remaining. Most flash and SSD vendors have a way to report flash life utilization. In fact, many storage system vendors are putting a flash life indicator into their storage management GUI. Make sure the vendor you select includes this metric and that you understand how to capture this important statistic. As a rule of thumb, once a flash device's remaining life falls below 20%, it should be replaced during the next maintenance window.
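The rule of thumb is simple enough to automate. The sketch below assumes you have already collected a remaining-life percentage for each drive from whatever interface your vendor provides; the drive names, the input dictionary and the function name are hypothetical, and no particular vendor tool is implied.

```python
from typing import Dict, List

REPLACE_THRESHOLD_PCT = 20.0  # rule of thumb from the article


def drives_to_replace(life_remaining_pct: Dict[str, float],
                      threshold_pct: float = REPLACE_THRESHOLD_PCT) -> List[str]:
    """Return the drives whose remaining life has fallen below the threshold.

    life_remaining_pct maps a drive name to its reported remaining life
    (0-100). How that figure is obtained is vendor specific and is not
    modeled here.
    """
    return sorted(name for name, pct in life_remaining_pct.items()
                  if pct < threshold_pct)


if __name__ == "__main__":
    # Hypothetical readings, for illustration only.
    readings = {"ssd0": 64.0, "ssd1": 18.5, "ssd2": 20.0}
    for name in drives_to_replace(readings):
        print(f"{name}: schedule replacement for the next maintenance window")
```

With these sample readings only ssd1 is flagged; ssd2 sits exactly at 20% and has not yet fallen below the line.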
In a flash-heavy environment, monitoring should be at least a weekly task; in a less flash-heavy environment, every two weeks should suffice. It is also important not to wait until you are two years into your flash implementation to start monitoring drive life. While most vendors rate their flash life in years, as we discuss in the video below, flash life is really more like tire tread. The more you drive, or in our case write data, the sooner the flash will wear out.
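Because wear tracks writes rather than the calendar, two readings taken a known number of days apart give a rough estimate of when a drive will reach the 20% replacement line. The sketch below is a simple linear extrapolation and assumes the write load stays roughly constant, which real workloads may not; the function name and sample figures are hypothetical.

```python
from typing import Optional


def days_until_threshold(earlier_pct: float, later_pct: float,
                         days_between: float,
                         threshold_pct: float = 20.0) -> Optional[float]:
    """Estimate days until remaining life drops to threshold_pct.

    earlier_pct / later_pct -- remaining-life readings (0-100), oldest first
    days_between            -- days elapsed between the two readings
    Returns 0.0 if the drive is already at or below the threshold, and
    None if no wear was observed (no estimate is possible).
    """
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    if later_pct <= threshold_pct:
        return 0.0
    wear_per_day = (earlier_pct - later_pct) / days_between
    if wear_per_day <= 0:
        return None
    # Linear extrapolation: assumes the same write rate continues.
    return (later_pct - threshold_pct) / wear_per_day


if __name__ == "__main__":
    # Hypothetical readings: 90% two weeks ago, 88% today.
    estimate = days_until_threshold(90.0, 88.0, 14)
    print(f"Estimated days until 20% life remaining: {estimate:.0f}")
```

Re-running the estimate at every monitoring interval shows whether the write load, and with it the wear rate, is changing.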
The reality is that flash can be more reliable than HDDs and is certainly more predictable. Achieving better reliability requires knowing how much of the flash is allocated to over-provisioning, who manufactures the flash controller and what that manufacturer does to increase flash durability. Better predictability requires setting procedures to confirm that flash is wearing out at the expected rate and being prepared when it needs to be replaced.