Flash has made its way into enterprise data centers, replacing or augmenting hard disk drives (HDDs) in more and more servers. But many users seem to fixate on one characteristic of NAND flash solid state drives (SSDs): the fact that they wear out. There’s a common assumption that SSD wear leads to failure and potential data loss, and that a finite lifespan means flash will somehow deliver less total value. This article compares wear, risk and value in SSDs and HDDs to see if it’s time to stop worrying about flash wear-out.
Wear, Risk and Value
Components and sub-systems eventually wear out, and when they do they can fail. In IT, and especially in storage, failure can mean data loss, so wear in storage systems carries risk. The other concern is overall value: when a component wears out, its useful life is essentially over and it can’t provide any more value. So the concern over wear is that it causes a potential increase in risk and a potential decrease in value.
This paradigm of wear and wearing out in a storage context comes from HDDs. These highly complex electro-mechanical sub-systems essentially run until they fail; that’s when they’re worn out. And drive failure can mean loss of data or data availability, so wear usually brings a certain amount of risk.
There are a number of analog and digital indicators that precede a drive failure, but these are more like alarms than tools for managing wear and avoiding wear-out. HDDs don’t come with an expected lifespan rating and don’t really have a measurable indicator of wear that can show what portion of that useful life remains.
SSDs, on the other hand, have a much different wear and wear-out scenario. SSD wear is gradual and predictable, and a drive that is “worn out” doesn’t present any real risk of data loss. To see how this happens, let’s take a closer look at how flash SSDs wear.
How NAND Flash Wears
Each digital value in an SSD (a 0 or 1) is represented by a voltage threshold on a transistor ‘gate’ within each bit cell of a flash chip. That digital value is changed by increasing (or decreasing) the number of electrons trapped on that gate, which pushes its voltage level above or below the threshold. To do this, electrons are forced through an insulator layer, a process that changes the layer’s conductivity. The physical act of moving electrons through that insulator layer causes it to degrade, or wear, over time.
As the insulator wears, it becomes less effective at trapping electrons, and the properties of the semiconducting layer become less stable and less consistent. This results in variations in voltage levels within the bit cell and an increase in bit errors.
How the Flash Controller Handles Wear
All flash cells incur bit errors, which are corrected by the flash controller through various software routines. When a flash cell wears, these bit errors can increase, causing the controller to invoke correction processes more frequently. Eventually these bit errors may cause the SSD to exceed its uncorrectable bit error rate, at which point the SSD is ‘worn out’. But this process is gradual and predictable and the point at which the performance or reliability of the flash device becomes unacceptable is easily identified. For this reason, flash vendors specify the total number of bytes each SSD can successfully write before it reaches this threshold, a spec called Total Bytes Written (TBW).
What Vendors do about SSD Wear
This insulator wear is a fact of life for SSDs, so manufacturers are taking steps to address the problem in a couple of ways. They’re devising processes to maximize the useful life they can get out of a given device and are developing methods to help users manage this wearing process and replace these devices in a controlled fashion. This enables IT to plan a replacement cycle and avoid the surprise so often seen with catastrophic HDD failure.
Maximize Flash Life, Maximize SSD Useful Life
NAND flash is erased in full-block increments, which creates the need for additional data handling: ‘good’ data must be consolidated out of partially used blocks so those blocks can be erased. This copying of data generates more write cycles (called write amplification), causing additional wear on the SSD. Manufacturers are developing sophisticated controller processes that minimize this write amplification. Modern operating systems do their part to extend SSD life as well, aligning I/Os to 4K boundaries on writes to improve write efficiency and reduce the consolidation process mentioned earlier.
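To make write amplification concrete, here is a minimal Python sketch of a deliberately simplified model, not any vendor’s controller logic. It assumes that reclaiming one block means copying its still-valid pages elsewhere, erasing it, and then letting the host fill the freed space. The function name `write_amplification_for_block` and the 256-pages-per-block figure are illustrative assumptions.

```python
def write_amplification_for_block(pages_per_block: int, valid_pages: int) -> float:
    """Estimate write amplification for reclaiming a single flash block.

    Simplified model (an assumption, not a real controller algorithm):
    - `valid_pages` pages of 'good' data are copied to another block,
    - the block is erased,
    - the host then writes new data into the space that was freed.
    Write amplification = pages written to flash / pages written by the host.
    """
    if not 0 <= valid_pages < pages_per_block:
        raise ValueError("valid_pages must be in [0, pages_per_block)")
    host_pages = pages_per_block - valid_pages   # new data the host gets to write
    flash_pages = valid_pages + host_pages       # copied data plus new data
    return flash_pages / host_pages


if __name__ == "__main__":
    # 256 pages per block is an illustrative figure only.
    for valid in (0, 128, 192):
        waf = write_amplification_for_block(256, valid)
        print(f"{valid:3d} valid pages -> write amplification {waf:.1f}")
```

In this model, a block with no valid data left costs nothing extra to reclaim, while a block that is still three-quarters full of good data makes the flash absorb four writes for every one the host requested. That is why aligned writes and smarter consolidation translate directly into longer SSD life.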
Manage Flash Wear
Even more important than minimizing wear is the ability to manage an SSD’s lifespan. Since the TBW spec for a flash drive is known, it’s simple for flash vendors to display the amount of life remaining, like a gas gauge. Also, flash life is expressed in terms of data bytes written, not in terms of hours the device has been running. This makes it easy for users to plan for replacement of an SSD before it reaches its TBW limit. And if the drive does reach that limit, the data can still be read, allowing for a simple migration to the new flash device.
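The “gas gauge” idea comes down to simple arithmetic, sketched below in Python. This is an illustration under stated assumptions rather than any vendor’s tool: the function names are hypothetical, the bytes-written counter is assumed to come from whatever reporting utility the drive vendor provides, and the example figures (a 3.5PB rating, 1.4PB written so far, 1TB of writes per day) are made up for the demonstration.

```python
def life_remaining_pct(tbw_rating_bytes: float, bytes_written: float) -> float:
    """Percentage of rated write life left, clamped so it never goes below 0."""
    used_fraction = min(bytes_written / tbw_rating_bytes, 1.0)
    return 100.0 * (1.0 - used_fraction)


def days_until_tbw(tbw_rating_bytes: float, bytes_written: float,
                   bytes_per_day: float) -> float:
    """Days until the TBW rating is reached at a steady daily write rate."""
    if bytes_per_day <= 0:
        raise ValueError("bytes_per_day must be positive")
    remaining_bytes = max(tbw_rating_bytes - bytes_written, 0.0)
    return remaining_bytes / bytes_per_day


if __name__ == "__main__":
    PB = 1e15  # decimal petabyte (assumption)
    TB = 1e12  # decimal terabyte (assumption)
    rating, written, daily = 3.5 * PB, 1.4 * PB, 1 * TB  # illustrative values
    print(f"Life remaining: {life_remaining_pct(rating, written):.0f}%")
    print(f"Days until TBW: {days_until_tbw(rating, written, daily):.0f}")
```

Because the projection depends only on bytes written and the observed write rate, it can be rerun whenever the workload changes to keep the replacement date current.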
The second cause for concern about flash wear is the question of value. Again, using the HDD paradigm, the common metric is cost per GB. But since every storage device will be filled thousands of times in its lifespan, a better metric would be the number of GBs the drive can write and successfully read back over that lifespan. A comparison of HDDs and SSDs in this context is very interesting.
If we take a typical HDD measured performance spec of 477 IOPS * and run it 24 x 7, we come up with ~15 billion I/Os per year. An enterprise-grade SSD rated at ~25,000 IOPS ** can produce over 750 billion I/Os per year, 50x that of the HDD. Over a 5-year lifespan the HDD would produce ~75 billion I/Os and the SSD about 3.75 trillion I/Os, theoretically. But given this same enterprise SSD’s TBW limit, the maximum is about 1.4 trillion I/Os ***, and the comparison drops to a ~19x increase in useful work done by the storage device. The bottom line is that flash SSDs provide more value over their lifespan, in terms of ability to do useful work, even when their TBW limit rather than their raw performance sets the ceiling.
* A 2.5” SAS HDD running an 8K random workload, 65% reads and 35% writes
** Micron’s 410m 2.5” 200GB SAS SSD
*** (3.5PB TBW / 8KB random write) x 3 (1/3 of total I/O are writes)
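The arithmetic behind this comparison can be reproduced with a short Python sketch. It uses only the figures quoted above and in the footnotes; the unit conventions are assumptions (decimal units, so 3.5PB is 3.5 x 10^15 bytes and 8KB is 8,000 bytes, and a 365-day year), and the function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 365-day year (assumption)

HDD_IOPS = 477          # measured HDD figure from the article
SSD_IOPS = 25_000       # rated SSD figure from the article
LIFESPAN_YEARS = 5
TBW_BYTES = 3.5e15      # 3.5PB TBW, decimal units (assumption)
IO_BYTES = 8_000        # 8KB per I/O, decimal units (assumption)
WRITE_FRACTION = 1 / 3  # footnote: one third of total I/O are writes


def ios_over_lifespan(iops: float, years: float) -> float:
    """I/Os completed when running flat out, 24 x 7, for the given years."""
    return iops * SECONDS_PER_YEAR * years


def ios_allowed_by_tbw(tbw_bytes: float, io_bytes: float,
                       write_fraction: float) -> float:
    """Total I/Os (reads plus writes) before the write budget is used up."""
    write_ios = tbw_bytes / io_bytes
    return write_ios / write_fraction


hdd_ios = ios_over_lifespan(HDD_IOPS, LIFESPAN_YEARS)
ssd_theoretical_ios = ios_over_lifespan(SSD_IOPS, LIFESPAN_YEARS)
ssd_tbw_cap_ios = ios_allowed_by_tbw(TBW_BYTES, IO_BYTES, WRITE_FRACTION)
ssd_useful_ios = min(ssd_theoretical_ios, ssd_tbw_cap_ios)  # TBW is the ceiling

if __name__ == "__main__":
    print(f"HDD over 5 years:      {hdd_ios / 1e9:.1f} billion I/Os")
    print(f"SSD theoretical:       {ssd_theoretical_ios / 1e12:.2f} trillion I/Os")
    print(f"SSD limited by TBW:    {ssd_tbw_cap_ios / 1e12:.2f} trillion I/Os")
    print(f"SSD vs HDD useful work: {ssd_useful_ios / hdd_ios:.1f}x")
```

Under these assumptions the TBW-limited figure comes out near 1.3 trillion I/Os and the ratio lands in the high teens, in the same ballpark as the rounded figures quoted above. The exact result shifts with the unit convention and with whether the write share is taken as one third or as the 35% in the HDD workload, but the conclusion does not change.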
Conventional thinking about storage holds that wear leads to failure and failure brings risk and a reduction in value. But that paradigm is based on hard disk drive technology. Flash SSDs do wear, but they don’t fail when they ‘wear out’ and that wear is manageable. This means they don’t present the same risk as HDDs, which can fail catastrophically, and they deliver significantly more value over their lifespan, based on the amount of data they can write and reproduce accurately. They also provide significantly better performance which can enable new use cases.
Micron is a client of Storage Switzerland