Is it Time to Stop Worrying about Flash Wear Out?

Flash has made its way into enterprise data centers, replacing or augmenting hard disk drives (HDDs) in more and more servers. But many users seem to fixate on one characteristic of NAND flash solid state drives (SSDs): the fact that they wear out. There’s a common assumption that SSD wear leads to failure and potential data loss, and that flash’s finite lifespan means it will somehow deliver less total value. This article compares wear, risk and value in SSDs and HDDs to see if it’s time to stop worrying about flash wear out.

Wear, Risk and Value

Components and sub-systems eventually wear out, and when they do they can fail. In IT, and especially in storage, failure can mean data loss, so wear in storage systems carries risk. The other concern is overall value: when a component wears out, its useful life is essentially over and it can’t provide any more value. So the concern over wear is that it causes a potential increase in risk and a potential decrease in value.

HDD Wear

This paradigm of wear and wearing out in a storage context comes from HDDs. These highly complex electro-mechanical sub-systems essentially run until they fail; that’s when they’re worn out. Drive failure can mean loss of data or data availability, so wear usually brings a certain amount of risk.

There are a number of analog and digital indicators that precede a drive failure, but these are more like alarms than tools for managing wear and avoiding wear out. HDDs don’t come with an expected lifespan rating and don’t really have a measurable indicator of wear that shows what portion of their useful life remains.

SSDs, on the other hand, have a very different wear and wear-out scenario. SSD wear is gradual and predictable, and an SSD doesn’t present any real risk of data loss when it’s “worn out”. To see how this happens, let’s take a closer look at how flash SSDs wear.

How NAND Flash Wears

Each digital value in an SSD (a 0 or 1) is represented by a voltage threshold on a transistor ‘gate’ within each bit cell of a flash chip. That digital value is changed by increasing (or decreasing) the number of electrons trapped on that gate, which pushes its voltage level above or below the threshold. To do this, electrons are forced through an insulator layer, a process that changes its conductivity. The physical act of moving electrons through that insulator layer can cause it to degrade, or wear, over time.

As the insulator wears, it becomes less effective at trapping electrons, which makes the properties of the semiconducting layer less stable and less consistent. This results in variations in voltage levels within the bit cell and an increase in bit errors.

How the Flash Controller Handles Wear

All flash cells incur bit errors, which the flash controller corrects through various error-correction routines. As a flash cell wears, these bit errors can increase, causing the controller to invoke correction processes more frequently. Eventually the errors may cause the SSD to exceed its specified uncorrectable bit error rate, at which point the SSD is ‘worn out’. But this process is gradual and predictable, and the point at which the performance or reliability of the flash device becomes unacceptable is easily identified. For this reason, flash vendors specify the total number of bytes each SSD can successfully write before it reaches this threshold, a spec called Total Bytes Written (TBW).

What Vendors do about SSD Wear

This insulator wear is a fact of life for SSDs, so manufacturers are taking steps to address the problem in a couple of ways. They’re devising processes to maximize the useful life they can get out of a given device and are developing methods to help users manage this wearing process and replace these devices in a controlled fashion. This enables IT to plan a replacement cycle and avoid the surprise so often seen with catastrophic HDD failure.

Maximize Flash Life, Maximize SSD Useful Life

NAND flash is erased in full block increments, which creates the need for additional data handling: ‘good’ data must be copied out of partially used blocks and consolidated elsewhere before those blocks can be erased. This copying of data generates extra write cycles (called write amplification), causing additional wear on the SSD. Manufacturers are developing sophisticated controller processes that minimize this write amplification. Modern operating systems do their part to extend SSD life as well, aligning I/Os to 4K boundaries on writes to improve write efficiency and reduce the consolidation process mentioned earlier.
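
To make write amplification concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's controller logic: the function names and the 256-pages-per-block figure are assumptions, and the model ignores over-provisioning, caching and wear leveling. It simply counts how many pages the flash has to write for every page of new host data when a block is reclaimed.

```python
def write_amplification(host_bytes: float, nand_bytes: float) -> float:
    """Bytes physically written to flash divided by bytes the host asked to write."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return nand_bytes / host_bytes


def reclaim_write_amplification(pages_per_block: int, valid_pages: int) -> float:
    """Write amplification for reclaiming one block (simplified, hypothetical model).

    Before the block can be erased, its still-valid ('good') pages are copied
    elsewhere. Only the remaining pages become free space for new host data.
    """
    freed_pages = pages_per_block - valid_pages
    if freed_pages <= 0:
        raise ValueError("block has no reclaimable pages")
    # Flash writes = new host pages that fit in the freed space + copied good pages.
    nand_pages = freed_pages + valid_pages
    return write_amplification(freed_pages, nand_pages)


# Assumed block size of 256 pages, for illustration only.
mostly_stale = reclaim_write_amplification(256, 64)   # block holds little good data
mostly_valid = reclaim_write_amplification(256, 192)  # block holds mostly good data
print(f"{mostly_stale:.2f}x vs {mostly_valid:.2f}x")
```

In this simplified model, the more good data a block still holds when it is reclaimed, the more copying is needed per byte of new data, which is why reducing consolidation work extends drive life.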

Manage Flash Wear

Even more important than minimizing wear is the ability to manage an SSD’s lifespan. Since the TBW spec for a flash drive is known, it’s simple for flash vendors to display the amount of life remaining, like a gas gauge. Also, flash life is expressed in terms of data bytes written, not in terms of hours the device has been running. This makes it easy for users to plan for replacement of an SSD before it reaches its TBW limit. And if the drive does reach that limit, the data can still be read, allowing for a simple migration to the new flash device.
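
The gas gauge idea reduces to simple arithmetic. The Python sketch below is a hypothetical illustration, not a vendor tool: the function names are made up, and the bytes-written figure and daily write rate are assumed inputs that would in practice come from the drive's own reporting. The 3.5PB rating is the TBW figure used later in this article.

```python
def life_remaining_pct(bytes_written: float, tbw_bytes: float) -> float:
    """Percentage of rated write life left, clamped at 0 once TBW is reached."""
    used_fraction = min(bytes_written / tbw_bytes, 1.0)
    return 100.0 * (1.0 - used_fraction)


def days_until_tbw(bytes_written: float, tbw_bytes: float, bytes_per_day: float) -> float:
    """Days until the TBW rating is reached at a steady daily write rate."""
    if bytes_per_day <= 0:
        raise ValueError("bytes_per_day must be positive")
    remaining_bytes = max(tbw_bytes - bytes_written, 0.0)
    return remaining_bytes / bytes_per_day


TBW_BYTES = 3.5e15        # 3.5PB rating (decimal petabytes assumed)
written_so_far = 0.7e15   # assumed: 0.7PB written to date
daily_writes = 1e12       # assumed: 1TB written per day

print(f"{life_remaining_pct(written_so_far, TBW_BYTES):.0f}% of rated life remaining")
print(f"{days_until_tbw(written_so_far, TBW_BYTES, daily_writes):.0f} days to TBW at this rate")
```

Because the gauge depends only on bytes written, a lightly written drive can stay in service far longer than a calendar-based replacement cycle would suggest.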

Value

The second cause for concern about flash wear is the question of value. Again, using the HDD paradigm, the common metric is cost per GB. But since every storage device will be filled thousands of times in its lifespan, a better metric would be the number of GBs the drive can write and successfully read back over that lifespan. A comparison of HDDs and SSDs in this context is very interesting.

If we take a typical HDD measured performance spec of 477 IOPS* and run it 24 x 7, we come up with ~15B I/Os per year. An enterprise-grade SSD rated at ~25,000 IOPS** can produce over 750B I/Os per year, 50x that of the HDD. Over a 5-year lifespan the HDD would produce ~75B I/Os and the SSD about 3.75 trillion I/Os, theoretically. But given this same enterprise SSD’s TBW limit, the maximum is 1.4 trillion I/Os*** and the comparison drops to a ~19x increase in useful work done by the storage device. The bottom line is that flash SSDs provide more value, in terms of ability to do useful work, over their lifespan, even when ignoring their considerable performance advantage.

* A 2.5” SAS HDD running an 8K random workload, 65% reads and 35% writes

** Micron’s 410m 2.5” 200GB SAS SSD

*** (3.5PB TBW / 8KB per random write) x 3 (assuming 1/3 of total I/Os are writes)
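
The arithmetic behind these figures can be reproduced with a short Python sketch. The IOPS, TBW, I/O size and write fraction come from the text and footnotes above; the function names are illustrative, and treating PB and KB as binary units (2^50 and 2^10 bytes) is an assumption that reproduces the ~1.4 trillion figure. With decimal units the cap works out to about 1.3 trillion I/Os and the ratio to about 17x. Note also that the unrounded 5-year SSD total is about 3.9 trillion; the 3.75 trillion above comes from rounding the yearly figure down to 750B.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def ios_over_life(iops: float, years: float) -> float:
    """Total I/Os a device can service running flat out, 24 x 7."""
    return iops * SECONDS_PER_YEAR * years


def tbw_limited_ios(tbw_bytes: float, io_size_bytes: float, write_fraction: float) -> float:
    """Total I/Os (reads plus writes) completed by the time TBW is exhausted."""
    write_ios = tbw_bytes / io_size_bytes
    return write_ios / write_fraction


hdd_ios = ios_over_life(477, 5)        # roughly 75 billion
ssd_ios = ios_over_life(25_000, 5)     # roughly 3.9 trillion, before the TBW cap

# Assumption: binary units (1PB = 2**50 bytes, 1KB = 2**10 bytes).
ssd_cap = tbw_limited_ios(3.5 * 2**50, 8 * 2**10, 1 / 3)   # roughly 1.4 trillion

useful_ratio = min(ssd_ios, ssd_cap) / hdd_ios
print(f"HDD {hdd_ios:.3g} I/Os, SSD capped at {ssd_cap:.3g} I/Os, ratio {useful_ratio:.1f}x")
```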

Conclusion

Conventional thinking about storage holds that wear leads to failure and failure brings risk and a reduction in value. But that paradigm is based on hard disk drive technology. Flash SSDs do wear, but they don’t fail when they ‘wear out’ and that wear is manageable. This means they don’t present the same risk as HDDs, which can fail catastrophically, and they deliver significantly more value over their lifespan, based on the amount of data they can write and reproduce accurately. They also provide significantly better performance which can enable new use cases.

Micron is a client of Storage Switzerland

Eric is an Analyst with Storage Switzerland and has over 25 years of experience in high-technology industries. He’s held technical, management and marketing positions in the computer storage, instrumentation, digital imaging and test equipment fields. He has spent the past 15 years in the data storage field, with storage hardware manufacturers and as a national storage integrator, designing and implementing open systems storage solutions for companies in the Western United States. Eric earned degrees in electrical/computer engineering from the University of Colorado and marketing from California State University, Humboldt. He and his wife live in Colorado and have twins in college.

12 comments on “Is it Time to Stop Worrying about Flash Wear Out?”
  1. Stefan says:

    Under what write amplification assumption is the Total Bytes Written value calculated? In my opinion this depends heavily on the situation. E.g. a 100% full SSD with 90% static data and 10% writes to the same blocks will have much less usable TBW.

    • Keith says:

      Stefan, all SSDs have “wear leveling”, so if you keep writing to the same small area, they will move that data and exchange it with the static data. This way all the transistors will be used evenly, maximizing the drive life.

    • Eric Slack says:

      Thanks for your response Stefan. You’re right on the write amplification comment – it can impact TBW, although all flash vendors overprovision their devices to accommodate this. But the point of that Value section is to show that SSDs can write much more data than can be written to an HDD in a 5 year lifespan.

      • Doug Rollins says:

        SSD vendors should disclose the conditions under which stated TBW numbers are validated. This tends to be a bit more precise because it is more direct – if a write amp value was stated instead, one would have to ‘derive’ the workload that induced that write amp, then see how close to the intended workload it was. Instead – look for explicit statements, like “4K 100% random write, 100% full” or other such precision. But as Eric noted – TBW is usually a warranty-related statement. Also as Eric noted – if we compare the TBW of most any Enterprise SSD to the amount of data an HDD can write during its normal life cycle (assuming a platform refresh every 5 years), the difference is clear: the SSD can do more useful work no matter how it is measured. If we look at useful work before we reach TBW or if we intentionally ‘slow’ the SSD so it reaches TBW in 5 years, the net useful work done by the SSD eclipses that done by the HDD.

  2. hubbert says:

    Great article, in simple terms:
    theoretical – thru a 6Gb/s pipe, 80/20 read/write ratio, how long will it take to fill a 1TB drive 20,000 times?
    empirical – all enterprise SSD vendors offer a warranty, if wear-out really imposed a warranty exposure, it would be reflected in warranty policy and public QA data. that hasn’t happened.

    • Eric Slack says:

      Hello Hubbert – thanks for the comment.
      For your “theoretical” question at a 20% write ratio (only writes cause wear) we’ve got a 1.2Gb/s data rate, which is 0.15GB/s. When we divide that into 20,000 TB we get about 133 million seconds or roughly 37 thousand hours, a little over 4 years of continuous writing at full line rate.
      Your “empirical” comment is exactly the point of the article – SSD vendors know how long their stuff lasts and good vendors will warranty it. You just have to be confident about who you’re buying from.

  3. […] You can learn more about this by reading Eric's article by clicking here. […]

  4. Bryan says:

    You will still have unexpected SSD failures due to FW corruption and/or electronic components that fail. That being said, I think that SSDs can be more reliable than HDDs.

    • Eric Slack says:

      You’re right Bryan, there are external factors that increase failure rates. But to your point, HDDs have components that can fail as well. The no moving parts aspect to SSDs is a huge advantage here.

  5. […] State Drives (SSD).   Predictability   As it has been pointed out in a recent post by the Storage Swiss, the SSD is a very predictable technology. For most of us in the computer industry, we are not […]

  6. […] In many cases, it can overcome some of these scenarios. You can read more about this on the Storage Swiss blog.   I am continuing with my personal belief that SMART is worthless. It is attempting to […]
