It is a well-known fact that each write takes a toll on the longevity of flash-based solid state storage (SSS), and that each flash module will eventually fail after reaching its write limit. Depending on the memory type, that failure can come at around 100,000 writes (SLC) or at fewer than 5,000 writes (MLC). Of course, these numbers are really just guidelines; some memory areas of each substrate will wear out faster than others, just as some hard drives fail sooner than others. It is the vendor’s responsibility to protect you from these failures and to provide maximum use of the SSS investment for as long as possible.
What Can Break?
There are three basic components of solid state storage that can fail: blocks, planes and complete chips.
Block failure: A small section of a chip within the SSS failed.
Plane failure: A larger sub-section of the chip (on the order of 1/8th of it) failed.
Complete chip failure: The entire flash memory chip failed.
The occurrence of any of these failures does not mean that the entire SSS has gone bad; it merely means that a section has failed. At that point it’s up to the intelligence built into the SSS to deal with that failure.
The most common type of failure is a block failure, and most SSS handle it by using some form of error correcting code (ECC). Most enterprise SSS can also handle the third most common type of failure, the loss of an entire flash chip, by using some form of RAID algorithm. However, protection strategies typically ignore the second most common type of failure: the failure of a plane within a chip.
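To make the RAID-like protection concrete, here is a minimal sketch, assuming a simple XOR-parity scheme (the article does not specify which RAID algorithm vendors use): each stripe spans several chips plus one parity chip, and the data lost on a failed chip is reconstructed by XORing the surviving members. All names here are illustrative.

```python
# Hypothetical sketch: RAID-like parity protection across flash chips.
# Each stripe holds several data chunks plus one XOR-parity chunk;
# the chunk on a failed chip is rebuilt from the surviving members.

def parity(chunks):
    """Compute the XOR parity of equal-length byte chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def reconstruct(stripe, failed_index):
    """Rebuild the chunk lost on the failed chip from the survivors."""
    survivors = [c for i, c in enumerate(stripe) if i != failed_index]
    return parity(survivors)

# A 4-chip stripe: three data chunks plus one parity chunk.
data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
stripe = data + [parity(data)]
assert reconstruct(stripe, 1) == b"\x03\x04"  # chip 1's data recovered
```

In a fixed-stripe design, the reconstructed data would then be written to a spare chip, which is exactly the resource the article notes can run out.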
Why Protecting against Plane Failure is Important
Most SSS, especially enterprise-class systems, are made up of dozens or even hundreds of flash chips. If a plane failure occurs, only 1/8th of one chip is actually impacted. However, the protection strategies employed by most solid state manufacturers mark the whole chip as bad, not just the plane. This is akin to having a bad sector on a mechanical drive and requiring the operating system to mark the whole platter as bad! Of course, in both cases a manufacturer’s warranty would likely be in place to come to the user’s rescue.
While a warranty is important, the act of replacement causes its own issues. The biggest is time. The SSS module has to be taken out, the replacement put into service, and the drive repopulated with data. Keep in mind that SSS is typically installed in performance-demanding, revenue-generating production environments where maintenance windows come sparingly, usually at late hours when IT personnel would rather be sleeping and overtime rates are costly. The issue is not just the time to physically make the change but also the time required to repopulate the SSS.
Again, these are performance-sensitive environments, and the applications that use them count on that performance to meet the demands and expectations of users. Even though these are flash-based systems, the time it takes to copy 1 TB of mission-critical, performance-dependent data to them may cost the organization millions in lost productivity because of lower performance while the replacement happens. Also, while several flash controllers have mitigated the time required to write to flash-based storage, writing is still the slowest operation these devices perform, and flash remains slower than DRAM-based storage.
Enter Plane Level SSS Protection
The first company to do anything about plane-level failures is Texas Memory Systems. Their newly patented Variable Stripe RAID™ (VSR™) allows continued operation when a flash plane fails, extending the mean time between failures (MTBF). Until now, most enterprise SSS have used a RAID-like technology to protect against chip failure, in which the flash media is grouped into stripes containing an equal, fixed number of chips. If all or part of a chip fails, the RAID-like technology can reconstruct the data from the failed chip and place it on a new chip drawn from a reserve of spares in the SSS. This works fine until the SSS runs out of spare chips. Since the number of chips per stripe cannot be altered, once the spares are exhausted the replacement process has to occur. Critically, this also means that if only a small part of the chip has failed, the whole chip has to be replaced, wasting the rest of the capacity on that chip.
VSR allows the number of chips per RAID stripe to vary: if a chip fails, the stripe is adjusted to use the remaining chips. Essentially, bad chips are bypassed, allowing long life even after the SSS is out of spare chips. The other interesting component of VSR is the granularity of the stripe: if only a small part of a chip fails, the rest of that chip’s capacity can still be used. The combination of these two capabilities leads to a significantly longer MTBF and a lower likelihood of going through the costly replacement process.
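The variable-stripe idea can be sketched in a few lines. This is not TMS’s actual implementation, just an illustration under the assumption that a stripe tracks its members at plane granularity: when a plane fails, only that plane is retired and the stripe continues one member narrower, while the failed chip’s other planes stay in service.

```python
# Hypothetical sketch of variable stripe width (illustrative only).
# A stripe's membership is tracked per plane, so a plane failure
# narrows the stripe instead of retiring the whole chip or
# consuming a spare.

class Stripe:
    def __init__(self, planes):
        self.planes = list(planes)  # members as (chip, plane) pairs

    def fail_plane(self, plane):
        """Drop only the failed plane; the stripe keeps operating
        with one fewer member (e.g. width 8 -> 7)."""
        self.planes.remove(plane)

# A stripe spanning 4 planes on each of 2 chips (8 members).
stripe = Stripe([("chip%d" % c, "plane%d" % p)
                 for c in range(2) for p in range(4)])
stripe.fail_plane(("chip0", "plane3"))

assert len(stripe.planes) == 7                 # stripe narrowed by one
assert ("chip0", "plane0") in stripe.planes    # rest of chip0 still used
```

The key property the sketch shows is the second one: after the failure, the other planes of the affected chip remain members of the stripe, which is exactly the capacity a whole-chip retirement scheme would waste.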
Performance-sensitive production systems that upgrade to memory-based storage quickly become dependent on that storage for operation. Having to fail over to mechanical storage while the SSS is replaced is essentially equivalent to being down. In addition, there is a laundry list of things that can go wrong during any product replacement. The better option is to leverage self-healing storage technologies like VSR that automatically work around failed components and help you avoid replacing the device in the first place.
Texas Memory Systems is a client of Storage Switzerland