FlashXE® and Error Avoidance for Reliable NAND Flash Systems
Many factors contribute to the reliability of flash memory, including error correction, write endurance management and power-fail robustness. Hyperstone controllers achieve their renowned level of reliability with the FlashXE ecosystem, which includes all of these and more.
The process starts before the system is built. We use a qualification process not only to ensure that each memory device works with our controllers but also to determine its characteristics. Lifecycle testing provides information about how these characteristics change over time.
This data is used to configure the controller firmware for each flash memory and each use case, in order to maximize the reliability and lifetime of the memory.
Error-correcting code (ECC) is an important part of reliable data storage. However, we think it is more important to avoid errors in the first place, and that ECC should be considered a last resort.
Some of the error avoidance techniques implemented in the FlashXE ecosystem are described below.
Calibration
To read data from a flash memory cell, the output voltage has to be compared with a reference. In the case of multi-level cells, there will be multiple threshold levels for each of the different values the cell can store. Over the lifetime of the memory several factors will cause the voltage levels from the cells to change.
To ensure that the data is read correctly, the reference voltage must be adjusted to compensate for this. FlashXE controllers implement an efficient calibration process that maintains read performance throughout the lifetime of the memory.
Dynamic calibration reduces the rate of errors before the data even gets to the ECC.
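The idea of adjusting the reference can be illustrated with a minimal Python sketch. It is not the FlashXE calibration procedure: the function names, the millivolt values and the toy cell populations are all assumptions made for illustration. The sketch sweeps a few candidate reference voltages and keeps the one that misreads the fewest cells.

```python
from typing import Callable, Sequence


def calibrate_reference(
    candidates_mv: Sequence[int],
    count_bit_errors: Callable[[int], int],
) -> int:
    """Return the candidate reference (in mV) that yields the fewest bit errors."""
    return min(candidates_mv, key=count_bit_errors)


def make_error_counter(low_cells_mv, high_cells_mv):
    """Toy error model (assumption): a cell is misread when it falls on the
    wrong side of the reference voltage."""
    def count(v_ref_mv):
        wrong_low = sum(1 for v in low_cells_mv if v >= v_ref_mv)
        wrong_high = sum(1 for v in high_cells_mv if v < v_ref_mv)
        return wrong_low + wrong_high
    return count


# Hypothetical cell voltages; the "high" population has drifted downwards.
low_cells_mv = [900, 1000, 1100, 1200]
high_cells_mv = [1600, 1700, 1800, 1900]
counter = make_error_counter(low_cells_mv, high_cells_mv)

nominal_mv = 1800  # hypothetical factory reference, now too high
best_mv = calibrate_reference([1000, 1200, 1400, 1600, 1800, 2000], counter)

print(counter(nominal_mv), counter(best_mv), best_mv)
```

In this toy data the nominal reference misreads two of the drifted cells, while the calibrated reference misreads none, so fewer raw errors reach the ECC.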
RAID
An important technique to minimize the risk of data loss in mass storage devices is the use of a redundant array of independent disks (RAID). The data is distributed across several disk drives in such a way that if one drive fails, the data can be recovered from the remaining drives. There are a number of strategies for achieving this, such as mirroring, where the same data is written to two drives. More sophisticated techniques distribute the data across multiple drives (“striping”) and also store parity information so that the data can be recovered if one or more drives fail.
RAID was originally designed for distributing data across multiple, independent drives. However, the Hyperstone controller firmware can be used to implement RAID across multiple blocks or pages of a flash memory in order to increase reliability.
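As a rough illustration of parity-based recovery applied to pages rather than drives, the following Python sketch uses simple XOR parity. The page contents and function names are hypothetical, and the sketch does not describe the scheme used in the Hyperstone firmware; it only shows how one lost page can be rebuilt from the surviving pages and the parity page.

```python
from functools import reduce


def xor_pages(pages):
    """Bytewise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def make_parity(data_pages):
    """Parity page for a stripe of data pages."""
    return xor_pages(data_pages)


def rebuild_page(surviving_pages, parity):
    """Recover the single missing page of a stripe."""
    return xor_pages(list(surviving_pages) + [parity])


# Hypothetical stripe of three 4-byte pages.
pages = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = make_parity(pages)

# Pretend the middle page has become unreadable.
recovered = rebuild_page([pages[0], pages[2]], parity)
print(recovered == pages[1])
```

Single XOR parity tolerates the loss of one page per stripe; surviving more than one failure requires additional redundancy.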
Read disturb management
One potential cause of failures in flash memories is read disturb, where repeated reads from one cell will slightly change the programmed level of adjacent cells. This could eventually cause read errors from those cells.
FlashXE uses read disturb management to avoid such errors. The controller counts the number of reads from each block and when a threshold is reached the data is rewritten to another block. The threshold for this operation is determined during characterization and adjusted during the lifetime of the memory.
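The counting scheme described above can be sketched in a few lines of Python. The class name, the block number and the threshold value are assumptions for illustration; real thresholds come from characterization and are far larger.

```python
from collections import defaultdict


class ReadDisturbTracker:
    """Counts reads per block and flags blocks that are due for relocation."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.read_counts = defaultdict(int)

    def record_read(self, block):
        """Count one read; return True when the block should be rewritten."""
        self.read_counts[block] += 1
        return self.read_counts[block] >= self.threshold

    def block_relocated(self, block):
        """Forget the count once the data has been moved to another block."""
        self.read_counts.pop(block, None)


tracker = ReadDisturbTracker(threshold=3)  # tiny threshold, for the demo only
results = [tracker.record_read(7) for _ in range(3)]
print(results)

tracker.block_relocated(7)
```

The third read reaches the threshold, at which point a controller would copy the data to a fresh block and reset the counter.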
Error correction
Inevitably, some errors will occur and need to be corrected. Error-correcting codes that can correct multiple errors are used for this. To further improve the error correction capability, the controller uses a log likelihood ratio (LLR) table generated during characterization of the memory. This provides statistical information about the most likely correct value for each data bit.
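To make the LLR idea concrete, here is a small Python sketch that derives a table from characterization counts. The region names, the counts, the add-one smoothing and the sign convention (positive means a stored 0 is more likely) are all assumptions; the actual table format used by the controller is not described in this article.

```python
import math


def build_llr_table(counts):
    """Map each read region to ln(P(bit = 0) / P(bit = 1)).

    `counts` maps a region name to (cells that stored 0, cells that stored 1).
    """
    table = {}
    for region, (n_zero, n_one) in counts.items():
        # Add-one smoothing (an assumption) avoids taking the log of zero.
        table[region] = math.log((n_zero + 1) / (n_one + 1))
    return table


# Hypothetical characterization data for three read regions.
counts = {"A": (990, 10), "B": (500, 500), "C": (5, 995)}
llr = build_llr_table(counts)

for region, value in llr.items():
    print(region, round(value, 2))
```

A large positive or negative value tells the decoder that the bit is very likely correct as read, while a value near zero marks the bit as unreliable and a good candidate for correction.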
If an uncorrectable error is detected when reading data, the controller will retry the read several times using different reference voltages until the data is read successfully.
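A read-retry loop of this kind might look like the following Python sketch. The callback names, the offset values and the toy error model are hypothetical, and the sketch assumes the list of retry settings is finite, so it gives up with an error once every setting has been tried.

```python
def read_with_retry(read_page, decode, reference_offsets_mv):
    """Try each reference offset in turn until the ECC decode succeeds.

    Returns (data, offset_used). Raises RuntimeError if every offset fails.
    """
    for offset in reference_offsets_mv:
        raw = read_page(offset)
        ok, data = decode(raw)
        if ok:
            return data, offset
    raise RuntimeError("uncorrectable after all read-retry steps")


def fake_read_page(offset_mv):
    """Toy model (assumption): raw bit errors grow with the distance
    from an ideal offset of -40 mV."""
    return abs(offset_mv + 40) // 10


def fake_decode(raw_errors, correctable=2):
    """Toy decoder (assumption): succeeds when the error count is small enough."""
    return raw_errors <= correctable, b"payload"


data, used_offset = read_with_retry(fake_read_page, fake_decode, [0, -20, -40, 20])
print(data, used_offset)
```

Here the first attempt at the default offset fails, and the second offset brings the raw error count within what the toy decoder can correct.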
For many years, Bose-Chaudhuri-Hocquenghem (BCH) codes have been standard in flash memory systems. As feature sizes shrink and multiple levels are stored in each cell, the effectiveness of ECC must increase.
Hyperstone has chosen to use a generalized concatenated code (GCC) that uses an inner BCH code with an outer Reed-Solomon code. This provides a better level of error correction and maintains the advantage that the number of correctable errors can be analytically determined. As all flash memories have a guaranteed bit error rate, the use of GCC makes it possible to guarantee a specified level of reliable operation.
The controller tracks how frequently the ECC has to correct errors in each block. When the number of corrected bits passes a “near miss” threshold level, the block is refreshed. Rewriting the correct data reduces the risk of further data errors.
To ensure that infrequently-read data is not missed by this process, the controller uses a patented dynamic data refresh background task, which reads all the data in memory to check for error levels above the threshold.
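The near-miss check and the background scan can be sketched together in Python. This is only an illustration of the general idea, not the patented Hyperstone task: the function names, the callbacks, the threshold and the per-block correction counts are all assumed, as is the choice to refresh when the count reaches the threshold.

```python
def background_scan(all_blocks, read_and_correct, near_miss_threshold, refresh):
    """Read every block and refresh those whose corrected-bit count
    has reached the near-miss threshold. Returns the refreshed blocks."""
    refreshed = []
    for block in all_blocks:
        corrected_bits = read_and_correct(block)
        if corrected_bits >= near_miss_threshold:
            refresh(block)
            refreshed.append(block)
    return refreshed


# Hypothetical number of bits the ECC had to correct in each block.
corrected = {0: 1, 1: 30, 2: 5, 3: 41}
rewritten = []

refreshed = background_scan(
    all_blocks=sorted(corrected),
    read_and_correct=corrected.get,
    near_miss_threshold=30,
    refresh=rewritten.append,
)
print(refreshed)
```

Because the scan visits every block, data that the host rarely reads is checked as well, and blocks 1 and 3 are rewritten before their error counts grow beyond what the ECC can correct.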
Conclusion
The range of features in FlashXE can minimize the chance of data errors and provide reliable correction when they do occur. Other functions, such as write endurance management, will maximize the lifetime of the memory.
If you require reliable storage then Hyperstone’s flash controllers and FlashXE provide the support you need.