Flash Memory Reliability: The Expected and the Unexpected. Part 2. XIV Conference of the USENIX Association. File Storage Technologies

Flash Memory Reliability: The Expected and the Unexpected. Part 1. XIV Conference of the USENIX Association. File Storage Technologies

4.2.2. RBER and disk age (excluding PE cycles).

Figure 1 shows a significant correlation between RBER and age, measured as the number of months a drive has been in the field. However, this may be a spurious correlation: older drives are likely to have accumulated more PE cycles, so RBER may really be tracking PE cycles rather than age.

To separate the effect of age from the effect of PE-cycle wear, we grouped all drive months into bins, using the deciles of the PE cycle distribution as the cutoffs between bins. For example, the first bin contains all drive months up to the first decile of the PE cycle distribution, and so on. We verified that within each bin the correlation between PE cycles and RBER is negligible (because each bin covers only a narrow range of PE cycles), and then computed the correlation coefficient between RBER and drive age separately for each bin.

We performed this analysis separately for each model, so that any observed correlations reflect the age of drives of the same model rather than differences between younger and older models. Even after controlling for PE cycles in this way, every drive model still showed a significant correlation between the number of months a drive had been in the field and its RBER (correlation coefficients between 0.2 and 0.4).
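The decile-based procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the tuple layout `(pe_cycles, age_months, rber)`, the function names, and the use of Spearman rank correlation as the per-group coefficient are all assumptions made for the example.

```python
from statistics import quantiles


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def age_rber_correlation_by_pe_bin(drive_months):
    """drive_months: (pe_cycles, age_months, rber) tuples for ONE drive model.

    Returns ten values, one per PE-cycle decile group: the correlation
    between age and RBER inside that group, or None if the group is too small.
    """
    cutoffs = quantiles([m[0] for m in drive_months], n=10)  # 9 cut points
    groups = [[] for _ in range(10)]
    for pe_cycles, age, rber in drive_months:
        index = sum(pe_cycles > c for c in cutoffs)  # 0..9
        groups[index].append((age, rber))
    return [
        spearman([a for a, _ in g], [r for _, r in g]) if len(g) >= 3 else None
        for g in groups
    ]
```

Because every group spans only a narrow range of PE cycles, a correlation that survives inside the groups cannot be attributed to PE-cycle wear alone.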

Figure 3. RBER as a function of PE cycles for young and old drives, showing that drive age affects RBER independently of the wear caused by PE cycles.

We also visualized the effect of drive age graphically by splitting drive days into those from "young" drives (up to 1 year old) and those from "old" drives (more than 4 years old), and then plotting the RBER of each group against the number of PE cycles. Figure 3 shows the results for the MLC-D drive model. There is a marked difference in RBER between the old and young groups across the entire range of PE cycles.

From this we conclude that age, measured in days of field use, has a significant effect on RBER independently of the cell wear caused by PE cycles. This means that other factors, such as silicon aging, play a significant role in the physical wear of the drive.

4.2.3. RBER and workload.

Bit errors are believed to be caused by one of four mechanisms:

  1. Retention errors, in which a memory cell loses its data over time;
  2. Read disturb errors, in which a read operation corrupts the contents of an adjacent cell;
  3. Write disturb errors, in which a write operation corrupts the contents of an adjacent cell;
  4. Incomplete erase errors, in which an erase operation does not fully clear the contents of the cell.

The last three types of errors (read disturb, write disturb, incomplete erase) correlate with workload, so understanding the correlation between RBER and workload helps us understand the prevalence of the different error mechanisms. A recent field study (MEZA, J., WU, Q., KUMAR, S., MUTLU, O. "A large-scale study of flash memory failures in the field". In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, New York, 2015, SIGMETRICS '15, ACM, pp. 177–190) concluded that retention errors dominate in the field, while read disturb errors are quite minor.

Figure 1 shows a significant relationship between the RBER in a given drive month and the number of reads, writes, and erases in the same month for some models (for example, the correlation coefficient is higher than 0.2 for the MLC-B model and higher than 0.6 for the SLC-B model). However, this may be a spurious correlation, as the monthly workload may be related to the total number of PE cycles.

We used the same methodology as described in Section 4.2.2 to isolate the effects of workload from the effects of PE cycles: we grouped drive months by the number of PE cycles accumulated so far and then determined the correlation coefficients separately for each group.

We found that the correlation between the number of reads in a given drive month and the RBER in that month persists for the MLC-B and SLC-B models even when controlling for PE cycles. We also repeated a similar analysis in which we isolated the effect of read operations from the number of concurrent writes and erases, and concluded that the correlation between RBER and the number of reads persists for the SLC-B model.

Figure 1 also shows a correlation between RBER and writes and erases, so we repeated for writes and erases the same analysis we performed for reads. We concluded that, when controlling for PE cycles and reads, there is no significant relationship between RBER and the number of writes and erases.

Thus, there are drive models for which read disturb errors have a significant impact on RBER. On the other hand, there is no evidence that RBER is affected by write disturb errors or incomplete erase errors.

4.2.4. RBER and lithography.

Differences in feature size may partly explain the differences in RBER between drive models that use the same technology, i.e. MLC or SLC. (See Table 1 for an overview of the lithography of the various models in this study.)

For example, the 2 SLC models with 34nm lithography (SLC-A and SLC-D) have an RBER an order of magnitude higher than the 2 models with 50nm lithography (SLC-B and SLC-C). Among the MLC models, the only 43nm model (MLC-B) has a median RBER 50% higher than that of the other 3 models, which use 50nm lithography. Moreover, this difference in RBER grows to a factor of 4 as the drives wear out, as shown in Figure 2. Finally, smaller lithography may explain the higher RBER of eMLC drives compared with MLC drives. Overall, we find clear evidence that lithography affects RBER.

4.2.5. Presence of other errors.

We investigated the relationship between RBER and other types of errors, such as uncorrectable errors and timeout errors; in particular, we asked whether RBER becomes higher in the month after a drive has experienced other types of errors.

Figure 1 shows that while the previous month's RBER predicts future RBER values (correlation coefficient above 0.8), there is no significant correlation between uncorrectable errors and RBER (rightmost group of bars in Figure 1). For other types of errors, the correlation coefficient is even lower (not shown in the figure). We explore the relationship between RBER and uncorrectable errors further in Section 5.2 of this article.

4.2.6. Influence of other factors.

We found evidence that some factors with a significant impact on RBER are not captured by our data. In particular, we noticed that the RBER of a given drive model varies depending on the cluster in which the drive is deployed. A good example is Figure 4, which shows RBER as a function of PE cycles for MLC-D drives in three different clusters (dashed lines) and compares it with the RBER for this model across the entire drive population (solid line). We found that these differences persist even when we control for factors such as drive age or read count.

One possible explanation for this factor is the differences in workload type across clusters, as we observe that clusters whose workload has the highest read/write ratios have the highest RBER.

Figure 4 (a), (b). Median RBER versus PE cycles across three different clusters, and read/write ratio versus PE cycles across three different clusters.

For example, Figure 4(b) shows the read/write ratios of different clusters for the MLC-D drive model. However, the read/write ratio does not explain the differences between clusters for all models, so there may be other factors that our data does not account for, such as environmental factors or other external workload parameters.

4.3. RBER during accelerated durability testing.

Most academic work, as well as testing done during procurement in industry, predicts the field reliability of devices from the results of accelerated life tests. We set out to determine how well the results of such tests match practical field experience with solid-state storage.
An analysis of test results obtained with a common accelerated testing methodology for devices supplied to Google data centers showed that field RBER values were significantly higher than predicted. For example, for the eMLC-a model, the median RBER for drives in the field (which had reached 600 PE cycles by the end of data collection) was 1e-05, whereas the preliminary accelerated tests predicted that this RBER would not be reached until more than 4000 PE cycles. This indicates that it is very difficult to predict field RBER accurately from laboratory RBER estimates.

We also noted that some types of errors are difficult to reproduce during accelerated tests. For example, in the case of the MLC-B model, nearly 60% of drives in the field have uncorrectable errors and nearly 80% of drives have bad blocks. However, during accelerated endurance testing, none of the six devices experienced any uncorrectable errors until the drives reached more than three times the PE cycle limit. For eMLC models, more than 80% of drives experienced uncorrectable errors in the field, while during accelerated testing, such errors occurred after reaching 15000 PE cycles.

We also looked at the RBER reported in previous research papers, which were based on experiments in a controlled environment, and concluded that the range of reported values is extremely wide. For example, L.M. Grupp and others, in papers from 2009-2012, report RBER values for drives that are close to reaching their PE cycle limits. For SLC and MLC devices with feature sizes similar to those in our work (25-50nm), the RBER ranges from 1e-08 to 1e-03, with most of the tested drive models having an RBER close to 1e-06.

In our study, the three drive models that reached the PE cycle limit had an RBER ranging from 3e-08 to 8e-08. Even taking into account that our numbers are lower bounds and in the absolute worst case could be 16 times higher, or looking at the 95th percentile of RBER, the values we obtain are still significantly lower.

In general, while actual field RBER values are higher than those predicted from accelerated life tests, they are still lower than most RBER values reported for similar devices in other research papers and derived from laboratory tests. This means that field RBER predictions derived from accelerated life tests should not be relied upon.

5. Uncorrectable errors.

Given the high prevalence of uncorrectable errors (UEs), discussed in Section 3 of this article, we examine their characteristics in more detail in this section. We start by discussing which metric to use to measure UEs, how they relate to RBER, and how they are affected by various factors.

5.1. Why UBER doesn't make sense.

The standard metric used to characterize uncorrectable errors is the uncorrectable bit error rate (UBER), that is, the ratio of the number of uncorrectable bit errors to the total number of bits read.

This metric implicitly assumes that the number of uncorrectable errors is somehow tied to the number of bits read, and therefore must be normalized by this number.

This assumption is valid for correctable errors: the number of correctable errors observed in a given month is strongly correlated with the number of reads in the same period (Spearman's correlation coefficient is greater than 0.9). The reason for this strong correlation is that even a single corrupted bit, as long as it is correctable by ECC, keeps adding to the error count with every read that touches it, because the cell holding the corrupted bit is not rewritten immediately when the error is detected (drives only periodically rewrite pages with corrupted bits).

The same assumption does not hold for uncorrectable errors. An uncorrectable error rules out further use of the damaged block, so once it has been detected, that block no longer contributes to future error counts.

To formally test this assumption, we used several metrics to measure the relationship between the number of reads in a given drive month and the number of uncorrectable errors in the same period, including various correlation coefficients (Pearson, Spearman, Kendall) as well as visual inspection of plots. In addition to the number of uncorrectable errors, we also looked at the incidence of uncorrectable errors (for example, the probability that a drive will have at least one such incident in a given period) and its relationship to reads.
We found no evidence of a correlation between the number of reads and the number of uncorrectable errors. For all drive models, the correlation coefficients were below 0.02, and the plots showed no increase in UEs as the number of reads increased.
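As an illustration of one of the rank-based measures mentioned above, here is a minimal sketch of Kendall's tau together with a helper for the incidence of uncorrectable errors. The function names are hypothetical, and the choice of the tau-b variant (which adjusts for ties, useful because most drive months have zero UEs) is an assumption; the text does not say which variant was used.

```python
from collections import Counter
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    ties_x = sum(t * (t - 1) // 2 for t in Counter(x).values())
    ties_y = sum(t * (t - 1) // 2 for t in Counter(y).values())
    denom = ((n_pairs - ties_x) * (n_pairs - ties_y)) ** 0.5
    if denom == 0:
        return float("nan")  # one variable is constant
    return (concordant - discordant) / denom


def ue_incidence(monthly_ue_counts):
    """Fraction of drive months with at least one uncorrectable error."""
    return sum(1 for c in monthly_ue_counts if c > 0) / len(monthly_ue_counts)
```

In this sketch, `x` would hold the monthly read counts and `y` the monthly UE counts for one drive model; a value near zero corresponds to the absence of correlation reported above.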

In Section 5.4 of this article, we show that writes and erases are also unrelated to uncorrectable errors, so an alternative definition of UBER that normalizes by writes or erases instead of reads would be equally meaningless.

Therefore, we conclude that UBER is not a meaningful metric, except perhaps for testing in controlled environments where the number of reads is set by the experimenter. If UBER is used as a metric during field testing, it will artificially reduce the error rate for drives with high read counts and artificially increase the error rate for drives with low read counts, since uncorrectable errors occur regardless of the number of reads.
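The distortion is easy to see with a small worked example. The sketch below uses two hypothetical drives with made-up numbers (not taken from the study) that experience the same number of uncorrectable errors but differ in read volume by a factor of 1000.

```python
def uber(uncorrectable_bit_errors, bits_read):
    """UBER = uncorrectable bit errors / total bits read."""
    if bits_read <= 0:
        raise ValueError("bits_read must be positive")
    return uncorrectable_bit_errors / bits_read


# Hypothetical drives: same number of uncorrectable errors, different read load.
drives = {
    "light_reader": (2, 1e12),  # (uncorrectable bit errors, bits read)
    "heavy_reader": (2, 1e15),
}
rates = {name: uber(errors, bits) for name, (errors, bits) in drives.items()}

for name, rate in rates.items():
    print(f"{name}: UBER = {rate:.1e}")
```

Both drives failed equally often, yet the heavily read drive appears a thousand times more reliable by UBER, purely because of the denominator.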

5.2. Uncorrectable errors and RBER.

RBER matters because it is used as a measure of overall drive reliability, in particular as a predictor of the likelihood of uncorrectable errors. N. Mielke and others (2008) were the first to propose deriving the expected uncorrectable error rate as a function of RBER. Since then, many system designers have used similar methods, for example estimating the expected uncorrectable error rate as a function of RBER and the type of ECC.

The purpose of this section is to characterize how well RBER predicts uncorrectable errors. Let's start with Figure 5a, which plots, for a number of first-generation drive models, the median RBER against the fraction of drive days with uncorrectable errors (UEs). Note that some of the 16 models shown in the graph are not included in Table 1 because of a lack of analytical information.

Figure 5a. Relationship between median RBER and uncorrectable errors for various drive models.

Figure 5b. Relationship between median RBER and uncorrectable errors for different drives of the same model.

Recall that all models within the same generation use the same ECC mechanism, so differences between models are independent of ECC differences. We did not see a correlation between RBER and UE incidents. We created the same plot for the 95th percentile of RBER versus UE and again saw no correlation.

Next, we repeated the analysis at the granularity of individual drives, i.e. we tried to find out whether drives with a higher RBER also have a higher frequency of UEs. As an example, Figure 5b plots the median RBER of each MLC-c drive against its number of UEs (the results for the 95th percentile of RBER are similar). Again, we did not see any correlation between RBER and UEs.

Finally, we performed a finer-grained time analysis to see whether drive months with higher RBER coincide with the months in which UEs occurred. Figure 1 has already indicated that the correlation coefficient between uncorrectable errors and RBER is very low. We also experimented with different ways of plotting the probability of UEs as a function of RBER and did not find any sign of a correlation.

Thus, we concluded that RBER is not a reliable metric for predicting UEs. This may mean that the failure mechanisms behind RBER are different from those behind uncorrectable errors (e.g., errors confined to individual cells versus larger problems affecting the entire device).

5.3. Uncorrectable errors and wear and tear.

Since wear is one of the major concerns with flash memory, Figure 6 shows the daily probability of a drive experiencing uncorrectable errors as a function of PE cycles.

Figure 6. Daily probability of uncorrectable drive errors versus PE cycles.

We note that the probability of UE increases continuously with the age of the drive. However, as with RBER, the growth is slower than is usually assumed: the graphs show that UEs grow linearly with PE cycles, not exponentially.
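One simple way to check whether a curve is closer to linear or to exponential growth is to fit both shapes and compare the residuals. The sketch below is an illustration only, not the method used in the study (which relies on inspecting the plots); the function names are hypothetical and any data passed in would have to come from the reader's own measurements.

```python
from math import exp, log


def least_squares(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def sse_linear(xs, ys):
    """Sum of squared errors of the best linear fit y = a + b * x."""
    slope, intercept = least_squares(xs, ys)
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def sse_exponential(xs, ys):
    """Sum of squared errors of y = exp(a + b * x), fitted on log(y).

    Requires every y to be strictly positive.
    """
    slope, intercept = least_squares(xs, [log(y) for y in ys])
    return sum((y - exp(intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
```

Here `xs` would be PE cycle counts and `ys` the daily UE probability at each count; the shape with the smaller sum of squared errors describes the data better.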

The two conclusions we drew for RBER also apply to UEs. First, there is no sharp increase in the probability of errors once the PE cycle limit is reached; see, for example, the MLC-D model in Figure 6, whose PE cycle limit is 3000. Second, the error rate varies between models, even within the same class. However, these differences are not as large as for RBER.

Finally, in support of our findings in Section 5.2, we found that within the same class of models (MLC vs. SLC), the models with the lowest RBER for a given number of PE cycles are not necessarily those with the lowest probability of UEs. For example, at 3000 PE cycles, the MLC-D model had an RBER 4 times lower than that of the MLC-B model, yet its probability of UEs at the same number of PE cycles was slightly higher than that of the MLC-B model.

Figure 7. Monthly probability of occurrence of uncorrectable drive errors as a function of the presence of previous errors of various types.

5.4. Uncorrectable errors and workload.

For the same reasons that workload can affect RBER (see Section 4.2.3), it can also be expected to affect UEs. For example, since we have observed that read disturb errors affect RBER, reads might also increase the chance of uncorrectable errors.

We conducted a detailed study of the impact of workload on UEs. However, as noted in Section 5.1, we did not find a relationship between UEs and the number of reads. We repeated the same analysis for write and erase operations and again saw no correlation.
Note that at first glance this seems to contradict our earlier observation that uncorrectable errors correlate with PE cycles, from which one might well expect a correlation with the number of writes and erases.

However, in our analysis of the impact of PE cycles, we compared the number of uncorrectable errors in a given month with the total number of PE cycles the drive had accumulated over its lifetime up to that point, in order to measure the effect of wear. When studying the impact of workload, we looked at whether drive months with a higher number of reads/writes/erases in that particular month also had a higher chance of uncorrectable errors in that month, i.e. we did not consider the cumulative number of reads/writes/erases.
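The difference between the two quantities can be made concrete with a short sketch. The function name and the per-month numbers are hypothetical and chosen only to show that a month with little activity can still be the month with the most accumulated wear.

```python
from itertools import accumulate


def wear_vs_workload(monthly_pe_cycles):
    """Pair each month's own PE cycle count with the lifetime total so far."""
    cumulative = accumulate(monthly_pe_cycles)
    return [
        {"month": month, "monthly_pe": monthly, "cumulative_pe": total}
        for month, (monthly, total) in enumerate(
            zip(monthly_pe_cycles, cumulative), start=1
        )
    ]


# Hypothetical drive: PE cycles consumed in each of four months.
rows = wear_vs_workload([50, 10, 80, 5])
for row in rows:
    print(row)
```

In this example the fourth month has the lowest monthly workload but the highest cumulative wear, which is why a correlation with lifetime PE cycles does not imply a correlation with the workload of the same month.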

As a result, we concluded that read disturb errors, write disturb errors, and incomplete erase errors are not the main factors in the development of uncorrectable errors.


Source: habr.com
