#1
SSD life self monitoring question
Do any SSDs use any pages to monitor the expected life of the product?
Pages in various physical locations could be set to known values. These pages would not be refreshed by the usual periodic rewrites or moving. As the device had data written to it, additional pages would start to be used for monitoring, selected by virtue of having already been rewritten an interesting number of times (say 10%, 20%, ... 100%, 110%, ... of the expected average rewrite lifetime for pages). The monitored pages would be checked every once in a while. If "enough" pages showed "enough" decay or needed "enough" error correction, then all of the pages that had been rewritten that many times or more, and which hadn't been refreshed for the same length of time or more, would have their data moved or refreshed in place. The SSD could be divided into areas by physical location on the device, with the "extra" rewrites done in each area based on the monitored pages within that area.

Simpler alternatives:
1. Only refresh a page when its read error rate exceeds the typical value for pages.
2. Only refresh a page when its read error rate indicates data will be lost soon, compared to typical values for pages.
3. Refresh everything that hasn't been refreshed in some amount of time. Perhaps this time is automatically adjusted based on experience with this particular device, or based on the current total number of writes for this particular device.

My question/proposal is about adding monitoring at a finer grain than the entire device.

NOTE: The manufacturers keep everything secret, so I can't guess how much the data loss rate would decrease, how much read speed would increase, or whether the average usable life in total data written by the user would increase or decrease.
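The monitor-page policy proposed above can be sketched in a few lines. This is a hypothetical illustration of the decision logic only; the thresholds, field names, and numbers are all invented for the example, and no vendor firmware is known to work this way.

```python
# Hypothetical sketch of the proposed monitor-page refresh policy.
# All thresholds and page fields are invented for illustration.

DECAY_FRACTION = 0.5     # "enough" monitor pages showing decay
ECC_BITS_LIMIT = 8       # "enough" error correction on one read

def needs_refresh_sweep(monitor_pages):
    """True if enough monitor pages needed enough error correction."""
    decayed = [p for p in monitor_pages
               if p["ecc_bits_corrected"] >= ECC_BITS_LIMIT]
    return len(decayed) >= DECAY_FRACTION * len(monitor_pages)

def pages_to_refresh(all_pages, monitor_pages, now):
    """Select pages rewritten at least as often as the decayed monitor
    pages and left unrefreshed for at least as long."""
    if not needs_refresh_sweep(monitor_pages):
        return []
    decayed = [p for p in monitor_pages
               if p["ecc_bits_corrected"] >= ECC_BITS_LIMIT]
    min_writes = min(p["write_count"] for p in decayed)
    min_age = min(now - p["last_refresh"] for p in decayed)
    return [p for p in all_pages
            if p["write_count"] >= min_writes
            and now - p["last_refresh"] >= min_age]
```

A real controller would track these statistics per block rather than in Python dictionaries, but the shape of the decision is the same: monitor pages trip a threshold, and the sweep refreshes every page at least as worn and at least as stale.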
It might be the case that refreshing everything once a month would be enough to greatly decrease read error correction time and greatly reduce data loss, while using less than 10% of the life of a device. (A 10-year design life means 120 refresh writes used.) Typical MLC life numbers for higher-quality devices are 1 full write/day for 5 years = 365*1*5 = 1825 average writes of the user-capacity amount. Even taking overprovisioning into account, you still have an average of more than 1200 writes/cell available. (These devices might actually have an expected life of about 3000 writes/cell.) Lower-quality devices are typically rated for 5 years and probably have a design life of 5 years also, but can be written only 0.1 times/day. This would indicate an expected average of only 180 or so writes/cell, but judging by the press, I think the expected average life is 700 or so. 60 periodic rewrites might "waste" 1/3 to 1/10 of the device life. Thus, I think that finer-grained monitoring could pay off for these devices.

I started thinking about this due to the Samsung 840 EVO performance drop, which turns out to have been related to excessive time taken by read error recovery of "old" data, as indicated by trade publications. I haven't seen a press release or page at www.samsung.com that confirms the problem is due to read error recovery, but here is a pointer to a description of the patch: https://www.samsung.com/global/busin...downloads.html under "Samsung SSD 840 EVO Performance Restoration Software".
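The write-budget arithmetic above can be checked directly. These are the post's own rough planning figures, not vendor specifications:

```python
# Checking the write-budget arithmetic from the figures above.

refresh_writes_10yr = 12 * 10      # monthly refresh, 10-year design life
print(refresh_writes_10yr)         # 120 writes used on refreshes

hq_writes = 365 * 1 * 5            # 1 full drive write/day for 5 years
print(hq_writes)                   # 1825 average writes of user capacity

lq_writes = 365 * 0.1 * 5          # 0.1 drive writes/day for 5 years
print(round(lq_writes))            # 182, i.e. "180 or so"

# A 5-year design life with monthly refreshes costs 60 rewrites,
# i.e. between 1/3 and roughly 1/10 of the device life:
print(round(60 / 180, 3))          # ~0.333 of a 180-cycle life
print(round(60 / 700, 3))          # ~0.086 of a 700-cycle life
```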
#2
Mark F wrote:
> Do any SSDs use any pages to monitor the expected life of the
> product? [snip] My question/proposal is about adding monitoring at a
> finer grain than the entire device. [snip]

You are talking about waning retentivity exhibited by magnetic storage media. Flash memory doesn't exhibit that defect. Oxide stress on the junctions during writes is what shortens their lifespans (and why reserved space is used to mask the bad spots, though that remapping slows the device, too). When the reserved space gets consumed, the device catastrophically fails.
The device has wear levelling algorithms (http://en.wikipedia.org/wiki/Solid-s...#Wear_leveling) to exercise different junctions for writes to reduce oxide stress on any particular junction (i.e., spread out the stress). That's why you don't defrag an SSD device.

You are also talking about MLC NAND flash memory. MLC (multi-level cell), used to increase density, results in less reliable reading because of the less distinct change between states. SLC (single-level cell) is the most reliable but the most costly. MLC gives more bits per package (i.e., you get more bytes for your buck) at the cost of performance and reliability. At Newegg.com, for example, you can find over 900 MLC products but only 1 SLC, and a 32GB MLC costs $60 versus a 32GB SLC at $550. http://en.wikipedia.org/wiki/Multi-level_cell

That software you mentioned is to apply a firmware update. It then realigns the data per whatever change the firmware made to the algorithm.

Using your scheme, the testing would be unreliable. There would be lots of reads that succeed (with or without correction) and then a failure. But the failure isn't permanent, and subsequent reads would succeed. MLC means less reliable reading; that's the nature of the beast, and it's why correction algorithms are especially needed for MLC. Testing one spot for its rate of read failures does not indicate what some other weaker or stronger junction may exhibit.
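The wear-leveling idea mentioned above amounts to steering each write at the least-worn available block so that no junction accumulates disproportionate oxide stress. A minimal sketch of that policy follows; real controllers are far more elaborate (static vs. dynamic leveling, garbage collection, hot/cold data separation), and the class and its fields here are invented for illustration.

```python
import heapq

# Minimal dynamic wear-leveling sketch (hypothetical): each write goes
# to the free physical block with the lowest erase count, spreading
# stress evenly across the device.

class WearLeveler:
    def __init__(self, num_blocks):
        # Min-heap of (erase_count, block_id) for free blocks.
        self.free = [(0, b) for b in range(num_blocks)]
        heapq.heapify(self.free)
        self.erases = [0] * num_blocks

    def write(self):
        """Erase/program the least-worn free block; return its id."""
        count, block = heapq.heappop(self.free)
        self.erases[block] = count + 1
        heapq.heappush(self.free, (count + 1, block))
        return block

wl = WearLeveler(4)
for _ in range(8):
    wl.write()
print(wl.erases)   # [2, 2, 2, 2] -- stress spread evenly
```

Without the heap ordering, repeated writes would land on the same block and wear it out while its neighbors stayed fresh; that is the failure mode the leveling avoids.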
#3
(This is meant as a reply to VanguardLH, rather than a post to the newsgroup, but he didn't supply an email address.)

On Thu, 16 Oct 2014 13:37:54 -0500, VanguardLH wrote:

(I've kept the relevant quoted text inline so that there is no need to look for old pieces of the discussion.)

> Mark F wrote:
>> Do any SSDs use any pages to monitor the expected life of the
>> product? [snip] My question/proposal is about adding monitoring at
>> a finer grain than the entire device. [snip]
>
> You are talking about waning retentivity exhibited by magnetic
> storage media.

No, I am talking about flash memory.
The charges leak away over time, and the speed of leakage from a given cell increases as the number of writes to the cell increases. The typical hidden "spec" that manufacturers used circa 2010 was that end of life for a device was reached when data would be lost after 1 year of unpowered storage. My guess is that many consumer devices now are designed for less than one year. I also guess that most manufacturers assume that SSDs, as contrasted with flash memory keys, are always powered on, so the powered-down data retention time for SSDs may be less than for flash memory keys. This may apply to both consumer and "enterprise" devices.

> Flash memory doesn't exhibit that defect.

Yes, the mechanism is different, but the limited storage time happens in both cases. Note that in 1995, 15 or even 100 years was considered the unpowered retention time, and the number of write cycles had been gradually increasing from a few hundred to 100,000 or more, even though the memory was getting denser. By 2010 (or perhaps a few years before), 1 year of powered-off retention was considered end of life for a flash memory key or similar device.

> Oxide stress on the junctions during writes is what shortens their
> lifespans ... The device has wear levelling algorithms
> (http://en.wikipedia.org/wiki/Solid-s...#Wear_leveling) to exercise
> different junctions for writes to reduce oxide stress on any
> particular junction (i.e., spread out the stress).

Wear leveling spreads out the stress of writing, not the stress of charge storage. Rewriting data does serve to refresh it, and many devices periodically scan, looking for data that needs refreshing.

> That's why you don't defrag an SSD device.

I don't think it is true that you should never defrag an SSD (or a flash memory key for that matter, but let us just talk about SSDs).
Why would you defrag? To increase read speed by reducing seek delays and the number of I/O operations for large transfers. If you look at the performance of most consumer SSDs, you will find that they do in fact act as if they have seek times. (It is possible that the pseudo-"seek" times are due to extra operations within the device because the data is not contiguous on the flash even though it appears contiguous from the user's point of view. There may be other factors affecting the speed of the device, but they do look like seek times.) If the user view is fragmented, there will be more overhead in the operating system and more I/O operations to the device. Current defrag programs get rid of the extra I/O operations, but not the pseudo-"seek" times inside the device. I have suggested that someone work with manufacturers to make a defrag program that defrags the data as actually stored in the SSD.

Well, you say, spinning disks have 10 millisecond access times and consumer SSDs have 0.1 millisecond access times, so why bother? The answer is that many consumer SSDs will jump to 0.2 or even 0.3 millisecond access times as things get fragmented, which is likely to reduce performance to 1/2 or even 1/3 of the ideal. I feel that for consumer products the possible gain from defragging, even using a program that just defrags the user view, is worth it if I have seen a performance drop. Things probably won't get any more fragmented on the device and might get somewhat defragmented. My preferred technique for data disks is to copy to a new device, since that lets me put aside the old device as a backup, defragments things, and decreases the size of the NTFS Master File Table. Most SSDs also do a pretty good job of keeping fragmentation on the device low. (For system disks, I might do a clone, defrag the clone, then clone to a third device to use as my new system disk. The extra clone operation is so that if the defrag messes up I still have the original.)
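The write cost of defragmenting a couple of times a year is small against the endurance figures in this thread. Using the poster's numbers (the write amplification factor of 3 is an assumption in the post, not a measured value):

```python
# Extra whole-disk writes consumed by defragmenting, per the thread's
# figures. WAF of 3 is the poster's assumption, not a measurement.

defrags_per_year = 2
assumed_waf = 3
disk_life_years = 3

occasional = defrags_per_year * assumed_waf * disk_life_years
print(occasional)                 # 18 extra whole-disk writes

monthly = 12 * assumed_waf * disk_life_years
print(monthly)                    # 108 extra writes at once/month

# Against a worst-case consumer endurance of ~700 writes/cell:
print(f"{occasional / 700:.1%}")  # 2.6%
print(f"{monthly / 700:.1%}")     # 15.4%
```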
My disks (spinning and system) have lots of free space, so:
1. They don't get very fragmented; typically I defragment less than 2 times per year.
2. Defragmenting programs don't have to write the data more than a time or two. (I don't know what the Write Amplification Factor is.)

(2 times/year * WAF of 3 * new disks every 3 years) = about 18 writes. The worst-case life that I have heard of for consumer devices is about 700 writes/cell, so I don't expect 18 extra writes, or even 108 extra (for once/month), to be a big deal.

So far, I only use SSDs for my system disks and for testing. I use SpinRite on my backup drives every 6 months, but I am getting concerned that 6 months may be too long for consumer SSDs in a powered-off state. I use about 20 SSDs and about 200 spinning drives for backups.

I don't know the performance numbers for "enterprise" SSDs. Will, for example, fragmentation reduce the number of I/Os from 200,000 per second to 100,000 or even 75,000 per second, or will the number remain above 150,000 per second?

> You are also talking about MLC NAND Flash memory. MLC ... gives more
> bits per package (i.e., you get more bytes for your buck) at the
> cost of performance and reliability.

The cost to chip manufacturers per bit of MLC is about 1/2 that of SLC, and per bit of TLC about 1/3 that of SLC. However, the price to users for SLC is typically about 10 times the price to users of TLC. "Enterprise" devices cost users more per bit, but I haven't looked at the SLC/MLC/TLC ratios for users.

> At Newegg.com, for example, you can find over 900 MLC products but
> only 1 SLC, and a 32GB MLC costs $60 versus a 32GB SLC at $550.
> http://en.wikipedia.org/wiki/Multi-level_cell
>
> That software you mentioned is to apply a firmware update.
> Then it realigns the data per whatever change in algorithm the
> firmware changed.

Yes.

> Using your scheme, the testing would be unreliable. There would be
> lots of reads that succeed (with or without correction) and then a
> failure. But the failure isn't permanent and subsequent reads would
> succeed. ... Testing one spot for rate of read failures does not
> indicate what some other weaker or stronger junction may exhibit.

I indicated that there might be nothing to be gained by monitoring each chip, each array, or even at finer granularity; I just thought that the manufacturers should consider how local the monitoring has to be. My main point, however, was that looking at the (correctable) error rate being seen is not good enough: I feel that a better device-lifetime estimate can be made by seeing how retention time varies with the number of writes at a given location on the actual device, not just on other devices with the same technology or from the same batch.

With rewrite lifetimes of 100,000 and increasing with new technology, and retention time staying roughly constant at more than the time the device needed to live, relying on batch or process parameters was fine. With 700 cycles and decreasing, and an expected power-off retention time of 1 year or less and decreasing, closer monitoring is needed. Using 1% of the device for test cells to find problems early seems worth it, even though it would reduce spares from 10% or 20% to 9% or 19%. 1% of 1TB is 10GB, so it sounds like a lot, but it isn't really.
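The space overhead of the test-cell proposal above is easy to quantify, using the post's own 1 TB example:

```python
# Space cost of dedicating 1% of a device to monitor/test cells,
# using the 1 TB example from the post above.

capacity_gb = 1000
test_fraction = 0.01

print(capacity_gb * test_fraction)   # 10.0 GB of test cells

# The spare pool shrinks by the same one percentage point:
for spares in (0.10, 0.20):
    print(f"{spares:.0%} -> {spares - test_fraction:.0%}")
# 10% -> 9%
# 20% -> 19%
```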