#2 - October 16th 14, 07:37 PM, posted to comp.sys.ibm.pc.hardware.storage
From: VanguardLH[_2_]
Subject: SSD life self monitoring question

Mark F wrote:

Do any SSDs use any pages to monitor the expected life of the product?

Pages in various physical locations could be set to known values.
These pages would not be refreshed by the usual periodic rewrites
or moving.

As the device has data written to it, additional pages would start
to be used for monitoring. The additional pages would be selected
by virtue of having already been rewritten an interesting number
of times (say 10%, 20% ... 100%, 110% ... of the expected average
rewrite lifetime for pages).

The pages being monitored would be checked every once in a while.
If "enough" pages showed "enough" decay or needed "enough"
error correction, then all of the pages that had been
rewritten that many times or more, and which hadn't been refreshed
for the same length of time or more, would have their data moved
or refreshed in place. The SSD could be divided
into areas depending on physical location on the device, and
the "extra" rewrites done in each area based on monitored pages
within the area.
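
A rough sketch of that scheme in Python (the wear buckets, alarm
thresholds, and error model are all invented here just to make the
steps concrete; no real controller exposes such an interface):

import random
import time
from dataclasses import dataclass

RATED_PE_CYCLES = 3000        # assumed average rewrite lifetime per page
WEAR_BUCKETS = [0.1, 0.2, 0.5, 1.0, 1.1]   # fractions of rated life to sample
ECC_ALARM_BITS = 8            # corrected bits per read that counts as "decayed"
ALARM_FRACTION = 0.25         # share of a bucket's canaries that must alarm

@dataclass
class Page:
    erase_count: int          # how many times this page has been rewritten
    last_refresh: float       # timestamp of the last rewrite of its data
    is_canary: bool = False   # holds a known pattern, never refreshed

def corrected_bits(page: Page) -> int:
    """Stand-in for the ECC statistics a controller would have per read."""
    age_days = (time.time() - page.last_refresh) / 86400
    wear = page.erase_count / RATED_PE_CYCLES
    # Toy model: more wear and more age -> more bit errors, on average.
    return int(random.expovariate(1.0 / (1.0 + wear * age_days)))

def check_canaries(pages: list[Page]) -> None:
    """Periodic check: if enough canaries at a wear level look bad,
    refresh everything at least that worn and at least that stale."""
    for frac in WEAR_BUCKETS:
        bucket = [p for p in pages
                  if p.is_canary and p.erase_count >= frac * RATED_PE_CYCLES]
        if not bucket:
            continue
        alarms = [p for p in bucket if corrected_bits(p) >= ECC_ALARM_BITS]
        if len(alarms) >= ALARM_FRACTION * len(bucket):
            stale_cutoff = max(p.last_refresh for p in alarms)
            refresh(pages, frac * RATED_PE_CYCLES, stale_cutoff)

def refresh(pages: list[Page], min_erases: float, refreshed_before: float) -> None:
    """Rewrite every non-canary page as worn and as stale as the alarming
    canaries (a real drive would move the data and update its mapping)."""
    for p in pages:
        if (not p.is_canary and p.erase_count >= min_erases
                and p.last_refresh <= refreshed_before):
            p.last_refresh = time.time()

The knobs that matter are the number of canaries per wear bucket and
ALARM_FRACTION; with too few canaries the sample says little about the
other pages in the same area.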

Simpler alternatives:
1. only refresh a page when its read error rate exceeds the
typical value for pages
2. only refresh a page when its read error rate indicates data
will be lost soon, compared to typical values for pages
3. refresh everything that hasn't been refreshed in some amount
of time (a rough sketch of this follows below). Perhaps this
time is automatically changed based on experience for this
particular device. Perhaps the time interval is based on the
current total number of writes for this particular device.
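
A minimal sketch of alternative 3, assuming the interval simply
tightens as total device writes accumulate (the 90-day and 30-day
figures are invented):

from collections import namedtuple

Page = namedtuple("Page", "address last_refresh")   # last_refresh: epoch seconds

SECONDS_PER_DAY = 86400

def refresh_interval(total_device_writes: float, rated_device_writes: float) -> float:
    """Refresh every ~90 days on a new device, tightening toward ~30 days
    as the device burns through its rated number of full writes."""
    wear = min(total_device_writes / rated_device_writes, 1.0)
    return (90 - 60 * wear) * SECONDS_PER_DAY

def pages_to_refresh(pages, now, total_device_writes, rated_device_writes):
    """Return every page whose data is older than the current interval."""
    limit = refresh_interval(total_device_writes, rated_device_writes)
    return [p for p in pages if now - p.last_refresh > limit]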

My question/proposal is about adding monitoring at a
finer grain than the entire device.

NOTE:
The manufacturers keep everything secret, so I
can't guess how much the data loss rate would decrease,
how much the read speed would increase, or whether the average
usable life (in total data written by the user) would increase or decrease.

It might be the case that refreshing everything once a month
would be enough to greatly decrease read error correction
time and greatly reduce data loss, while at the same time using
less than 10% of the life of a device.
(A 10 year design life means 120 refresh writes used.
Typical MLC life numbers for higher quality devices
are 1 full drive write/day for 5 years = 365*1*5 = 1825 average
writes of the user capacity. Even if you take into account
over-provisioning, you still have an average of more
than 1200 writes/cell available. These devices might actually
have an expected life of about 3000 writes/cell.)

Lower quality devices are typically rated for 5 years and
probably have a design life of 5 years also. These can
be written 0.1 full drive writes/day. That would indicate
an expected average of only about 180 writes/cell, but judging
by the press, I think the expected average life is more
like 700. 60 periodic rewrites might "waste"
1/3 to 1/10 of the device life. Thus, I think that
finer grained monitoring could pay off for these devices.
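
Spelling out the arithmetic above (the drive-writes-per-day ratings
are the ones quoted; the rest is multiplication):

def rated_full_writes(dwpd: float, years: float) -> float:
    """Rated number of full-capacity writes over the warranty period."""
    return dwpd * 365 * years

monthly_refreshes_10y = 12 * 10      # 120 extra full rewrites over 10 years
monthly_refreshes_5y = 12 * 5        # 60 extra full rewrites over 5 years

higher_quality = rated_full_writes(dwpd=1.0, years=5)   # 1825 full writes
lower_rated = rated_full_writes(dwpd=0.1, years=5)      # ~182 full writes
lower_guess = 700                     # the post's guess from the trade press

print(monthly_refreshes_10y / higher_quality)  # ~0.066 -> under 10% of rated life
print(monthly_refreshes_5y / lower_rated)      # ~0.33  -> roughly 1/3
print(monthly_refreshes_5y / lower_guess)      # ~0.086 -> roughly 1/10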

I started thinking about this due to the
Samsung 840 EVO performance drop, which, according to
trade publications, turned out to be related to excessive
time taken by read error recovery of "old" data.

I haven't seen a press release or page at www.samsung.com that
confirms that the problem is due to read error recovery,
but here is a pointer to a description of the patch:
https://www.samsung.com/global/busin...downloads.html
at: "Samsung SSD 840 EVO Performance Restoration Software"


You are talking about waning retentivity exhibited by magnetic storage
media. Flash memory doesn't exhibit that defect. Oxide stress on the
junctions during writes is what shortens their lifespans (which is why
reserved space is used to mask the bad spots, though that remapping
slows the device, too). When the reserved space gets consumed, the device
catastrophically fails. The device has wear levelling algorithms
(http://en.wikipedia.org/wiki/Solid-s...#Wear_leveling) to
exercise different junctions for writes to reduce oxide stress on any
particular junction (i.e., spread out the stress). That's why you don't
defrag an SSD device.
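
As a toy illustration only, dynamic wear levelling boils down to
steering each write to the least-worn free block; a sketch (real
controllers track far more state and also relocate cold data):

class WearLeveler:
    """Toy dynamic wear leveller: every logical write lands on the free
    physical block with the fewest erases, spreading oxide stress around.
    Assumes more physical than logical blocks (i.e., some spare area)."""

    def __init__(self, physical_blocks: int):
        self.erase_counts = [0] * physical_blocks
        self.mapping = {}                      # logical block -> physical block

    def write(self, logical_block: int) -> int:
        # Blocks holding other logical data are off limits; the block this
        # logical address previously used is allowed to be reused.
        in_use = set(self.mapping.values()) - {self.mapping.get(logical_block)}
        free = [b for b in range(len(self.erase_counts)) if b not in in_use]
        target = min(free, key=lambda b: self.erase_counts[b])
        self.erase_counts[target] += 1         # each (re)write erases the block
        self.mapping[logical_block] = target
        return target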

You are also talking about MLC NAND Flash memory. MLC (multi-level cell),
used to increase density, results in less reliable reading because the
changes between states are less distinct. SLC (single-level cell) is the
most reliable but the most costly. MLC gives more bits per package (i.e.,
you get more bytes for your buck) at the cost of performance and
reliability. At Newegg.com, for example, you can find over 900 MLC
products but only 1 SLC, and a 32GB MLC costs $60 versus a 32GB SLC at $550.

http://en.wikipedia.org/wiki/Multi-level_cell

The software you mentioned applies a firmware update and then rewrites
the data according to whatever algorithm change the new firmware introduced.

Using your scheme, the testing would be unreliable. There would be lots
of reads that succeed (with or without correction) and then a failure,
but the failure isn't permanent and subsequent reads would succeed. MLC
means less reliable reading. That's the nature of the beast, and it's
why correction algorithms are especially needed for MLC. Testing one
spot for its rate of read failures does not indicate what some other
weaker or stronger junction may exhibit.