SSD life self monitoring question



 
 
#1 - Mark F - October 16th 14, 03:44 PM - posted to comp.sys.ibm.pc.hardware.storage

Do any SSDs use any pages to monitor the expected life of the product?

Pages in various physical locations could be set to known values.
These pages would not be refreshed by the usual periodic rewrites
or moving.

As the device had data written to it, additional pages would start
to be used for monitoring. The additional pages would be selected
by virtue of having already been rewritten an interesting number
of times (say 10%, 20%, ... 100%, 110%, ... of the expected average
rewrite lifetime for pages).

The pages being monitored would be checked every once in a while.
If "enough" pages showed "enough" decay or needed "enough"
error correction, then all of the pages that had been
rewritten that many times or more, and which hadn't been refreshed
for the same length of time or more, would have their data moved
or refreshed in place. The SSD could be divided
into areas depending on physical location on the device, with
the "extra" rewrites in each area driven by the monitored pages
within that area.
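
To make the idea concrete, here is a rough sketch of the kind of
canary-page policy I have in mind, written as runnable Python.
Everything in it is invented for illustration: the thresholds, the
simulated Page objects, and the stand-in for reading back ECC
statistics. A real implementation would live in controller firmware
with access to the real flash arrays.

import random

EXPECTED_LIFE = 3000        # assumed average rewrite lifetime of a page
PROMOTE_STEP = 0.10         # promote canaries at 10%, 20%, ... of that life
DECAY_LIMIT = 8             # correctable bits that count as "enough" decay

class Page:
    def __init__(self, addr):
        self.addr = addr
        self.writes = 0     # program/erase count
        self.age = 0        # time since last refresh, arbitrary units
        self.is_canary = False

def correctable_errors(page):
    # Stand-in for reading the page and asking the ECC engine how many
    # bits it had to fix; faked here as growing with wear and age.
    return int(page.writes / 500 + page.age / 10 + 2 * random.random())

def maybe_promote(page):
    # Pages that cross an "interesting" multiple of the expected life
    # become additional monitored pages.
    step = int(EXPECTED_LIFE * PROMOTE_STEP)
    if page.writes > 0 and page.writes % step == 0:
        page.is_canary = True

def periodic_check(pages):
    canaries = [p for p in pages if p.is_canary]
    decayed = [p for p in canaries if correctable_errors(p) >= DECAY_LIMIT]
    if not canaries or len(decayed) < max(1, len(canaries) // 10):
        return                        # not "enough" canaries have decayed
    worst_writes = min(p.writes for p in decayed)
    worst_age = min(p.age for p in decayed)
    for p in pages:
        # Refresh every page at least as worn and at least as stale
        # as the decayed canaries (per area, if the device is divided up).
        if p.writes >= worst_writes and p.age >= worst_age:
            p.age = 0                 # rewrite in place or move and remap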

Simpler alternatives:
1. only refresh a page when its read error rate exceeds the typical
value for pages
2. only refresh a page when its read error rate indicates the data
will be lost soon, compared to typical values for pages
3. refresh everything that hasn't been refreshed in some
amount of time. Perhaps this time is automatically adjusted based
on experience with this particular device. Perhaps the time
interval is based on the current total number of writes
for this particular device.
(A rough sketch of these alternatives follows below.)
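
The sketch, again in Python with invented thresholds (TYPICAL_ERRORS,
ECC_LIMIT) and no claim that any shipping firmware works this way:

TYPICAL_ERRORS = 4          # assumed typical correctable-bit count per read
ECC_LIMIT = 12              # assumed correction limit per codeword

def needs_refresh(correctable_bits, hours_since_refresh, interval_hours):
    # Alternative 1: error rate already exceeds the typical value for pages.
    if correctable_bits > TYPICAL_ERRORS:
        return True
    # Alternative 2: decay says the data will be lost soon (crudely:
    # within a couple of bits of what the ECC can still correct).
    if correctable_bits >= ECC_LIMIT - 2:
        return True
    # Alternative 3: blanket refresh once the (possibly adaptive) interval passes.
    return hours_since_refresh >= interval_hours

def adapt_interval(base_hours, total_device_writes, rated_device_writes):
    # Shrink the interval as cumulative wear grows, since worn cells
    # leak faster; the linear scaling is purely illustrative.
    wear = min(1.0, total_device_writes / rated_device_writes)
    return base_hours * (1.0 - 0.5 * wear)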

My question/proposal is about adding monitoring at a
finer grain than the entire device.

NOTE:
The manufacturers keep everything secret, so I
can't guess how much the data loss rate would decrease,
how much the read speed would increase, or whether the average usable
life, in total data written by the user, would increase or decrease.

It might be the case that refreshing everything once a month
would be enough to greatly decrease read error correction
time and greatly reduce data loss, while at the same time using
less than 10% of the life of a device.
(A 10-year design life means 120 refresh writes used.
Typical MLC life numbers for higher quality devices
are 1 full write/day for 5 years = 365*1*5 = 1825 average writes
of the user capacity amount. Even if you take into
account over-provisioning, you still have an average of more
than 1200 writes/cell available. These devices might actually
have an expected life of about 3000 writes/cell.)

Lower quality devices typically are rated for 5 years and
probably have a design life of 5 years also. These can
be written about 0.1 full writes/day. That would indicate an
expected average of only 180 or so writes/cell, but judging
by the press, I think the expected average
life is 700 or so. 60 periodic rewrites might "waste"
1/3 to 1/10 of the device life. Thus, I think that
finer grained monitoring could pay off for these devices.
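
Spelling the arithmetic out (the endurance figures are guesses pieced
together from the press, not vendor specifications):

rated_drive_writes = 1 * 365 * 5                     # 1 full write/day for 5 years = 1825
over_provisioning = 0.5                              # assume ~50% extra raw flash
print(rated_drive_writes / (1 + over_provisioning))  # ~1217 writes/cell implied by the rating

monthly_refreshes = 12 * 5                           # 60 refresh writes over 5 years
print(monthly_refreshes / 700)                       # ~0.086, roughly 1/12 of a 700-cycle life
print(monthly_refreshes / 180)                       # ~0.33, a full 1/3 of a 180-cycle life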

I started thinking about this due to the
Samsung 840 EVO performance drop, which, according to
trade publications, turned out to be related to excessive
time taken by read error recovery of "old" data.

I haven't seen a press release or page at www.samsung.com that
confirms that the problem is due to read error recovery,
but here is a pointer to a description of the patch:
https://www.samsung.com/global/busin...downloads.html
at: "Samsung SSD 840 EVO Performance Restoration Software"
#2 - VanguardLH - October 16th 14, 07:37 PM - posted to comp.sys.ibm.pc.hardware.storage

> Mark F wrote:
>
> Do any SSDs use any pages to monitor the expected life of the product?
>
> [rest of quoted post #1 snipped]


You are talking about waning retentivity exhibited by magnetic storage
media. Flash memory doesn't exhibit that defect. Oxide stress on the
junctions during writes is what shortens their lifespans (and why
reserved space is used to mask the bad spots but that remapping slows
the device, too). When the reserved space gets consumed, the device
catastrophically fails. The device has wear levelling algorithms
(http://en.wikipedia.org/wiki/Solid-s...#Wear_leveling) to
exercise different junctions for writes to reduce oxide stress on any
particular junction (i.e., spread out the stress). That's why you don't
defrag an SSD device.
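
For readers who haven't met the term: the core of (dynamic) wear
levelling is just "write to the least-worn free block". A toy sketch,
leaving out static-data migration, bad-block handling, and everything
else a real controller does:

def pick_block(free_blocks, erase_counts):
    # Allocate the free block with the lowest erase count so that
    # program/erase stress spreads evenly across the flash.
    return min(free_blocks, key=lambda b: erase_counts[b])

erase_counts = {0: 120, 1: 95, 2: 300}
print(pick_block([0, 1, 2], erase_counts))   # -> 1, the least-worn block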

You are also talking about MLC NAND flash memory. MLC (multi-level cell),
used to increase density, results in less reliable reading because of the
less distinct change in states. SLC (single-level cell) is most reliable
but most costly. MLC gives more bits per package (i.e., you get more
bytes for your buck) at the cost of performance and reliability. At
Newegg.com, for example, you can find over 900 MLC products but only 1
SLC, and a 32GB MLC costs $60 versus a 32GB SLC at $550.

http://en.wikipedia.org/wiki/Multi-level_cell

That software you mentioned is to apply a firmware update. Then it
realigns the data per whatever algorithm change the firmware introduced.

Using your scheme, the testing would be unreliable. There would be lots of
reads that succeed (with or without correction) and then a failure.
But the failure isn't permanent and subsequent reads would succeed. MLC
means less reliable reading. That's the nature of the beast. That's
why correction algorithms are especially needed for MLC.
Testing one spot for its rate of read failures does not
indicate what some other weaker or stronger junction may exhibit.
#3 - Mark F - October 17th 14, 07:15 PM - posted to comp.sys.ibm.pc.hardware.storage

(This is meant as a reply to VanguardLH, rather than a post
to the newsgroup, but he didn't supply an email address.)
On Thu, 16 Oct 2014 13:37:54 -0500, VanguardLH wrote:

(I've kept everything together so that there is no need
to look for old pieces of the discussion.)

>> Mark F wrote:
>>
>> [original post snipped; quoted in full in post #1 above]


> You are talking about waning retentivity exhibited by magnetic storage
> media.

No, I am talking about flash memory. The charges leak away over time,
and the rate of leakage from a given cell increases as the
number of writes to the cell increases.

A typical hidden "spec" that manufacturers used circa 2010
was that end of life for a device was when data would be lost
after 1 year of unpowered storage.

My guess is many consumer devices now are designed for less than
one year life.

I also guess that most manufacturers assume that SSDs, as
contrasted with flash memory keys, are always powered on.
Thus the powered down data retention time for SSDs may be less
than for flash memory keys. This may apply to both consumer
and "enterprise" devices.

> Flash memory doesn't exhibit that defect.

Yes, the mechanism is different, but the limited storage
time happens for both.

Note that in 1995, 15 or even 100 years was considered the
unpowered storage retention time. Also, the number of
write cycles had been gradually increasing from a few hundred
to 100,000 or even more, even though the memory was getting
denser.

By 2010 (or perhaps a few years before), 1 year of powered-off
retention was considered end of life for a flash
memory key or similar device.

> Oxide stress on the
> junctions during writes is what shortens their lifespans (and why
> reserved space is used to mask the bad spots but that remapping slows
> the device, too). When the reserved space gets consumed, the device
> catastrophically fails. The device has wear levelling algorithms
> (http://en.wikipedia.org/wiki/Solid-s...#Wear_leveling) to
> exercise different junctions for writes to reduce oxide stress on any
> particular junction (i.e., spread out the stress).

Wear leveling is to spread out the stress of writing, not the
stress of charge storage. Rewriting of data serves
to refresh the data. Many devices periodically scan, looking
for data that needs refreshing.

> That's why you don't
> defrag an SSD device.

I don't think it is true that you should never defrag an SSD (or
a flash memory key for that matter, but let us just talk about SSDs).

Why would you defrag? To increase read speed by reducing seek
delays and the number of I/O operations for large transfers.

I think if you look at the performance of most consumer SSDs
you will find that they do in fact act as if they have
seek times. (It is possible that the "seek" times are due to
extra operations within the device because the data is not
contiguous, even though it appears contiguous from the user
point of view. There may be other factors affecting the
speed of the device, but they do in fact look like "seek"
times.)

If the user view is fragmented there will be more overhead
in the operating system and more I/O operations to the device.

Current defrag programs get rid of the extra I/O operations,
but not the pseudo-"seek" times from the device. I have suggested
that someone work with manufacturers to make a defrag program
that defrags the data as actually stored in the SSD.

Well, you say, spinning disks have 10 millisecond access times and
consumer SSDs have 0.1 millisecond access times, so why bother?
The answer is that many consumer SSDs will jump to 0.2 or even
0.3 millisecond access times as things get fragmented.
This is likely to reduce performance to 1/2 or even 1/3 of
the ideal performance.
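
If you want to see whether a particular drive behaves this way, a crude
measurement is enough: time small reads at sequential offsets versus
scattered offsets within one large file. This is only a sketch; the file
name is a placeholder, and on a real system the OS page cache and
read-ahead will flatter the sequential case unless you drop caches or
use direct I/O.

import os, random, time

def avg_read_seconds(path, offsets, size=4096):
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for off in offsets:
            os.pread(fd, size, off)
        return (time.perf_counter() - start) / len(offsets)
    finally:
        os.close(fd)

def compare(path, count=2000, size=4096):
    # The file should be at least count * size bytes long.
    file_size = os.path.getsize(path)
    sequential = [i * size for i in range(count)]
    scattered = [random.randrange(0, file_size - size) for _ in range(count)]
    print("sequential:", avg_read_seconds(path, sequential, size))
    print("scattered: ", avg_read_seconds(path, scattered, size))

# compare("some_large_test_file.bin")   # placeholder path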

I feel that for consumer products the possible gain from defragging,
even using a program that just defrags the user view, is
worth it if I have seen a performance drop. Things probably
won't get any more fragmented on the device and might get somewhat
defragmented.

My preferred technique for data disks is to copy to a new device,
since that lets me put aside the old device
as a backup, defragments things, and decreases the size of the
NTFS Master File Table. Most SSDs also do a pretty good
job of keeping the fragmentation on the device low.
(For system disks, I might do a clone, defrag the clone,
then clone to a third device to use as my new system disk.
The extra clone operation is so that if the defrag messes up
I still have the original.)

My disks (spinning and system) have lots of free
space, so:
1. they don't get very fragmented, so typically I
defragment less often than 2 times per year.
2. defragmenting programs don't have to write
the data more than a time or two. (I don't
know what the Write Amplification Factor is.)

(2 times/year * WAF 3 * new disks every 3 years)
= about 18 writes

The worst-case life that I have heard of for consumer
stuff is about 700, so I don't expect 18 extra writes,
or even 108 extra (for 1/month), to be a big deal.



So far, I only use SSDs for my system disks and testing.


I use SpinRite on my backup drives every 6 months,
but I am getting concerned that 6 months may be too long
a time for consumer SSDs in a powered off state.

I use about 20 SSDs for backups and about 200 spinning drives
for backups.

I don't know the performance numbers for "enterprise" SSDs.
Will, for example, fragmentation reduce the number of I/Os
from 200000 per second to 100000 or even 75000 per second,
or will the number remain above 150000 per second?



> You are also talking about MLC NAND flash memory. MLC (multi-level cell),
> used to increase density, results in less reliable reading because of the
> less distinct change in states. SLC (single-level cell) is most reliable
> but most costly. MLC gives more bits per package (i.e., you get more
> bytes for your buck) at the cost of performance and reliability.

The cost to chip manufacturers per bit for MLC
is about 1/2 that of SLC.

The cost to chip manufacturers per bit for TLC
is about 1/3 that of SLC.

However, the price to users for SLC is typically about 10
times the price to users of TLC.

"Enterprise" stuff costs users more per bit, but I haven't
looked at the ratios between SLC, MLC, and TLC for users.

> At
> Newegg.com, for example, you can find over 900 MLC products but only 1
> SLC, and a 32GB MLC costs $60 versus a 32GB SLC at $550.
>
> http://en.wikipedia.org/wiki/Multi-level_cell
>
> That software you mentioned is to apply a firmware update. Then it
> realigns the data per whatever algorithm change the firmware introduced.

Yes.

> Using your scheme, the testing would be unreliable. There would be lots of
> reads that succeed (with or without correction) and then a failure.
> But the failure isn't permanent and subsequent reads would succeed. MLC
> means less reliable reading. That's the nature of the beast. That's
> why correction algorithms are especially needed for MLC.
> Testing one spot for its rate of read failures does not
> indicate what some other weaker or stronger junction may exhibit.

I indicated that there might be nothing to be gained by monitoring
each chip, each array, or even at finer granularity. I just thought
that the manufacturers should consider how local the monitoring
has to be.

My main point, however, was that looking at the (correctable)
error rate that is being seen is not good enough: I feel
that a better device lifetime estimate can be made by seeing
how retention time varies with the number of writes to a given
location on the actual device, not just on other devices
with the same technology or from the same batch.

With rewrite lifetimes of 100,000 and increasing with
new technology, and retention time staying about constant
at more than the expected service life of the device,
batch or process parameters were fine.

With 700 cycles and decreasing, and expected
powered-off retention time of 1 year or less and decreasing,
closer monitoring is needed.

Using 1% of the device for test cells would find problems
early, even though it would reduce spares from 10% or 20%
to 9% or 19%. 1% of 1TB is 10GB, so it sounds like
a lot, but it isn't really.
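
To illustrate the kind of estimate I mean, here is a toy curve fit:
feed in (write count, observed retention) pairs measured on the
device's own test cells and extrapolate when retention drops below a
requirement. The sample numbers and the simple exponential model are
invented; the point is only that the estimate comes from this device,
not from the batch or process average.

import math

# (program/erase cycles, observed retention in months) from test cells
samples = [(100, 60.0), (500, 40.0), (1000, 22.0), (1500, 13.0)]

# Least-squares fit of log(retention) = a + b * cycles (b will be negative).
n = len(samples)
sx = sum(c for c, _ in samples)
sy = sum(math.log(r) for _, r in samples)
sxx = sum(c * c for c, _ in samples)
sxy = sum(c * math.log(r) for c, r in samples)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

required_months = 12.0
cycles_at_limit = (math.log(required_months) - a) / b
print(round(cycles_at_limit), "cycles until retention falls below",
      required_months, "months")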
 



