June 22nd 09, 09:26 PM, posted to comp.arch.storage, comp.arch
Stephen Fuld

disk error bursts

Bill Todd wrote:

There hasn't been much discussion on this, so, with all the same caveats
I mentioned in my previous posts on disk technology, I'll give some
information.

> The frequency with which disk sectors become unreadable seems to have
> remained relatively stable over the years,


Yes.

> though one might hope that it will decrease significantly with the
> expected move to 4 KB sectors


Probably not - see below.

> (at least if correction bits per sector increase somewhere nearly
> commensurately).


Not exactly.

First of all, is the move to 4K sectors going to happen? One of the
last things I did before I retired 10 years ago was to attend a
meeting, hosted by IBM Research in San Jose and attended by all the
major disk manufacturers, to discuss this issue. I was the "systems
guy", as I was the only one who wasn't a disk device engineer type. I
talked about the issues relating to the interface (i.e., ATA had no
way to specify a sector length, so I talked about alternatives for
backwards compatibility), and about the people who, for various
reasons, didn't use exactly 512 bytes per sector and what would be
needed to satisfy them with larger sector sizes. Anyway, since the
transition to 4K sectors hasn't happened in ten years, I sort of
assumed that the issue was long dead and wasn't going to happen at
all, but I have been out of it, so I may very well be wrong.
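
For what it's worth, one alternative in that design space is to keep
the 512-byte logical interface and emulate it with read-modify-write
on 4K physical sectors. Here is a sketch of the general idea in
Python; the device calls are hypothetical, and I'm not claiming any
vendor does it this way:

    LOGICAL = 512
    PHYSICAL = 4096
    RATIO = PHYSICAL // LOGICAL   # 8 logical sectors per physical

    def write_logical(dev, lba, data):
        """Write one 512-byte logical sector to a drive with 4K
        physical sectors. `dev` is a hypothetical device object."""
        assert len(data) == LOGICAL
        phys, offset = divmod(lba, RATIO)
        sector = bytearray(dev.read_physical(phys))  # read whole 4K
        sector[offset * LOGICAL:(offset + 1) * LOGICAL] = data
        dev.write_physical(phys, bytes(sector))      # write it back

The obvious cost is that every small write turns into a read plus a
write, which is one reason such a transition is hard to sell.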

Now, getting back to newly bad sectors. Most of them were caused by a
phenomenon called thermal asperities (TAs). These are small (a
single-digit number of bits long) "bumps" in the surface of the media,
caused by additional particles sputtered onto the media. They were not
high enough to touch the heads when the heads were flying at normal
height, but if the fly height was at its lower margin, the heads could
"graze" the top of the asperity. This didn't damage the heads, but the
friction could generate enough heat to cause a local erasure of the
data. If the data was rewritten, things would be fine again. The
asperities were randomly distributed over the disk.
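
That "rewrite fixes it" recovery can be sketched as a scrub loop.
Real drives do this sort of thing inside the firmware; the read call
here is hypothetical and exists only to illustrate the idea:

    def scrub(dev, num_sectors):
        for lba in range(num_sectors):
            # Hypothetical call returning the data plus a flag saying
            # whether the ECC had to correct anything on this read.
            data, corrected = dev.read_with_ecc_status(lba)
            if corrected:
                # The media was locally erased (e.g. a TA graze) but
                # the ECC recovered the bits; rewriting restores the
                # full signal.
                dev.write(lba, data)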

To counteract these TAs, the vendors used variants of Reed-Solomon ECC
codes that could correct up to two bursts of, say, 10-12 bits each per
sector and detect up to three such bursts. This allowed correction of
up to two TAs per sector.
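
To put rough numbers on that, here is a back-of-the-envelope sizing
sketch in Python. The symbol size, burst length, and burst counts are
illustrative assumptions, not any vendor's actual parameters:

    import math

    m = 10               # bits per RS symbol (assumed)
    burst_bits = 12      # worst-case TA burst length (assumed)
    correct_bursts = 2   # bursts to correct per sector
    detect_bursts = 3    # bursts to at least detect per sector

    # Worst-case alignment: a burst of b bits can touch
    # 1 + ceil((b - 1) / m) consecutive symbols.
    symbols_per_burst = 1 + math.ceil((burst_bits - 1) / m)

    # A Reed-Solomon code with r check symbols can correct t symbol
    # errors while detecting d further ones whenever r >= 2*t + d.
    t = correct_bursts * symbols_per_burst
    d = (detect_bursts - correct_bursts) * symbols_per_burst
    r = 2 * t + d
    print(f"each {burst_bits}-bit burst touches <= {symbols_per_burst} symbols")
    print(f"need {r} check symbols = {r * m} ECC bits per sector")

With these numbers, each 12-bit burst can touch up to three 10-bit
symbols, so correcting two bursts while detecting a third costs
2*6 + 3 = 15 check symbols, i.e. about 150 ECC bits per sector.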

Longer defects were generally caused by a "scratch", where the head
contacted the media for a longer period of time, and these were of
arbitrary length. If they were short enough, they could be corrected
by the ECC. Of course, if they were too long, it represented a head
crash, and it really didn't matter what ECC you used. :-(

You would expect the size in bits of the TAs to get larger with
increasing BPI (linear bit density), but process improvements seemed
to just about keep up, so there wasn't much growth in bit-length
correction capability.

So, getting back to longer sectors, I would expect vendors to use ECCs
that have the capability to detect and correct more TAs per sector,
not longer bursts per se. Of course, if the errors were consecutive,
adding more correctable bursts does increase the length of a single
error you can correct, but that is secondary.
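
To make the consecutive-burst point concrete, continuing the
illustrative numbers from the sketch above:

    m, t = 10, 6  # assumed: 10-bit symbols, 6 correctable symbol errors
    # Worst-case alignment: a single burst of L bits spans at most
    # 1 + ceil((L - 1) / m) symbols, so the longest single burst that
    # is always correctable is (t - 1) * m + 1 bits.
    print((t - 1) * m + 1)  # 51 bits

So a code sized for two separate 12-bit bursts can, as a side effect,
correct one burst of up to 51 bits, but that is not what it is
optimized for.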

> But it's difficult to find information about how frequently error
> bursts that result in unreadability span a sector boundary.


And even if you did find it, it probably wouldn't be useful. Remember,
there is a lot of stuff between sectors on a disk; they are not
consecutive bits around the track. Besides the ECC, there are sync
fields, gaps, and occasional servo bursts (headers are long gone). So,
depending on the details, you can't correlate consecutive-sector
losses with what would happen with longer sectors.
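
A toy layout model makes the point; every field size here is invented
purely for illustration:

    DATA_BITS = 4096 * 8           # 4K sector payload (assumed)
    ECC_BITS = 400                 # check bits (assumed)
    SYNC_BITS, GAP_BITS = 64, 200  # sync field, inter-sector gap (assumed)
    SERVO_BITS, SECTORS_PER_SERVO = 1000, 4  # periodic servo (assumed)

    pos = 0
    for s in range(8):
        if s % SECTORS_PER_SERVO == 0:
            pos += SERVO_BITS            # occasional servo burst
        pos += SYNC_BITS                 # sync field
        start = pos
        pos += DATA_BITS + ECC_BITS      # data + ECC
        print(f"sector {s}: data+ECC bits {start}..{pos - 1}")
        pos += GAP_BITS                  # gap before the next sector

A burst that runs past one sector's data has to cross a gap (and
perhaps a servo field) before it reaches the next sector's data, which
is why counting consecutive bad sectors tells you little about how
longer sectors would behave.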


> In particular, I'm interested in whether two successive sectors ever
> (in a practical and quantifiable sense) become unreadable due to a
> single error burst, and if so whether such a burst ever affects more
> than two successive sectors (it would also be interesting to know how
> frequently bursts span tracks such that they affect sectors on
> adjacent tracks, though that is of less immediate significance to me).


Just as a matter of interest, disk vendors have scratch-detection
algorithms for scratches that run radially. Thus, if during formatting
they detect an error on the same sector (modulo effects like record
number skew and zoned bit recording) in, say, tracks 30, 31, 33, and
34, they will mark the corresponding sector in track 32 as bad, in
order to prevent a possible error later.
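
Here is a sketch of that heuristic in Python; the details (window
size, skew handling) are invented for illustration:

    def infer_scratch_victims(defects):
        """defects: set of (track, sector) pairs found bad during
        formatting, already corrected for skew and zoning. Returns
        extra pairs to proactively mark bad."""
        extra = set()
        for track, sector in defects:
            # Defects at track and track + 2 on the same sector
            # suggest a radial scratch passing through track + 1.
            middle = (track + 1, sector)
            if (track + 2, sector) in defects and middle not in defects:
                extra.add(middle)
        return extra

    found = {(30, 7), (31, 7), (33, 7), (34, 7)}
    print(infer_scratch_victims(found))   # {(32, 7)}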


But in direct answer to your question, I don't know of anyone who has
the statistics that you want.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)