#26 July 16th 04, 06:19 PM

In article ,
Ron Reaugh wrote:
....
A first stab at that process is called nightly backup and the second stab is
scheduled defrags. "silent sector deterioration" can happen but is usually
an isolated sector here or there and is quite uncommon.

Yes, good arrays all have scrubbing capabilities (or should have
them). But life isn't quite so easy. Many disk workloads show very
high locality: For long stretches, the actuator stays at or near the
same position. If you start scrubbing carelessly while a
low-intensity foreground workload is running, the response time for
real IOs can increase quite precipitously. So the trick with
implementing scrubbing is to forecast when the foreground workload
will be idle. Like all forecasting of the future, this is quite
difficult (if I knew how to do it, I would play the stock market, and
get out of the storage business).
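
To make the idle-forecasting problem concrete, here is a minimal sketch of the simplest heuristic: only issue a scrub read once the foreground workload has been quiet for some threshold. The class name, method names and the 0.5 second threshold are all hypothetical, chosen for illustration; a real array would use something more sophisticated, and this is not how any particular product does it.

```python
class IdleScrubScheduler:
    """Gate scrub reads on how long the foreground workload has been quiet.

    Hypothetical sketch: names and the default threshold are assumptions.
    """

    def __init__(self, idle_threshold_s=0.5):
        self.idle_threshold_s = idle_threshold_s
        self.last_foreground_io = None  # time of the most recent real IO

    def note_foreground_io(self, now):
        # Call this whenever a real (foreground) IO arrives.
        self.last_foreground_io = now

    def may_scrub(self, now):
        # Scrub only if no foreground IO has been seen yet, or the last
        # one is at least idle_threshold_s in the past.
        if self.last_foreground_io is None:
            return True
        return (now - self.last_foreground_io) >= self.idle_threshold_s
```

The weakness is exactly the one described above: "quiet for the last half second" is a guess about the future, and a foreground IO that arrives just after a scrub read has moved the actuator still pays for the extra seek.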

Note that good scrubbing has to be done internally to the array,
because external scrubbing (for example a full backup, or just reading
the block device end to end) will not touch all sectors on all disks.
And depending on how the array is implemented, it may never touch some
sectors (for example, as long as no disk has failed, most arrays will
never read the parity block on a RAID-5 group). So this isn't
something the user of a disk array can take care of himself.
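
As a rough illustration of why an end-to-end read from the host misses sectors, the sketch below maps logical blocks onto a RAID-5 group with rotating parity and lists which (disk, stripe) positions a full read never touches. The particular rotation used and the one-block-per-stripe-unit simplification are assumptions for the example; real arrays use various layouts.

```python
def parity_disk(stripe, n_disks):
    # Assumed layout: parity rotates backwards from the last disk.
    return (n_disks - 1 - stripe) % n_disks


def locate(logical_block, n_disks):
    # Map a host-visible block to (disk, stripe), skipping the parity disk.
    stripe, idx = divmod(logical_block, n_disks - 1)
    p = parity_disk(stripe, n_disks)
    disk = idx if idx < p else idx + 1
    return disk, stripe


def untouched_by_full_read(n_disks, n_stripes):
    # Everything the host reads when it reads the block device end to end.
    touched = {locate(b, n_disks) for b in range(n_stripes * (n_disks - 1))}
    everything = {(d, s) for d in range(n_disks) for s in range(n_stripes)}
    return everything - touched


if __name__ == "__main__":
    # With 4 disks and 8 stripes, the 8 parity blocks are never read.
    print(sorted(untouched_by_full_read(4, 8)))
```

In this model one block per stripe, the parity block, stays unread as long as no disk has failed, so a latent error there is only discovered during a rebuild, when it is too late.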

Good RAID 1 will
fill the new/replacement drive in spite of such a sector read error and then
one is left with an operable system with an isolated read error that may be
dealt with. Depending on the definition of "data loss" this issue may not
count and is relatively obscure. Modern HDs are quite good at being able to
read/recover their data.
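
The rebuild behaviour described in the quoted text can be sketched as follows: copy every sector from the surviving mirror, and when one is unreadable, record it and carry on rather than aborting the rebuild. Modelling an unreadable sector as None, the 512-byte sector size and the zero-filled placeholder are assumptions made for the example.

```python
SECTOR_SIZE = 512  # assumed sector size for the sketch


def rebuild_mirror(source):
    """Copy a surviving mirror to a replacement, tolerating read errors.

    `source` is a list of sectors; None models an unrecoverable read error.
    Returns the replacement contents and the list of bad sector numbers.
    """
    replacement = []
    bad_sectors = []
    for lba, data in enumerate(source):
        if data is None:
            # Do not abort: remember the bad sector and write a placeholder.
            bad_sectors.append(lba)
            replacement.append(b"\x00" * SECTOR_SIZE)
        else:
            replacement.append(data)
    return replacement, bad_sectors
```

The rebuild completes, but every entry in the bad sector list is a sector whose contents are gone, which is the point at issue in the reply that follows.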

Well, the promise of RAIDed disks is that there is NO data loss. I
personally think that as soon as I lose a sector, I have violated my
contract with the end user. Clearly losing one sector is better than
losing a whole LUN or a whole array. But if that sector is in an
allocated area (of the file system or the database that sits above),
the array has corrupted or invalidated data. That's why to many
customers the first bit error invalidates the whole LUN - as soon as
you lose a single sector, you'll have some explaining to do (this often
takes the form of a C-level executive having to call the customer and
apologize, followed by massive price cuts or rebates).

If you look at the introduction history and market penetration of the
big disk arrays (EMC Symmetrix, Hitachi Lightning, IBM Shark and so
on), you'll see that the "public perception" of data reliability has
been a big factor in selling and pricing; I don't want to go into
details, as they are sure to step on someone's toes. Whether the
"public perception" of data reliability is actually correlated with
the real incidence of data loss is an interesting study in mass
psychology and the power of marketing over engineering. But what is
clear is that there are many customers who are perfectly willing to pay
a lot of extra money (a factor of 2, 3 or 10 more than the lowest
bidder) and select a vendor that gives them a warm and fuzzy feeling
(and maybe also real technical advantages, or even contractual
guarantees) about the quality and reliability of the disk array.

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@lr _dot_ los-gatos _dot_ ca.us