#1
Out-of-order writing by disk drives
That disks write data out of order when write-back caching is enabled does not seem at all surprising, since that's one of the main potential benefits of having write-back caching enabled. I'd be far more concerned if you had found that disks ever wrote data out of order with write-back caching disabled (and indeed I've heard anecdotes that some did - perhaps because they never actually disabled write-back caching regardless of what they were told to do, in order to obtain better performance numbers, or simply due to incompetent firmware).

The only other explanation I can readily come up with for why 3064 sectors might be written out of order would involve the heuristics employed in the write-back caching algorithm (e.g., that's the maximum amount of cache space it will allow dirty data to occupy before destaging it to disk).

- bill
#2
Out-of-order writing by disk drives
Bill Todd writes:
> That disks write data out of order when write-back caching is enabled
> does not seem at all surprising, since that's one of the main potential
> benefits of having write-back caching enabled.

Yes. But some people seem to imagine that this is a very small effect that can be ignored without ill effects on the consistency of the on-disk data of a file system; this attitude is exemplified by the ext3 file system in Linux, which does not enable barrier=1 by default. The test demonstrates that the reordering can happen over several seconds.

> The only other explanation I can readily come up with for why 3064
> sectors might be written out of order would involve the heuristics
> employed in the write-back caching algorithm (e.g., that's the maximum
> amount of cache space it will allow dirty data to occupy before
> destaging it to disk).

That's a very good explanation. Given that the program ran significantly slower (about 6MB/s transfer rate) than what the drive is capable of (70MB/s), it's not surprising that in most tests the power-off fell between such batches, with only one falling during a batch.

Hmm, I could test my track-size theory by working on another area of the drive (but I am probably too lazy to do that; your theory sounds better anyway :-); if it's really a multiple of the track size, the number should change, because the track size varies. BTW, I used 1KB blocks, so it's 6128 sectors.

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
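To make the setup concrete, here is a minimal sketch of the kind of test under discussion - a hypothetical reconstruction, not the actual test program; the device path and the use of O_DIRECT to bypass the OS page cache (so that only the drive's own cache can reorder) are assumptions:

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 1024                  /* 1KB blocks, as in the test */
#define SEQ_START (200ULL << 30)    /* sequential area: middle of a 400GB disk */

int main(void)
{
    char *buf;
    if (posix_memalign((void **)&buf, 4096, BLOCK)) return 1;

    int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);  /* device path: assumption */
    if (fd < 0) { perror("open"); return 1; }

    for (uint64_t i = 0; ; i++) {
        memset(buf, 0, BLOCK);
        memcpy(buf, &i, sizeof i);   /* tag each block with a sequence number */

        /* keep rewriting block 0 with the current sequence number ... */
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK) break;

        /* ... while also streaming sequential blocks carrying the same
           number; any gap between the two after a crash is reordering */
        if (pwrite(fd, buf, BLOCK, SEQ_START + i * BLOCK) != BLOCK) break;
    }
    perror("pwrite");
    return 1;
}

After a power-off, scanning the sequential area for the highest counter and comparing it against the counter found in block 0 shows how far apart the drive let the two write streams drift.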
#3
Out-of-order writing by disk drives
Anton Ertl wrote:
> Yes. But some people seem to imagine that this is a very small effect
> that can be ignored without ill effects on the consistency of the
> on-disk data of a file system; this attitude is exemplified by the ext3
> file system in Linux, which does not enable barrier=1 by default. The
> test demonstrates that the reordering can happen over several seconds.

That indeed seems to be quite a long time - but then it wasn't so long ago that Unix systems would by default allow writes to languish for as much as 30 seconds (with no particular guarantees about ordering when they actually got destaged), so I can't really fault the disk vendors for this: as has always been the case, if you want ordering guarantees, you need to take explicit steps to ensure them.

> ... the program ran significantly slower (about 6MB/s transfer rate)
> than what the drive is capable of (70MB/s)

Interesting: it implies that the disk was destaging a few blocks every rev rather than waiting for a track to fill up (what are track sizes on those 400 GB disks - 0.5 MB or so?), but was still very reluctant to move the head to give those block 0 writes a reasonable chance. That doesn't strike me as a very good approach (achieving neither decent throughput nor reasonable fairness), assuming that you did present the non-block-0 writes in strictly ascending order.

Have you tested disks to see whether they indeed destage single large transfers out of order (as many claim to do when the write is at least a large percentage of a track in size)?

- bill
#4
Out-of-order writing by disk drives
Bill Todd writes:
>> The test demonstrates that the reordering can happen over several
>> seconds.
>
> That indeed seems to be quite a long time - but then it wasn't so long
> ago that Unix systems would by default allow writes to languish for as
> much as 30 seconds (with no particular guarantees about ordering when
> they actually got destaged), so I can't really fault the disk vendors
> for this: as has always been the case, if you want ordering guarantees,
> you need to take explicit steps to ensure them.

Yes, nowadays you can have them without turning off write caching completely, so it's entirely reasonable. There are file systems, like ext3 with data=ordered or data=journal, or BSD FFS with soft updates, that do give guarantees about ordering. But in order to implement these guarantees they must take those explicit steps, and ext3 does not do that by default.

>> the program ran significantly slower (about 6MB/s transfer rate) than
>> what the drive is capable of (70MB/s)
>
> Interesting: it implies that the disk was destaging a few blocks every
> rev rather than waiting for a track to fill up (what are track sizes
> on those 400 GB disks - 0.5 MB or so?)

At 70MB/s and 7200rpm = 120 revolutions/s, the track size is at least 70(MB/s)/120(/s) = 0.583MB. Probably a little larger, because aligning the head for the next platter or moving it to the next cylinder also costs a little time on each revolution.

My guess (inspired by you) is that it destaged 3064KB at a time. The slow transfer rate is probably a result of doing synchronous writes to the disk buffers; each write would only report completion when the data had arrived in the disk's buffers, and only then would the next write start and weave its way through the various subsystems.

> but was still very reluctant to move the head to give those block 0
> writes a reasonable chance. That doesn't strike me as a very good
> approach (achieving neither decent throughput nor reasonable fairness),
> assuming that you did present the non-block-0 writes in strictly
> ascending order.

Yes, if it waited about a second for the 3064KB to accumulate (the other 3MB/s are spent writing block 0 repeatedly) and then needed 40ms to write them to the platters, there would have been ample time to write block 0 between batches. My guess is that it tries to write the blocks roughly in order of age, and block 0 is rarely the oldest one it sees because it gets overwritten by younger instances all the time.

> Have you tested disks to see whether they indeed destage single large
> transfers out of order (as many claim to do when the write is at least
> a large percentage of a track in size)?

No. How would you test that? But given the results of this test, it seems most plausible to me that the ST340062 destages 3064KB at a time if it gets that much sequential data.

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
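Spelling out the arithmetic above (all numbers are taken from the posts; the 3064KB batch is the conjecture under discussion):

#include <stdio.h>

int main(void)
{
    double xfer  = 70e6;         /* sustained transfer rate, bytes/s */
    double rps   = 7200.0 / 60;  /* 7200 rpm = 120 revolutions/s */
    double batch = 3064e3;       /* conjectured destage batch, bytes */

    /* at most one full track passes under the head per revolution, so
       the sustained rate bounds the track size from below */
    printf("min track size: %.3f MB\n", xfer / rps / 1e6);          /* 0.583 */

    /* of the observed 6MB/s, half goes to block 0, so the sequential
       area fills at about 3MB/s */
    printf("time to accumulate one batch: %.2f s\n", batch / 3e6);  /* ~1.0 */
    printf("time to destage one batch: %.0f ms\n", batch / xfer * 1e3); /* ~44 */
    return 0;
}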
#5
Out-of-order writing by disk drives
Anton Ertl wrote:
> Yes, nowadays you can have them without turning off write caching
> completely, so it's entirely reasonable. There are file systems, like
> ext3 with data=ordered or data=journal, or BSD FFS with soft updates,
> that do give guarantees about ordering. But in order to implement these
> guarantees they must take those explicit steps, and ext3 does not do
> that by default.

I may have been too quick to ignore soft updates (AFAIK unique to BSD and thus not typical of Unix capabilities in general) and the optional behavior of ext3 (again, not generally available in most Unixes AFAIK) - my point was that I didn't think write-back delays (with resulting out-of-order writes) of even a few seconds constituted irresponsible behavior on the part of disk vendors, given the typical lack of ordering guarantees in the systems their disks ran in. (Actually, the main use of ATA and SATA disks may be in Windows boxes, so perhaps that should have been the focus of my comment: NTFS does attempt to control ordering, at least for critical metadata updates, even with write-back caching enabled - but I think only on drives that support the force-unit-access flag, which at least until somewhat recently many ATA and perhaps even SATA drives did not.)

> At 70MB/s and 7200rpm = 120 revolutions/s, the track size is at least
> 70(MB/s)/120(/s) = 0.583MB.

Duh - on my better days I would have thought of that rather than just being too lazy to look up the specs at seagate.com.

> Probably a little larger, because aligning the head for the next
> platter or moving it to the next cylinder also costs a little time on
> each revolution.

1 ms or less these days IIRC - around 10% +/- of a rev at 7200 rpm.

> My guess (inspired by you) is that it destaged 3064KB at a time. The
> slow transfer rate is probably a result of doing synchronous writes to
> the disk buffers; each write would only report completion when the data
> had arrived in the disk's buffers, and only then would the next write
> start and weave its way through the various subsystems.

Even so, that should result in something close to half the max transfer rate (it sounds as if all your writes were near the outer edge of the disk, so we don't have to worry about varying track sizes). Or, if after every 3+ MB written it seeked (sought? never thought about that...) to track 0 to update block 0 (perhaps the seek back could hide behind the next 3 MB transfer over the bus), that would still add only around 10 ms (short seek plus 1/2 rotation on average) to the roughly 55 ms write time, decreasing throughput only to a little under 30 MB/sec rather than to the 6 MB/sec that you saw.

That's why I suspected that it was destaging dirty data in much smaller chunks. For example, if it destaged 64 KB each time and then missed a full rev before continuing (but stayed on-track rather than going to track 0), the transfer rate would be under 7 MB/sec. But that would be a somewhat brain-damaged way to go about things given today's on-disk cache sizes and controller intelligence, since it could just use multi-buffering plus a smidge more cache space to accept new data continually while it destaged dirty data continually.

> My guess is that it tries to write the blocks roughly in order of age,
> and block 0 is rarely the oldest one it sees because it gets
> overwritten by younger instances all the time.

That explains why block 0 gets updated so infrequently, but not the abysmal transfer rate for the rest of the blocks. (And since block 0 is getting updated on the platters only rarely, that activity would seem to consume only a small percentage of the disk bandwidth, whatever it is.)

>> Have you tested disks to see whether they indeed destage single large
>> transfers out of order (as many claim to do when the write is at least
>> a large percentage of a track in size)?
>
> No. How would you test that?

By issuing continual near-full-track writes to random locations on a zero-filled disk and then pulling the plug a few times, to see whether any of them wound up with a partial write that did not start at the beginning of the request.

- bill
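A sketch of how that test might look - hypothetical throughout; the device path, the 512KB "near-full-track" request size, and the per-sector tagging scheme are all assumptions:

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR   512
#define REQ      (512 * 1024)     /* near-full-track request (~0.58MB tracks) */
#define NSECT    (REQ / SECTOR)
#define DISKSIZE (400ULL << 30)   /* 400GB drive, as in the thread */

int main(void)
{
    char *buf;
    if (posix_memalign((void **)&buf, 4096, REQ)) return 1;

    int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    srand(12345);
    for (uint64_t n = 0; ; n++) {
        /* tag every sector with (request number, sector index) so that,
           after pulling the plug, a scan of the zero-filled disk reveals
           any partial write that does not start at sector 0 of its
           request - i.e., an out-of-order destage within one transfer */
        for (unsigned s = 0; s < NSECT; s++) {
            uint64_t tag[2] = { n, s };
            memset(buf + s * SECTOR, 0, SECTOR);
            memcpy(buf + s * SECTOR, tag, sizeof tag);
        }
        uint64_t off = ((uint64_t)rand() % (DISKSIZE / REQ)) * REQ;
        if (pwrite(fd, buf, REQ, off) != REQ) { perror("pwrite"); return 1; }
    }
}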
#6
Out-of-order writing by disk drives
Bill Todd wrote:
> NTFS does attempt to control ordering, at least for critical metadata
> updates, even with write-back caching enabled - but I think only on
> drives that support the force-unit-access flag, which at least until
> somewhat recently many ATA and perhaps even SATA drives did not.

Yes, NTFS is careless about ordering on anything but the logfile, and logfile updates are done using the FUA bit. I don't know offhand how the FUA bit is interpreted by the Windows (S)ATA stack, but, from what I remember, the ATA spec had no analog at all until rather recently. Possibly the Windows (S)ATA stack flushes the whole in-drive cache before completing a FUA request, but this is a wild guess and may be wrong.

--
Maxim S. Shatskih
Windows DDK MVP
http://www.storagecraft.com
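For reference, the knob at the Win32 level looks like this (a sketch; how the (S)ATA stack below translates the flag - real FUA, a full cache flush, or nothing - is exactly the open question above):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_WRITE_THROUGH requests write-through semantics, which
       the storage stack forwards as FUA on drives that support it */
    HANDLE h = CreateFileA("logfile.bin", GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed\n");
        return 1;
    }

    const char rec[] = "log record";
    DWORD written;
    if (!WriteFile(h, rec, sizeof rec - 1, &written, NULL)) {
        fprintf(stderr, "WriteFile failed\n");
        return 1;
    }
    CloseHandle(h);
    return 0;
}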
#7
Out-of-order writing by disk drives
Bill Todd writes:
> I may have been too quick to ignore soft updates (AFAIK unique to BSD
> and thus not typical of Unix capabilities in general) and the optional
> behavior of ext3 (again, not generally available in most Unixes AFAIK) -

That "option" is the default for ext3. Concerning other Unix file systems, they at least try to preserve metadata consistency on crashes, and to do that they need guarantees about the order of writes. Journaling file systems need guarantees about the order of journal writes, as well as of journal writes relative to the writes the journal entries describe. And even the bad old BSD FFS performed synchronous metadata writes in order to preserve metadata consistency, and required these writes to happen in order for fsck to be able to recover the metadata: if the writes occurred in order, only one block could be wrong, and fsck relied on that.

> my point was that I didn't think write-back delays (with resulting
> out-of-order writes)

Write-back delays are one thing; out-of-order writes are a very different thing. Delaying the writes means that one loses a few seconds' worth of changes on a crash (which may or may not be acceptable); out-of-order writes can destroy the consistency of the file system.

> of even a few seconds constituted irresponsible behavior on the part
> of disk vendors, given the typical lack of ordering guarantees in the
> systems their disks ran in

IMO, running any file system that contains data worth preserving on a drive in a mode that allows reordering to happen beyond the control of the file system is irresponsible on the part of the sysadmin; and making such behaviour the default is irresponsible on the part of the file system developer. I.e., if the drive offers support for queuing or tagged commands, the file system should use them by default, and if it doesn't, the file system should turn off write caching on the drive by default.

>> My guess (inspired by you) is that it destaged 3064KB at a time. The
>> slow transfer rate is probably a result of doing synchronous writes
>> to the disk buffers ...
>
> Even so, that should result in something close to half the max
> transfer rate

IMO the delays come from latencies in the communication between the user process, various kernel components, the host adapter, and the disk controller, because all of that goes on synchronously (the only asynchronous part is the writing of the data to the platters); therefore I expect the maximum disk write rate to have little influence. I expect that if I double the block size, the transfer rate doubles, until it approaches the maximum transfer rate.

> (it sounds as if all your writes were near the outer edge of the disk,
> so we don't have to worry about varying track sizes).

The non-block-0 writes start in the middle of the device.

>> No. How would you test that?
>
> By issuing continual near-full-track writes to random locations on a
> zero-filled disk and then pulling the plug a few times, to see whether
> any of them wound up with a partial write that did not start at the
> beginning of the request.

That's certainly an interesting test. Maybe next time (another ten years?).

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
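For concreteness, the two knobs under discussion, on a Linux system of that era (device names are placeholders; whether a given drive actually honors the cache setting is, as noted above, part of the problem):

# enable write barriers on an ext3 mount (not the default at the time)
mount -o remount,barrier=1 /dev/sda1 /mnt/data

# or, if the drive offers no usable ordering support, turn off its
# write-back cache entirely (-W1 turns it back on)
hdparm -W0 /dev/sda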
#8
Out-of-order writing by disk drives
Anton Ertl wrote:
> Journaling file systems need guarantees about the order of journal
> writes, as well as of journal writes relative to the writes the
> journal entries describe.

Usually, the update is first written to the journal (and must reach the hard disk media) and only then is it reflected in the actual metadata. In this case, it is enough to use FUA (or something similar emulated on ATA - for instance, a flush of the drive's cache after each such write) on journal writes only.

--
Maxim S. Shatskih
Windows DDK MVP
http://www.storagecraft.com
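A minimal sketch of that ordering discipline at the application level, assuming a POSIX system where fsync() forces the data to the medium (on the drives discussed in this thread, that assumption is precisely what may fail); the file names and record format are made up for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int journal = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int meta    = open("metadata.db", O_WRONLY | O_CREAT, 0644);
    if (journal < 0 || meta < 0) { perror("open"); return 1; }

    const char rec[] = "set inode 42 size=1024\n";

    /* 1. the intended update goes to the journal first ... */
    if (write(journal, rec, sizeof rec - 1) < 0) { perror("write"); return 1; }

    /* 2. ... and must reach the medium (the FUA/cache-flush step)
       before the in-place update is attempted */
    if (fsync(journal) < 0) { perror("fsync"); return 1; }

    /* 3. only now is the metadata rewritten in place; if power fails
       here, replaying the journal restores consistency */
    if (pwrite(meta, rec, sizeof rec - 1, 0) < 0) { perror("pwrite"); return 1; }
    if (fsync(meta) < 0) { perror("fsync"); return 1; }
    return 0;
}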
#9
Out-of-order writing by disk drives
"Maxim S. Shatskih" writes:
writes. Journaling file systems need guarantees about the order of journal writes as well as journal writes relative to the writes the journal entries describe. Usually, the update is first written to the journal (and must reach the = hard disk media) and only then is reflected in the actual metadata. In this case, it is enough to only use FUA (or some similar thing = emulated on ATA, for instance, drive's cache flush after each such = write) on journal writes. Yes, any feature that ensures partial ordering is sufficient. But using write caching without any such features is not. - anton -- M. Anton Ertl Some things have to be seen to be believed Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html |