#1
Out-of-order writing by disk drives
That disks write data out of order when write-back caching is enabled does not seem at all surprising, since that's one of the main potential benefits of having write-back caching enabled. I'd be far more concerned if you had found that disks ever wrote data out of order with write-back caching disabled (and indeed I've heard anecdotes that some did - perhaps because they never actually disabled write-back caching regardless of what they were told to do, in order to obtain better performance numbers, or simply due to incompetent firmware).

The only other explanation I can readily come up with for why 3064 sectors might be written out of order would involve the heuristics employed in the write-back caching algorithm (e.g., that's the maximum amount of cache space it will allow dirty data to occupy before destaging it to disk).

- bill
#2
Out-of-order writing by disk drives
Bill Todd writes:
> That disks write data out of order when write-back caching is enabled
> does not seem at all surprising, since that's one of the main potential
> benefits of having write-back caching enabled.

Yes. But some people seem to imagine that this is a very small effect that can be ignored without ill effects on the consistency of the on-disk data of a file system; this attitude is exemplified by the ext3 file system in Linux, which does not enable barrier=1 by default. The test demonstrates that the reordering can happen over several seconds.

> The only other explanation I can readily come up with for why 3064
> sectors might be written out of order would involve the heuristics
> employed in the write-back caching algorithm (e.g., that's the maximum
> amount of cache space it will allow dirty data to occupy before
> destaging it to disk).

That's a very good explanation. Given that the program ran significantly slower (about 6MB/s transfer rate) than what the drive is capable of (70MB/s), it's not surprising that in most tests the power-off fell between such batches, with only one falling during a batch.

Hmm, I could test my track-size theory by working on another area of the drive (but I am probably too lazy to do that; your theory sounds better anyway :-); if it's really a multiple of the track size, the number should change, because the track size varies. BTW, I used 1KB blocks, so it's 6128 sectors.

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
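To make the setup concrete, here is a minimal sketch of the kind of test under discussion - a hypothetical reconstruction, not the actual test program; the device path and the use of O_DIRECT to bypass the OS page cache (so that only the drive's own cache can reorder) are assumptions:

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 1024                  /* 1KB blocks, as in the test */
#define SEQ_START (200ULL << 30)    /* sequential area: middle of a 400GB disk */

int main(void)
{
    char *buf;
    if (posix_memalign((void **)&buf, 4096, BLOCK)) return 1;

    int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);  /* device path: assumption */
    if (fd < 0) { perror("open"); return 1; }

    for (uint64_t i = 0; ; i++) {
        memset(buf, 0, BLOCK);
        memcpy(buf, &i, sizeof i);   /* tag each block with a sequence number */

        /* keep rewriting block 0 with the current sequence number ... */
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK) break;

        /* ... while also streaming sequential blocks carrying the same
           number; any gap between the two after a crash is reordering */
        if (pwrite(fd, buf, BLOCK, SEQ_START + i * BLOCK) != BLOCK) break;
    }
    perror("pwrite");
    return 1;
}

After a power-off, scanning the sequential area for the highest counter and comparing it against the counter found in block 0 shows how far apart the drive let the two write streams drift.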
#3
Out-of-order writing by disk drives
Anton Ertl wrote:
> Yes. But some people seem to imagine that this is a very small effect
> that can be ignored without ill effects on the consistency of the
> on-disk data of a file system; this attitude is exemplified by the ext3
> file system in Linux, which does not enable barrier=1 by default. The
> test demonstrates that the reordering can happen over several seconds.

That indeed seems to be quite a long time - but then it wasn't so long ago that Unix systems would by default allow writes to languish for as much as 30 seconds (with no particular guarantees about ordering when they actually got destaged), so I can't really fault the disk vendors for this: as has always been the case, if you want ordering guarantees, you need to take explicit steps to ensure them.

> ... the program ran significantly slower (about 6MB/s transfer rate)
> than what the drive is capable of (70MB/s)

Interesting: it implies that the disk was destaging a few blocks every rev rather than waiting for a track to fill up (what are track sizes on those 400 GB disks - 0.5 MB or so?), but was still very reluctant to move the head to give those block 0 writes a reasonable chance. That doesn't strike me as a very good approach (achieving neither decent throughput nor reasonable fairness), assuming that you did present the non-block-0 writes in strictly ascending order.

Have you tested disks to see whether they indeed destage single large transfers out of order (as many claim to do when the write is at least a large percentage of a track in size)?

- bill
#4
Out-of-order writing by disk drives
Bill Todd writes:
>> The test demonstrates that the reordering can happen over several
>> seconds.
>
> That indeed seems to be quite a long time - but then it wasn't so long
> ago that Unix systems would by default allow writes to languish for as
> much as 30 seconds (with no particular guarantees about ordering when
> they actually got destaged), so I can't really fault the disk vendors
> for this: as has always been the case, if you want ordering guarantees,
> you need to take explicit steps to ensure them.

Yes, nowadays you can have them without turning off write caching completely, so it's entirely reasonable. There are file systems, like ext3 with data=ordered or data=journal, or BSD FFS with soft updates, that do give guarantees about ordering. But in order to implement these guarantees they must take those explicit steps, and ext3 does not do that by default.

>> the program ran significantly slower (about 6MB/s transfer rate) than
>> what the drive is capable of (70MB/s)
>
> Interesting: it implies that the disk was destaging a few blocks every
> rev rather than waiting for a track to fill up (what are track sizes
> on those 400 GB disks - 0.5 MB or so?)

At 70MB/s and 7200rpm = 120 revolutions/s, the track size is at least 70(MB/s)/120(/s) = 0.583MB. Probably a little larger, because aligning the head for the next platter or moving it to the next cylinder also costs a little time on each revolution.

My guess (inspired by you) is that it destaged 3064KB at a time. The slow transfer rate is probably a result of doing synchronous writes to the disk buffers; each write would only report completion when the data had arrived in the disk's buffers, and only then would the next write start and weave its way through the various subsystems.

> but was still very reluctant to move the head to give those block 0
> writes a reasonable chance. That doesn't strike me as a very good
> approach (achieving neither decent throughput nor reasonable fairness),
> assuming that you did present the non-block-0 writes in strictly
> ascending order.

Yes, if it waited about a second for the 3064KB to accumulate (the other 3MB/s are spent writing block 0 repeatedly) and then needed 40ms to write them to the platters, there would have been ample time to write block 0 between batches. My guess is that it tries to write the blocks roughly in order of age, and block 0 is rarely the oldest one it sees because it gets overwritten by younger instances all the time.

> Have you tested disks to see whether they indeed destage single large
> transfers out of order (as many claim to do when the write is at least
> a large percentage of a track in size)?

No. How would you test that? But given the results of this test, it seems most plausible to me that the ST340062 destages 3064KB at a time if it gets that much sequential data.

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
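Spelling out the arithmetic above (all numbers are taken from the posts; the 3064KB batch is the conjecture under discussion):

#include <stdio.h>

int main(void)
{
    double xfer  = 70e6;         /* sustained transfer rate, bytes/s */
    double rps   = 7200.0 / 60;  /* 7200 rpm = 120 revolutions/s */
    double batch = 3064e3;       /* conjectured destage batch, bytes */

    /* at most one full track passes under the head per revolution, so
       the sustained rate bounds the track size from below */
    printf("min track size: %.3f MB\n", xfer / rps / 1e6);          /* 0.583 */

    /* of the observed 6MB/s, half goes to block 0, so the sequential
       area fills at about 3MB/s */
    printf("time to accumulate one batch: %.2f s\n", batch / 3e6);  /* ~1.0 */
    printf("time to destage one batch: %.0f ms\n", batch / xfer * 1e3); /* ~44 */
    return 0;
}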
#5
Out-of-order writing by disk drives
Anton Ertl wrote:
> Yes, nowadays you can have them without turning off write caching
> completely, so it's entirely reasonable. There are file systems, like
> ext3 with data=ordered or data=journal, or BSD FFS with soft updates,
> that do give guarantees about ordering. But in order to implement these
> guarantees they must take those explicit steps, and ext3 does not do
> that by default.

I may have been too quick to ignore soft updates (AFAIK unique to BSD and thus not typical of Unix capabilities in general) and the optional behavior of ext3 (again, not generally available in most Unixes AFAIK) - my point was that I didn't think write-back delays (with resulting out-of-order writes) of even a few seconds constituted irresponsible behavior on the part of disk vendors, given the typical lack of ordering guarantees in the systems their disks ran in. (Actually, the main use of ATA and SATA disks may be in Windows boxes, so perhaps that should have been the focus of my comment: NTFS does attempt to control ordering, at least for critical metadata updates, even with write-back caching enabled - but I think only on drives that support the force-unit-access flag, which at least until somewhat recently many ATA and perhaps even SATA drives did not.)

> At 70MB/s and 7200rpm = 120 revolutions/s, the track size is at least
> 70(MB/s)/120(/s) = 0.583MB.

Duh - on my better days I would have thought of that rather than just being too lazy to look up the specs at seagate.com.

> Probably a little larger, because aligning the head for the next
> platter or moving it to the next cylinder also costs a little time on
> each revolution.

1 ms or less these days IIRC - around 10% +/- of a rev at 7200 rpm.

> My guess (inspired by you) is that it destaged 3064KB at a time. The
> slow transfer rate is probably a result of doing synchronous writes to
> the disk buffers; each write would only report completion when the data
> had arrived in the disk's buffers, and only then would the next write
> start and weave its way through the various subsystems.

Even so, that should result in something close to half the max transfer rate (it sounds as if all your writes were near the outer edge of the disk, so we don't have to worry about varying track sizes). Or, if after every 3+ MB written it seeked (sought? never thought about that...) to track 0 to update block 0 (perhaps the seek back could hide behind the next 3 MB transfer over the bus), that would still add only around 10 ms (short seek plus 1/2 rotation on average) to the roughly 55 ms write time, decreasing throughput only to a little under 30 MB/sec rather than to the 6 MB/sec that you saw.

That's why I suspected that it was destaging dirty data in much smaller chunks. For example, if it destaged 64 KB each time and then missed a full rev before continuing (but stayed on-track rather than going to track 0), the transfer rate would be under 7 MB/sec. But that would be a somewhat brain-damaged way to go about things given today's on-disk cache sizes and controller intelligence, since it could just use multi-buffering plus a smidge more cache space to accept new data continually while it destaged dirty data continually.

> My guess is that it tries to write the blocks roughly in order of age,
> and block 0 is rarely the oldest one it sees because it gets
> overwritten by younger instances all the time.

That explains why block 0 gets updated so infrequently, but not the abysmal transfer rate for the rest of the blocks. (And since block 0 is getting updated on the platters only rarely, that activity would seem to consume only a small percentage of the disk bandwidth, whatever it is.)

>> Have you tested disks to see whether they indeed destage single large
>> transfers out of order (as many claim to do when the write is at least
>> a large percentage of a track in size)?
>
> No. How would you test that?

By issuing continual near-full-track writes to random locations on a zero-filled disk and then pulling the plug a few times, to see whether any of them wound up with a partial write that did not start at the beginning of the request.

- bill
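A sketch of how that test might look - hypothetical throughout; the device path, the 512KB "near-full-track" request size, and the per-sector tagging scheme are all assumptions:

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR   512
#define REQ      (512 * 1024)     /* near-full-track request (~0.58MB tracks) */
#define NSECT    (REQ / SECTOR)
#define DISKSIZE (400ULL << 30)   /* 400GB drive, as in the thread */

int main(void)
{
    char *buf;
    if (posix_memalign((void **)&buf, 4096, REQ)) return 1;

    int fd = open("/dev/sdb", O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    srand(12345);
    for (uint64_t n = 0; ; n++) {
        /* tag every sector with (request number, sector index) so that,
           after pulling the plug, a scan of the zero-filled disk reveals
           any partial write that does not start at sector 0 of its
           request - i.e., an out-of-order destage within one transfer */
        for (unsigned s = 0; s < NSECT; s++) {
            uint64_t tag[2] = { n, s };
            memset(buf + s * SECTOR, 0, SECTOR);
            memcpy(buf + s * SECTOR, tag, sizeof tag);
        }
        uint64_t off = ((uint64_t)rand() % (DISKSIZE / REQ)) * REQ;
        if (pwrite(fd, buf, REQ, off) != REQ) { perror("pwrite"); return 1; }
    }
}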
#6
Out-of-order writing by disk drives
Bill Todd wrote:
> NTFS does attempt to control ordering, at least for critical metadata
> updates, even with write-back caching enabled - but I think only on
> drives that support the force-unit-access flag, which at least until
> somewhat recently many ATA and perhaps even SATA drives did not.

Yes, NTFS is careless about ordering on anything but the logfile, and logfile updates are done using the FUA bit. I don't know offhand how the FUA bit is interpreted by the Windows (S)ATA stack, but, from what I remember, the ATA spec had no analog at all until rather recently. Possibly the Windows (S)ATA stack flushes the whole in-drive cache before completing a FUA request, but this is a wild guess and may be wrong.

--
Maxim S. Shatskih
Windows DDK MVP
http://www.storagecraft.com
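For reference, the knob at the Win32 level looks like this (a sketch; how the (S)ATA stack below translates the flag - real FUA, a full cache flush, or nothing - is exactly the open question above):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* FILE_FLAG_WRITE_THROUGH requests write-through semantics, which
       the storage stack forwards as FUA on drives that support it */
    HANDLE h = CreateFileA("logfile.bin", GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS, FILE_FLAG_WRITE_THROUGH, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed\n");
        return 1;
    }

    const char rec[] = "log record";
    DWORD written;
    if (!WriteFile(h, rec, sizeof rec - 1, &written, NULL)) {
        fprintf(stderr, "WriteFile failed\n");
        return 1;
    }
    CloseHandle(h);
    return 0;
}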
#7
Out-of-order writing by disk drives
Bill Todd writes:
> I may have been too quick to ignore soft updates (AFAIK unique to BSD
> and thus not typical of Unix capabilities in general) and the optional
> behavior of ext3 (again, not generally available in most Unixes AFAIK) -

That "option" is the default for ext3. Concerning other Unix file systems, they at least try to preserve metadata consistency on crashes, and to do that they need guarantees about the order of writes. Journaling file systems need guarantees about the order of journal writes, as well as of journal writes relative to the writes the journal entries describe. And even the bad old BSD FFS performed synchronous metadata writes in order to preserve metadata consistency, and required these writes to happen in order for fsck to be able to recover the metadata: if the writes occurred in order, only one block could be wrong, and fsck relied on that.

> my point was that I didn't think write-back delays (with resulting
> out-of-order writes)

Write-back delays are one thing; out-of-order writes are a very different thing. Delaying the writes means that one loses a few seconds' worth of changes on a crash (which may or may not be acceptable); out-of-order writes can destroy the consistency of the file system.

> of even a few seconds constituted irresponsible behavior on the part
> of disk vendors, given the typical lack of ordering guarantees in the
> systems their disks ran in

IMO, running any file system that contains data worth preserving on a drive in a mode that allows reordering to happen beyond the control of the file system is irresponsible on the part of the sysadmin; and making such behaviour the default is irresponsible on the part of the file system developer. I.e., if the drive offers support for queuing or tagged commands, the file system should use them by default, and if it doesn't, the file system should turn off write caching on the drive by default.

>> My guess (inspired by you) is that it destaged 3064KB at a time. The
>> slow transfer rate is probably a result of doing synchronous writes
>> to the disk buffers ...
>
> Even so, that should result in something close to half the max
> transfer rate

IMO the delays come from latencies in the communication between the user process, various kernel components, the host adapter, and the disk controller, because all of that goes on synchronously (the only asynchronous part is the writing of the data to the platters); therefore I expect the maximum disk write rate to have little influence. I expect that if I double the block size, the transfer rate doubles, until it approaches the maximum transfer rate.

> (it sounds as if all your writes were near the outer edge of the disk,
> so we don't have to worry about varying track sizes).

The non-block-0 writes start in the middle of the device.

>> No. How would you test that?
>
> By issuing continual near-full-track writes to random locations on a
> zero-filled disk and then pulling the plug a few times, to see whether
> any of them wound up with a partial write that did not start at the
> beginning of the request.

That's certainly an interesting test. Maybe next time (another ten years?).

- anton
--
M. Anton Ertl
Some things have to be seen to be believed
Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
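For concreteness, the two knobs under discussion, on a Linux system of that era (device names are placeholders; whether a given drive actually honors the cache setting is, as noted above, part of the problem):

# enable write barriers on an ext3 mount (not the default at the time)
mount -o remount,barrier=1 /dev/sda1 /mnt/data

# or, if the drive offers no usable ordering support, turn off its
# write-back cache entirely (-W1 turns it back on)
hdparm -W0 /dev/sda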
#8
Out-of-order writing by disk drives
Anton Ertl wrote:
> Journaling file systems need guarantees about the order of journal
> writes, as well as of journal writes relative to the writes the
> journal entries describe.

Usually, the update is first written to the journal (and must reach the hard disk media) and only then is it reflected in the actual metadata. In this case, it is enough to use FUA (or something similar emulated on ATA - for instance, a flush of the drive's cache after each such write) on journal writes only.

--
Maxim S. Shatskih
Windows DDK MVP
http://www.storagecraft.com
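A minimal sketch of that ordering discipline at the application level, assuming a POSIX system where fsync() forces the data to the medium (on the drives discussed in this thread, that assumption is precisely what may fail); the file names and record format are made up for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int journal = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    int meta    = open("metadata.db", O_WRONLY | O_CREAT, 0644);
    if (journal < 0 || meta < 0) { perror("open"); return 1; }

    const char rec[] = "set inode 42 size=1024\n";

    /* 1. the intended update goes to the journal first ... */
    if (write(journal, rec, sizeof rec - 1) < 0) { perror("write"); return 1; }

    /* 2. ... and must reach the medium (the FUA/cache-flush step)
       before the in-place update is attempted */
    if (fsync(journal) < 0) { perror("fsync"); return 1; }

    /* 3. only now is the metadata rewritten in place; if power fails
       here, replaying the journal restores consistency */
    if (pwrite(meta, rec, sizeof rec - 1, 0) < 0) { perror("pwrite"); return 1; }
    if (fsync(meta) < 0) { perror("fsync"); return 1; }
    return 0;
}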
#9
Out-of-order writing by disk drives
"Maxim S. Shatskih" writes:
writes. Journaling file systems need guarantees about the order of journal writes as well as journal writes relative to the writes the journal entries describe. Usually, the update is first written to the journal (and must reach the = hard disk media) and only then is reflected in the actual metadata. In this case, it is enough to only use FUA (or some similar thing = emulated on ATA, for instance, drive's cache flush after each such = write) on journal writes. Yes, any feature that ensures partial ordering is sufficient. But using write caching without any such features is not. - anton -- M. Anton Ertl Some things have to be seen to be believed Most things have to be believed to be seen http://www.complang.tuwien.ac.at/anton/home.html |