A computer components & hardware forum. HardwareBanter



Disk optimization for multithreaded app.



 
 
  #1  
Old February 9th 10, 07:03 PM posted to comp.arch.storage
I Understand

Mine is a disk-based backup/restore product. It is multithreaded,
meaning many backups and restores can be running simultaneously.
The product has three basic types of interaction with the disk.



1. Read. This is synchronous. Most reads happen on mounted
shadow copies (using VSS).

2. Write. FILE_FLAG_WRITE_THROUGH is used since data integrity
is critical to the product (a minimal sketch of the open call follows this list).

3. Compute checksums on the files.
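
For reference, the write path opens its target file roughly like this (the path is a placeholder, not our real code):

#include <windows.h>

/* Rough sketch of our write-side open: WriteFile on this handle does not
   return until the data has been handed to the device. */
HANDLE OpenBackupTarget(void)
{
    HANDLE h = CreateFileW(L"D:\\backup\\archive.dat",
                           GENERIC_WRITE,
                           0,                    /* no sharing while writing */
                           NULL,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH,
                           NULL);
    /* caller checks for INVALID_HANDLE_VALUE and GetLastError() */
    return h;
}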



Each of the above operations is done in its own dedicated thread,
and currently any number of threads can run at any point in time,
performing any of the operations in parallel. Each operation targets
exactly one LUN. The size of the data varies from a few KB to
gigabytes. Also, we may suggest that clients optimize their disk
architecture (use RAID, faster disks, etc.), but clients may choose
to ignore those suggestions.



My questions are:

1. We have noticed that when we run a Read thread and a compute-
checksum thread in parallel against the same disk/spindle, disk
throughput degrades considerably, and serializing the two threads gives
us much higher overall throughput. Is this behavior expected?

2. If the behavior in 1 is expected, then how should we
serialize/parallelize the three kinds of threads to achieve optimal
performance?

3. Are there Windows APIs to determine the spindles we are
working against and throttle our threads accordingly to achieve
optimal performance?

4. Any pointers to case studies, experiments or white papers/
research papers by folks who have done this before?

  #2  
Old February 10th 10, 11:28 AM posted to comp.arch.storage
Bill Todd

On 2/9/2010 1:03 PM, I Understand wrote:
Mine is a disk based backup/restore product. It is multithreaded,
which means many backups and restores are happening simultaneously.
The product has basically three types of interactions with the disk.



1. Read. This is synchronous. Most reads happen on mounted
shadow copies (using VSS).

2. Write. FILE_FLAG_WRITE_THROUGH is used since data integrity
is critical to the product.


You probably don't much care when individual writes complete as long as
they DO complete (and if they don't complete, you're likely no better
off for knowing there aren't any holes in what was actually written: if
you don't have ALL of the file it's worthless, and if you can't fix an
error you get, it's likely because the disk is dead, dead, dead).

If I recall correctly, NT and its descendants (unlike Unix) flush dirty
data to disk when a file is closed and report any error then. You might
be able to get around this by enabling the disks' write-back caches and
then issuing explicit commands to flush those caches to disk at the end
of your backup (in which case you DEFINITELY wouldn't want to use
FILE_FLAG_WRITE_THROUGH, since you'd want the small writes to accumulate
until the drive had a goodly number to sort to optimize their movement
to the platters). Flushing many unwritten small buffers (from multiple
threads) together, or in quick sequence using command queuing, also lets
the disk optimize their progress to the platters even without using the
disk's write-back cache (I don't know whether FILE_FLAG_WRITE_THROUGH
would affect this approach, but why chance it?).
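
As a rough sketch of that "let the small writes accumulate, then flush explicitly" idea (file name and error handling are placeholders; FlushFileBuffers is used here as the explicit flush, which as far as I know also asks the drive to empty its own write-back cache):

#include <windows.h>

/* Sketch only: open WITHOUT FILE_FLAG_WRITE_THROUGH so small writes just
   land in the cache, then force everything to the medium once at the end. */
int WriteSmallPiecesThenFlush(const void *pieces[], const DWORD sizes[], int count)
{
    HANDLE h = CreateFileW(L"D:\\backup\\archive.dat", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return -1;

    for (int i = 0; i < count; ++i) {
        DWORD written = 0;
        /* cheap: the data just lands in the file-system cache */
        if (!WriteFile(h, pieces[i], sizes[i], &written, NULL)) {
            CloseHandle(h);
            return -1;
        }
    }

    /* One explicit flush at the end; any deferred write error surfaces here. */
    BOOL ok = FlushFileBuffers(h);
    CloseHandle(h);
    return ok ? 0 : -1;
}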

For files under a few MB in size, just write out the entire file in one
request. For larger files, use multi-MB buffers to maximize disk
utilization efficiency (you don't much care whether writing a multi-MB
buffer requires an initial seek because the overhead of that seek is
eclipsed by the transfer time) and DO write them through to disk (to
leave the disk's cache freer to be used to aggregate smaller writes for
queue optimization). If the 'disk' may in fact be a RAID, you might
consider submitting many large buffer write requests in parallel for a
single file (i.e., asynchronously from a single thread) in order to
allow as many disks as possible in the RAID to be working on them
simultaneously.
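
For illustration only, a bare-bones way to keep several multi-MB writes in flight from one thread (chunk size and queue depth are arbitrary here, the source data is assumed to already be in memory, and per-request error checks via GetOverlappedResult are omitted):

#include <windows.h>

#define CHUNK (8 * 1024 * 1024)   /* 8 MB per request - arbitrary */
#define DEPTH 4                   /* requests kept outstanding at once */

/* Assumes hFile was opened with FILE_FLAG_OVERLAPPED (and, as suggested
   above, FILE_FLAG_WRITE_THROUGH for these large transfers). */
BOOL WriteLargeFile(HANDLE hFile, const BYTE *data, unsigned long long total)
{
    OVERLAPPED ov[DEPTH] = {0};
    HANDLE     ev[DEPTH];
    for (int i = 0; i < DEPTH; ++i)
        ev[i] = ov[i].hEvent = CreateEventW(NULL, TRUE, TRUE, NULL);

    unsigned long long offset = 0;
    while (offset < total) {
        /* grab a free (or just-completed) slot and reuse it */
        int slot = (int)(WaitForMultipleObjects(DEPTH, ev, FALSE, INFINITE)
                         - WAIT_OBJECT_0);
        DWORD len = (DWORD)(total - offset < CHUNK ? total - offset : CHUNK);
        ResetEvent(ev[slot]);
        ov[slot].Offset     = (DWORD)(offset & 0xFFFFFFFF);
        ov[slot].OffsetHigh = (DWORD)(offset >> 32);
        if (!WriteFile(hFile, data + offset, len, NULL, &ov[slot]) &&
            GetLastError() != ERROR_IO_PENDING)
            return FALSE;
        offset += len;
    }

    WaitForMultipleObjects(DEPTH, ev, TRUE, INFINITE);   /* drain */
    for (int i = 0; i < DEPTH; ++i)
        CloseHandle(ev[i]);
    return TRUE;
}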

Preallocate any output file to its final size to avoid the overhead of
multiple intermediate allocations (and to maximize the probability that
it will be laid out contiguously on disk). I think you used to need to
use the Ntxxx native system calls to do this rather than documented
Win32 functions, but that may have changed (make sure that whatever you
do doesn't zero-fill the new file on disk before you populate it, though).


3. Compute checksums on the files.



Each of the above operations are done in their own dedicated threads
and currently any number of threads can be run at any point in time
doing any of the operations in parallel. Each operation targets
exactly one LUN. The size of the data varies from few KBs to Giga
bytes. Also, we may suggest clients to optimize their disk
architecture (use RAID, faster disks, etc.) but the clients may choose
to ignore suggestions.



My questions are,

1. We have noticed that when we run a Read thread and a compute
checksum thread in parallel against the same disk/spindle the disk
throughput degrades considerably and serializing the two threads gives
us much higher throughput overall. Is this behavior expected?


It is if you're processing large files and using small buffers. As for
writes above, use multi-MB buffers to read or checksum large files, so
that the data transfer times will overshadow any required seek activity
(then it won't matter nearly as much if multiple active threads
ping-pong between files on the same disk).
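
As a concrete (if simplistic) sketch of what "multi-MB buffers" means here - the additive sum is just a stand-in for whatever checksum the product really computes:

#include <windows.h>

/* Checksum a file using one large (multi-MB) read buffer so the transfer
   time dwarfs any seeks caused by other threads hitting the same spindle. */
unsigned long ChecksumFile(const wchar_t *path)
{
    const DWORD BUFSZ = 8 * 1024 * 1024;   /* 8 MB per read - arbitrary */
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 0;

    BYTE *buf = (BYTE *)VirtualAlloc(NULL, BUFSZ, MEM_COMMIT, PAGE_READWRITE);
    unsigned long sum = 0;
    DWORD got = 0;
    while (buf && ReadFile(h, buf, BUFSZ, &got, NULL) && got > 0)
        for (DWORD i = 0; i < got; ++i)
            sum += buf[i];                 /* placeholder "checksum" */

    if (buf)
        VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return sum;
}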


2. If the behavior in 1. is expected then how should we
serialize/parallelize the three kinds of threads to achieve optimal
performance.


See above. You could just serialize ALL activity to a single drive, but
for small files it would be better to batch up small requests (again, as
for the writes above) such that the disk's queue-optimization facilities
could run them more efficiently than anything you could do manually
would. Besides, if that 'drive' is actually a RAID array, having many
requests outstanding will allow many drives in the array to be working
in parallel - so you might want to be able to submit large requests in
parallel as well, even from a single thread for the same file (in which
case reads would need to be asynchronous as well).

(By the way, the best way to achieve good throughput in such a RAID
array is to use per-disk chunk sizes in the multi-MB range - as long as
you've got enough large-file threads going at once or you extend your
batching to include even the multi-MB large file buffers such that
multiple drives can be servicing a single large file at once.)


3. Are there Windows APIs to determine the spindles we are
working against and throttle our threads accordingly to achieve
optimal performance?


Probably, but if you handle things as described above you shouldn't need
them: the large transfers will operate efficiently regardless of how
they're spread out over the disks, and if you batch up small requests
(both reads and writes) in parallel then the disk queue-optimization
mechanisms will have the opportunity to execute them with maximal
efficiency - again, regardless of how they're spread across the drives.


4. Any pointers to case studies, experiments or white papers/
research papers by folks who have done this before?


That's the kind of research that one usually does BEFORE creating a
product. But it sounds as if you need a much better understanding of
how disks work (and can best be used) first.

- bill
  #3  
Old February 10th 10, 12:50 PM posted to comp.arch.storage
Maxim S. Shatskih[_2_]

If I recall correctly, NT and its descendants (unlike Unix) flush dirty
data to disk when a file is closed and report any error then.


No.

Nothing occurs when the file is closed. Flushing happens later, via the lazy writer. Just copy a huge file in the Windows shell and watch the flushing activity after the copy is reported as done.

Preallocate any output file to its final size to avoid the overhead of
multiple intermediate allocations (and to maximize the probability that
it will be laid out contiguously on disk). I think you used to need to
use the Ntxxx native system calls


This will spend lots of time zeroing the newly allocated file. Not a way to make things faster.

--
Maxim S. Shatskih
Windows DDK MVP

http://www.storagecraft.com

  #4  
Old February 11th 10, 07:19 AM posted to comp.arch.storage
Bill Todd

On 2/10/2010 6:50 AM, Maxim S. Shatskih wrote:
If I recall correctly, NT and its descendants (unlike Unix) flush dirty
data to disk when a file is closed and report any error then.


No.

Nothing occurs when the file is closed. Flushes go after this by the lazy writer. Just copy a huge file in a Windows shell and look at flushing activity after the copy is reported done.


Thanks - as I suggested, I wasn't certain; now that I think of it I
suspect that this is where 'delayed write failed' errors come from,
since there's no longer any application action with which to associate
them. The main thing that's required is a way to know when all such
flushing has completed so that you know when the backup has finished
successfully: tracking tagged commands is one way, explicit flushing on
Close is another, writing through to a disk's enabled write-back cache
and then explicitly flushing it later is a third (as long as Windows
doesn't interpret write-through as implying a disk cache flush), and
Unix-style system-wide sync calls are a fourth (but I don't think
Windows offers them).


Preallocate any output file to its final size to avoid the overhead of
multiple intermediate allocations (and to maximize the probability that
it will be laid out contiguously on disk). I think you used to need to
use the Ntxxx native system calls


This will spend lots of time zeroing the newly allocated file. Not a way to make things faster.


That's why I explicitly said NOT to do it that way. There at least used
to be an NTxxx Create function that accepted a preallocated size
parameter that did not result in zeroing out the file. I think that
later on MS added a way to do this without having to resort to
undocumented NTxxx functions, but can't remember the details.

Not having a way to preallocate a file without zeroing it out would be
really, really dumb (yes, the early documented interface was really,
really dumb, but at least they provided an undocumented mechanism to fix
that). I could imagine that some zeroing activity would still be
required if you populated the preallocated space out of sequence,
though, given how 'high water marking' works.

- bill

  #5  
Old February 11th 10, 11:43 AM posted to comp.arch.storage
Maxim S. Shatskih[_2_]

Unix-style system-wide sync calls are a fourth (but I don't think
Windows offers them).


It does: FlushFileBuffers is fsync().

If you open the volume - like \\.\c: - and do FlushFileBuffers on that handle, you get a total volume flush, metadata included.
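
Roughly (the drive letter is only an example, and opening the volume needs administrative rights):

#include <windows.h>

/* Sketch: open the volume itself and flush it - data and metadata. */
BOOL FlushWholeVolume(void)
{
    HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_WRITE,
                              FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                              OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE)
        return FALSE;

    BOOL ok = FlushFileBuffers(hVol);
    CloseHandle(hVol);
    return ok;
}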

That's why I explicitly said NOT to do it that way. There at least used
to be an NTxxx Create function that accepted a preallocated size
parameter that did not result in zeroing out the file.


ZwCreateFile with AllocationSize provided.

Well, maybe. NTFS has an on-disk ValidDataLength, so zeroing is not mandatory. For FAT, it surely is.

I think it is a good idea to try it. Some people told me once that this kind of creation _starts a background zeroing procedure_.
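
A user-mode sketch of what that might look like (calling NtCreateFile out of ntdll, which has the same signature as ZwCreateFile and is declared in winternl.h in recent SDKs; the path, flags and fallback #defines are illustrative only, and you link with ntdll.lib):

#include <windows.h>
#include <winternl.h>

#ifndef FILE_OVERWRITE_IF            /* DDK values, in case winternl.h lacks them */
#define FILE_OVERWRITE_IF            0x00000005
#endif
#ifndef FILE_NON_DIRECTORY_FILE
#define FILE_NON_DIRECTORY_FILE      0x00000040
#endif
#ifndef FILE_SYNCHRONOUS_IO_NONALERT
#define FILE_SYNCHRONOUS_IO_NONALERT 0x00000020
#endif
#ifndef OBJ_CASE_INSENSITIVE
#define OBJ_CASE_INSENSITIVE         0x00000040
#endif

/* Create the file with its AllocationSize set up front (placeholder path). */
HANDLE CreatePreallocated(unsigned long long bytes)
{
    UNICODE_STRING    name;
    OBJECT_ATTRIBUTES oa;
    IO_STATUS_BLOCK   iosb;
    LARGE_INTEGER     alloc;
    HANDLE            h = NULL;

    RtlInitUnicodeString(&name, L"\\??\\D:\\backup\\archive.dat");  /* NT path */
    InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE, NULL, NULL);
    alloc.QuadPart = (LONGLONG)bytes;

    NTSTATUS st = NtCreateFile(&h, GENERIC_WRITE | SYNCHRONIZE, &oa, &iosb,
                               &alloc, FILE_ATTRIBUTE_NORMAL, 0,
                               FILE_OVERWRITE_IF,
                               FILE_NON_DIRECTORY_FILE |
                               FILE_SYNCHRONOUS_IO_NONALERT,
                               NULL, 0);
    return (st >= 0) ? h : INVALID_HANDLE_VALUE;
}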

undocumented NTxxx functions, but can't remember the details.


ZwCreateFile is documented for kernel mode.

Also note that MS sometimes documents previously undocumented functions - like PsLookupProcessByProcessId, which was undocumented but worked fine from NT4 through Srv2003, and was documented in the Vista timeframe.

--
Maxim S. Shatskih
Windows DDK MVP

http://www.storagecraft.com

  #6  
Old February 11th 10, 11:23 PM posted to comp.arch.storage
[email protected]

On Feb 11, 12:19 am, Bill Todd wrote:
That's why I explicitly said NOT to do it that way. There at least used
to be an NTxxx Create function that accepted a preallocated size
parameter that did not result in zeroing out the file. I think that
later on MS added a way to do this without having to resort to
undocumented NTxxx functions, but can't remember the details.

Not having a way to preallocate a file without zeroing it out would be
really, really dumb (yes, the early documented interface was really,
really dumb, but at least they provided an undocumented mechanism to fix
that). I could imagine that some zeroing activity would still be
required if you populated the preallocated space out of sequence,
though, given how 'high water marking' works.



On NTFS, you can extend a file's allocation by seeking to the size you
want with SetFilePointer, and then doing a SetEndOfFile.

This will truncate a file if you've not gone past the end.

If you have gone past the end, physical space will be allocated, but
not cleared. The MFT entry for the file on NTFS includes a limit for
how much of the allocated space is valid. You *can* seek into that
space and it will read as zeros. If you write into that space, the
area between the current valid limit and the position where you write
will be zeroed at that point, but not the space past where you write.

Note that sparse files are different, and are usually created by
seeking past the end and then just writing.

You can muck with the zeroing some with SetFileValidData, which can
prevent zeroing even on non-contiguous writes in some cases, exposing
the old data in the allocated clusters. As you might expect, most
user accounts don't have the SE_MANAGE_VOLUME_PRIVILEGE required to
use SetFileValidData.

Again this is NTFS only, and in many cases only on *local* NTFS
drives. If you do SetFilePointer/SetEndOfFile to grow a file on FAT,
for example, it will zero the space (since there's no concept of a
"valid" limit for an allocation in FAT).

NTFS also does anticipatory allocations if you're writing a file
sequentially, and tends to attempt to make an allocation several times
the size of the allocation required by your write, which it will
attempt to physically allocate in as few pieces as possible. The file
is physically truncated back down when it's closed. The exact
algorithm has changed several times. In Win2K it basically tried to
preallocate 16 times the space that was required to complete the write
(IOW, if the write required two additional clusters beyond the current
preallocation, NTFS would preallocate 32 additional clusters), up to a
limit of 1/1024th of the free space. In WinXP it became an
exponentially growing function (the first allocation would be done as-
is, the second doubled, the third quadrupled, up to some limit), again
with some other limits and whatnot factored in. Anyway, some semi-
useful documentation in an absolutely horrible format (an executable
self-extracting compressed Word document - ugh):

http://support.microsoft.com/kb/841551


  #7  
Old February 12th 10, 12:26 AM posted to comp.arch.storage
Bill Todd

On 2/11/2010 5:43 AM, Maxim S. Shatskih wrote:
Unix-style system-wide sync calls are a fourth (but I don't think
Windows offers them).


It does, FlushFileBuffers is fsync().

If you open the volume - like \\.\c: and do FlushFileBuffers on this handle - this is a total volume flush, metadata included.


Ah - that gets around the limitation that I thought existed (that
FlushFileBuffers only applied to a single, already-accessed file). Not
exactly an obvious work-around, though - especially given that Windows
doesn't usually adopt the Unix "EVERYTHING can be accessed like a file"
philosophy.

Also, I suspect that opening a volume requires somewhat more privilege
than might otherwise be required to perform backups that don't need to
access otherwise inaccessible data (e.g., those performed by a single
user on their own data).


That's why I explicitly said NOT to do it that way. There at least used
to be an NTxxx Create function that accepted a preallocated size
parameter that did not result in zeroing out the file.


ZwCreateFile with AllocationSize provided.

Well, maybe. NTFS has on-disk ValidDataLength, so, zeroing is not mandatory. For FAT, it is surely mandatory.

I think it is a good idea to try it. Some people told me once that this kind of creation _starts a background zeroing procedure_.

undocumented NTxxx functions, but can't remember the details.


ZwCreateFile is documented for kernel mode.


But its user-mode NtCreateFile equivalent is not, last I knew.

Thanks for the added details,

- bill
  #8  
Old February 12th 10, 12:34 AM posted to comp.arch.storage
Bill Todd

On 2/11/2010 5:23 PM, wrote:
[full quote of the previous post snipped]

Thanks - what you describe above I once knew as the supported way to
accomplish preallocation (which I don't think originally existed in NT)
but had forgotten (at least in any detail - that's why I alluded to it
only vaguely above). Now I feel lazy for not having taken the time to
rediscover it.

- bill


  #9  
Old February 12th 10, 05:30 AM posted to comp.arch.storage
[email protected]

On Feb 11, 5:34 pm, Bill Todd wrote:
Thanks - what you describe above I once knew as the supported way to
accomplish preallocation (which I don't think originally existed in NT)
but had forgotten (at least in any detail - that's why I alluded to it
only vaguely above). Now I feel lazy for not having taken the time to
rediscover it.



Well, SetEndOfFile dates back to NT 3.1, and the documentation (I
actually have a copy of the Win32 API hardcopy doc from that era) is
similar to the current versions. SetFileValidData is definitely newer
than that, though.

Interestingly, both the old and new versions describe the data in the
extended area as being undefined.
 



