#11
Big volumes, small files
On 2007-04-21, Pete wrote:
> > I would also try to avoid NFS if possible. Having the clients mount the fs's as GPFS clients (tcp/ip or SAN) will probably be much better, and will avoid the bottlenecks and SPOFs of NFS.
>
> What features does GPFS have that will make it better than NFS? GPFS is a good filesystem with cluster support, but I don't see anything special that will help with the large number of files the OP is trying to deal with.

Comparing GPFS to NFS is a bit apples to oranges, in that one is an access method to an underlying fs, while the other is a real fs. But, besides my own experience that NFS is slow at serving small files, I would point at a few features making it better than NFS for the OP's problem:

With GPFS you have two modes of giving access to disk. Either you give all nodes direct access through the SAN, or you let only a subset of the nodes access it through the SAN and have them serve the disks as (in GPFS speak) Network Shared Disks (NSDs). These NSDs can be accessed directly on the SAN by the nodes that see them there, or they can be accessed through tcp/ip via a node that in turn accesses them on the SAN. A single NSD will typically have a primary and a secondary node serving it, to avoid a SPOF. So already here we have higher availability than NFS, in that there's no single node the client is depending on.

Further, once you have more than one NSD in the same file system, the nodes will typically load-balance the I/O over several NSD-serving nodes. And the NSD-serving nodes won't be busy with file system operations, as I believe the NSDs are more like network block devices, so GPFS has distributed the filesystem operations away from the single NFS server and out to the clients. I would assume this will work especially well for the OP's problem, in that he's mostly doing reads, and so won't have to worry about overloading the lock manager.
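The primary/secondary NSD arrangement described above can be sketched in a few lines. This is a hypothetical illustration, not the GPFS API: the `NSD`, `pick_path`, and node names are all made up, but it shows why a client always has a path to the disk (direct SAN where visible, tcp/ip via the primary server, secondary on failover).

```python
# Hypothetical sketch (NOT the GPFS API): how a client might choose a path
# to a Network Shared Disk (NSD).
from dataclasses import dataclass

@dataclass
class NSD:
    name: str
    primary: str    # primary serving node
    secondary: str  # takes over if the primary fails

def pick_path(nsd: NSD, san_visible: set, up_nodes: set, client: str) -> str:
    if client in san_visible:        # client sees the disk on the SAN itself
        return f"{nsd.name}: direct SAN I/O"
    if nsd.primary in up_nodes:      # normal case: primary serves over tcp/ip
        return f"{nsd.name}: tcp/ip via {nsd.primary}"
    if nsd.secondary in up_nodes:    # failover: secondary takes over, no SPOF
        return f"{nsd.name}: tcp/ip via {nsd.secondary}"
    raise RuntimeError("no path to NSD")

# A file system striped over several NSDs spreads I/O over several servers;
# here nodeA is down, so nsd1 fails over to its secondary, nodeB.
nsds = [NSD("nsd1", "nodeA", "nodeB"), NSD("nsd2", "nodeB", "nodeA")]
for n in nsds:
    print(pick_path(n, san_visible=set(), up_nodes={"nodeB"}, client="client1"))
```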
A random result from Google comparing GPFS/NSD to NFS: http://www.nus.edu.sg/comcen/svu/pub...erformance.pdf But finding benchmark results for small-file I/O is not easy; filesystems seem to be too focused on high-throughput, streaming I/O...

-jf
#12
Big volumes, small files
On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd wrote:

>> ZFS is great in concept and I think they are on the right path, however it's not yet ready for primetime imo.
>
> Though (as I already noted) I don't have any direct experience with it, my impression is that people are using it in production systems successfully - so a description of your specific reservations would be useful.
>
>> The integrated integrity checking is extremely cpu intensive.
>
> I suspect that you're mistaken: IIRC it occurs as part of an already-existing data copy operation at a very low level in the disk read/write routines, and at close to memory-streaming speeds (i.e., mostly using CPU cycles that are being used anyway just to copy the data).

According to Sun, the integrity check and file system self-healing process is a permanent background process as well as the foreground checks you mention. In the case of a system that is completely idle of actual IO the system hung at around 40% performing these consistency checks. When IO is going on it backs off to some extent but it's still a hog. There is no disk IO that is close to memory speeds. The consistency checks and verifications involve checking data on platter.

>> It does not cluster yet, at least not as of 2 weeks ago.
>
> It was not clear that this was a requirement in this case - but since the OP mentioned clustering, I mentioned the soon-to-arrive capability.

Soon-to-arrive means 1.0. It's worth noting points like that. While ZFS is great in design it is still new.

>> Many file systems grow dynamically so I would make that a check in ZFS's column.
>
> I'm not sure they grow dynamically quite as painlessly as ZFS does: usually, you first have to arrange to expand the underlying disk storage at the volume-manager level, and then have to incorporate the increase in volume size into the file system.

It depends on the system, but these days those tasks are fairly simple. ZFS gets this extreme ease of use by not having a RAID controller between itself and the disks, which means a JBOD (not everyone is keen on that yet). If you put a RAID controller between them then Sun recommends turning off the consistency checking. A lot of what ZFS is depends on direct control of blocks. No practical TB limit is a win if you need to go beyond 16 TB in a single FS.

>> I'm not sure I see how snapshots or journaling helps with backups.
>
> I should have added the word 'respectively', I guess: journaling helps avoid the need for fsck, and snapshots help expedite backups (by avoiding any need for down-time while making them).
>
>> True, but my example of the NetApp filer demonstrates that just because you don't need downtime to do the backup it is still extremely painful in an environment like what the OP describes. It still has to map blocks to files, which is the long part of a backup. I know when NetApp backups occur it takes the snapshot and then tries to do a dump. If you have millions of files it can be hours before data is actually transferred; I believe ZFS is no different.
>
> Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for WAFL IIRC, though if WAFL does a good job of defragmenting files the difference may not be too substantial). With the OP's 100 KB file sizes, this means that each file can be accessed (backed up) with a single disk access, yielding a fairly respectable backup bandwidth of about 6 MB/sec (assuming that such an access takes about 16 ms. for a 7200 rpm drive, including transfer time, and that the associated directory accesses can be batched during the scan).

It's not the transfer I was referring to but rather the mapping (phase I and II of a dump). I believe ZFS still has to map the files to blocks even if it's a one-to-one ratio. At millions of files this can be painful. Once those phases are done the transfer rates are probably full pipe.

Also, in the 100 KB file to 128 KB block ratio you lose what, 20% of your capacity? Big trade-off in some environments.

>> Since the OP's IO pattern is mostly reads the cpu load may not be an issue but writes suffer a serious penalty if you are not cpu-rich.
>
> I'm not sure why that would be the case even if the integrity-checking *were* CPU-intensive, since the overhead to check the integrity on a read should be just about the same as the overhead to generate the checksum on a write. True, one must generate it all the way back up to the system superblock for a write (one reason why I prefer a log-oriented implementation that can defer and consolidate such activity), but below the root unless you've got many of the intermediate-level blocks cached you have to access and validate them on each read (and with on the order of a billion files, my guess is that needed directory data will quite frequently not be cached).

In this case ZFS would also be doing the raid. If you're using a raid controller the rules change, as do the features.

~F
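A quick sanity check of the two figures traded in this exchange (my arithmetic, using only the assumptions stated in the thread: one ~16 ms disk access per 100 KB file, and one 128 KB block per 100 KB file):

```python
# Back-of-the-envelope check of the figures above - assumptions, not measurements.
file_kb = 100
access_ms = 16.0  # seek + rotational latency + transfer, 7200 rpm drive

# One disk access per file -> backup bandwidth:
backup_mb_per_s = (file_kb / 1024) / (access_ms / 1000)
print(f"backup bandwidth ~ {backup_mb_per_s:.1f} MB/s")  # ~6.1 MB/s

# A 100 KB file occupying a single 128 KB block wastes the remainder:
block_kb = 128
wasted = 1 - file_kb / block_kb
print(f"capacity lost ~ {wasted:.0%}")  # ~22%
```

So both numbers hold up: roughly 6 MB/sec of backup bandwidth, and a bit over 20% of raw capacity lost to internal fragmentation at that file-to-block ratio.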
#13
Big volumes, small files
Faeandar wrote:
> On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd wrote:
>
>>> ZFS is great in concept and I think they are on the right path, however it's not yet ready for primetime imo.
>>
>> Though (as I already noted) I don't have any direct experience with it, my impression is that people are using it in production systems successfully - so a description of your specific reservations would be useful.
>>
>>> The integrated integrity checking is extremely cpu intensive.
>>
>> I suspect that you're mistaken: IIRC it occurs as part of an already-existing data copy operation at a very low level in the disk read/write routines, and at close to memory-streaming speeds (i.e., mostly using CPU cycles that are being used anyway just to copy the data).
>
> According to Sun, the integrity check and file system self-healing process is a permanent background process as well as the foreground checks you mention.

Yes, but there's no reason for that to take up very much in the way of resources (e.g., the last study I saw in this area indicated that a full integrity sweep once every couple of months was more than adequate to cut the incidence of latent errors - unnoticed corruption that jumps up to bite you after the *good* copy dies - down by at least an order of magnitude).

> In the case of a system that is completely idle of actual IO the system hung at around 40% performing these consistency checks.

That's a ridiculous amount to use as the default (well, at least for production software - if they're still using pure idle time heavily to reassure customers due to ZFS's newness that might explain it), and I would be very surprised if it weren't at least tunable to a much lesser amount.

> When IO is going on it backs off to some extent but it's still a hog. There is no disk IO that is close to memory speeds. The consistency checks and verifications involve checking data on platter.

Of course they do, and I never suggested otherwise. What can move at close to memory speeds is the *CPU* overhead involved in the checks, and it can piggyback on a memory-to-memory data move that is happening anyway (such that few *extra* CPU cycles beyond what would already be consumed in the move are required).

>>> It does not cluster yet, at least not as of 2 weeks ago.
>>
>> It was not clear that this was a requirement in this case - but since the OP mentioned clustering, I mentioned the soon-to-arrive capability.
>
> Soon-to-arrive means 1.0. It's worth noting points like that. While ZFS is great in design it is still new.

Everything starts off new. The question is when a product becomes usable in production, and that's something that's measured far more by customer experience than by a clock. My impression is that *some* customers have workloads that have found ZFS to be very stable already, while others push corner cases that are still uncovering bugs (I haven't heard of any for a while that involve actual data corruption, but I haven't been paying close attention, either).

>>> Many file systems grow dynamically so I would make that a check in ZFS's column.
>>
>> I'm not sure they grow dynamically quite as painlessly as ZFS does: usually, you first have to arrange to expand the underlying disk storage at the volume-manager level, and then have to incorporate the increase in volume size into the file system.
>
> It depends on the system, but these days those tasks are fairly simple. ZFS gets this extreme ease of use by not having a RAID controller between itself and the disks, which means a JBOD (not everyone is keen on that yet).

Their loss, unless they need the raw single-operation low-latency write-through performance that NVRAM hardware assist can give to a hardware RAID box.

....

>>> It still has to map blocks to files, which is the long part of a backup. I know when NetApp backups occur it takes the snapshot and then tries to do a dump. If you have millions of files it can be hours before data is actually transferred; I believe ZFS is no different.
>>
>> Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for WAFL IIRC, though if WAFL does a good job of defragmenting files the difference may not be too substantial). With the OP's 100 KB file sizes, this means that each file can be accessed (backed up) with a single disk access, yielding a fairly respectable backup bandwidth of about 6 MB/sec (assuming that such an access takes about 16 ms. for a 7200 rpm drive, including transfer time, and that the associated directory accesses can be batched during the scan).
>
> It's not the transfer I was referring to but rather the mapping (phase I and II of a dump). I believe ZFS still has to map the files to blocks even if it's a one-to-one ratio.

The one-to-one ratio is what makes the difference (at least in this particular case, and even in general the ratio is considerably better than a non-extent-based file system that uses a 4 KB block size).

> At millions of files this can be painful.

Not with ZFS in this instance, unless one constructs a pathological case with a deep directory structure and only one or two files mapped per deep path traversal: otherwise, the mapping can proceed at less than one mapping access per 100 KB file (if each leaf directory has multiple files to be mapped), plus the eventual transfer access itself.

> Once those phases are done the transfer rates are probably full pipe. Also, in the 100 KB file to 128 KB block ratio you lose what, 20% of your capacity? Big trade-off in some environments.

But likely not in this one: it's just not that large a system, nor are the disks very expensive if they're SATA.

>>> Since the OP's IO pattern is mostly reads the cpu load may not be an issue but writes suffer a serious penalty if you are not cpu-rich.
>>
>> I'm not sure why that would be the case even if the integrity-checking *were* CPU-intensive, since the overhead to check the integrity on a read should be just about the same as the overhead to generate the checksum on a write. True, one must generate it all the way back up to the system superblock for a write (one reason why I prefer a log-oriented implementation that can defer and consolidate such activity), but below the root unless you've got many of the intermediate-level blocks cached you have to access and validate them on each read (and with on the order of a billion files, my guess is that needed directory data will quite frequently not be cached).
>
> In this case ZFS would also be doing the raid. If you're using a raid controller the rules change, as do the features.

I have no idea how your comment is meant to relate to the material it's responding to above.

- bill
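The disagreement over the dump's mapping phases comes down to how many files get mapped per metadata disk access. A rough model (my own assumptions - 16 ms per metadata access, file counts chosen for illustration - not figures from the thread) shows why batching matters so much:

```python
# Rough model of dump phase I/II time: every metadata access costs a disk
# seek, so the number of files mapped per leaf-directory access dominates.
def mapping_hours(n_files: int, files_per_access: float, access_ms: float = 16.0) -> float:
    accesses = n_files / files_per_access
    return accesses * access_ms / 1000 / 3600  # ms -> hours

# 10 million files: pathological (1 file/access) vs. well-batched directories.
for batch in (1, 10, 100):
    print(f"{batch:>3} files/access -> {mapping_hours(10_000_000, batch):.1f} h")
```

At one file per access the mapping alone takes about 44 hours, matching the "hours before data is actually transferred" complaint; with a hundred files mapped per leaf-directory access it drops under half an hour, which is the scenario Bill's "less than one mapping access per file" argument assumes.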