Big volumes, small files



 
 
#11
April 22nd 07, 11:26 AM, posted to comp.arch.storage
Jan-Frode Myklebust

On 2007-04-21, Pete wrote:

I would also try to avoid NFS if possible. Having the clients mount the
fs's as GPFS clients (tcpip or SAN) will probably be much better, and
will avoid the bottlenecks and SPOFs of NFS.


What features does GPFS have that will make it better than NFS? GPFS is
a good filesystem with cluster support, but I don't see anything special
that will help with the large numbers of files the OP is trying to deal
with.


Comparing GPFS to NFS is a bit apples to oranges, in that one is an access
method to an underlying fs while the other is a real fs. But, besides my
own experience that NFS is slow at serving small files, I would point
to a few features making it better than NFS for the OP's problem:

With GPFS you have two modes of giving access to disk. Either you give
all nodes direct access through the SAN, or you let only a subset of the nodes
access it through the SAN and have them serve the disks as (in GPFS speak) Network
Shared Disks (NSDs). These NSDs can be accessed directly on the SAN by the
nodes that see them there, or they can be accessed over tcp/ip via a node
that in turn can access them on the SAN. A single NSD will typically have
a primary and a secondary node serving it, to avoid a SPOF. So already here
we have higher availability than NFS, in that there's no single node
the client depends on.

Further, once you have more than one NSD in the same file system, the nodes
will typically load balance the I/O over several NSD-serving nodes.
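
To make that concrete, here's a toy sketch (Python, purely illustrative; the
node and NSD names are invented and this is not any real GPFS interface) of
how a client might pick which serving node to use for each NSD:

# Toy model only -- NOT the GPFS API.  Each NSD has a primary and a
# secondary serving node; losing one node just reroutes that NSD's traffic,
# and different NSDs having different primaries spreads the load.
nsds = {
    "nsd1": ["nodeA", "nodeB"],   # primary, secondary (invented names)
    "nsd2": ["nodeB", "nodeA"],
    "nsd3": ["nodeC", "nodeA"],
}

failed_nodes = {"nodeA"}          # pretend nodeA just crashed

def server_for(nsd_name):
    # pick the first live server; a client with direct SAN access to the
    # LUN would bypass this and read the disk itself
    for node in nsds[nsd_name]:
        if node not in failed_nodes:
            return node
    raise RuntimeError("no live NSD server for " + nsd_name)

for nsd in nsds:
    print(nsd, "->", server_for(nsd))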

Further, the NSD-serving nodes won't be busy with file system operations,
as I believe the NSDs are more like network block devices, so GPFS has
distributed the filesystem operations away from the single NFS server
and out to the clients. I would assume this will work especially well for
the OP's problem, in that he's mostly doing reads and so won't have
to worry about overloading the lock manager.

A random result from Google comparing GPFS/NSD to NFS:
http://www.nus.edu.sg/comcen/svu/pub...erformance.pdf

But finding benchmark results for small-file I/O is not easy; file systems seem
to be too focused on high-throughput, streaming I/O.


-jf
#12
April 25th 07, 01:17 AM, posted to comp.arch.storage
Faeandar

On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd wrote:

Faeandar wrote:

...

ZFS is great in concept and I think they are on the right path,
however it's not yet ready for primetime imo.


Though (as I already noted) I don't have any direct experience with it,
my impression is that people are using it in production systems
successfully - so a description of your specific reservations would be
useful.


The integrated integrity checking is extremely cpu intensive.


I suspect that you're mistaken: IIRC it occurs as part of an
already-existing data copy operation at a very low level in the disk
read/write routines, and at close to memory-streaming speeds (i.e.,
mostly using CPU cycles that are being used anyway just to copy the data).


According to Sun, the integrity check and file system self-healing
process is a permanent background process as well as the foreground
checks you mention. In the case of a system that is completely idle
of actual IO, the system hung at around 40% CPU performing these
consistency checks. When IO is going on it backs off to some extent
but it's still a hog.
There is no disk IO that is close to memory speeds. The consistency
checks and verifications involve checking data on platter.


It does
not cluster yet, at least not as of 2 weeks ago.


It was not clear that this was a requirement in this case - but since
the OP mentioned clustering, I mentioned the soon-to-arrive capability.


Soon-to-arrive means 1.0. It's worth noting points like that. While
ZFS is great in design it is still new.


Many file systems
grow dynamically so I would make that a check in ZFS's column.


I'm not sure they grow dynamically quite as painlessly as ZFS does:
usually, you first have to arrange to expand the underlying disk storage
at the volume-manager level, and then have to incorporate the increase
in volume size into the file system.


It depends on the system, but these days those tasks are fairly
simple. ZFS gets this extreme ease of use by not having a RAID
controller between itself and the disks, which means a jbod (not
everyone is keen on that yet). If you put a raid controller between
them, then Sun recommends turning off the consistency checking. A lot
of what ZFS is depends on direct control of blocks.


No
practical TB limit is a win if you need to go beyond 16TB in a single
FS.
I'm not sure I see how snapshots or journaling helps with backups.


I should have added the word 'respectively', I guess: journaling helps
avoid the need for fsck, and snapshots help expedite backups (by
avoiding any need for down-time while making them).


True, but my example of the NetApp filer demonstrates that even though
you don't need downtime to do the backup, it is still extremely
painful in an environment like the one the OP describes.


It
still has to map blocks to files, which is the long part of a backup.
I know when NetApp backups occur it takes the snapshot and then tries
to do a dump. If you have millions of files it can be hours before
data is actually transferred, I believe ZFS is no different.


Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
WAFL IIRC, though if WAFL does a good job of defragmenting files the
difference may not be too substantial). With the OP's 100 KB file
sizes, this means that each file can be accessed (backed up) with a
single disk access, yielding a fairly respectable backup bandwidth of
about 6 MB/sec (assuming that such an access takes about 16 ms. for a
7200 rpm drive, including transfer time, and that the associated
directory accesses can be batched during the scan).


It's not the transfer I was referring to but rather the mapping (phases
I and II of a dump). I believe ZFS still has to map the files to
blocks even if it's a one-to-one ratio. At millions of files this can
be painful. Once those phases are done, the transfer rates are
probably full pipe.
Also, in the 100 KB file to 128 KB block ratio you lose what, 20% of
your capacity? Big trade-off in some environments.
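
Back-of-the-envelope, using only the numbers already quoted in this thread
(and assuming, as above, that each 100 KB file really eats a full 128 KB
block, which I haven't verified against what ZFS actually does):

file_size = 100 * 1024        # the OP's ~100 KB files
access_time = 0.016           # ~16 ms per disk access, incl. transfer

print(file_size / access_time / 1e6)   # ~6.4 MB/s -- the "about 6 MB/sec" above

block = 128 * 1024
print((block - file_size) / block)     # ~0.22 -- roughly the "20%" capacity loss,
                                       # if every file consumes a whole block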



Since the OP's IO pattern is mostly reads the cpu load may not be an
issue but writes suffer a serious penalty if you are not cpu-rich.


I'm not sure why that would be the case even if the integrity-checking
*were* CPU-intensive, since the overhead to check the integrity on a
read should be just about the same as the overhead to generate the
checksum on a write. True, one must generate it all the way back up to
the system superblock for a write (one reason why I prefer a
log-oriented implementation that can defer and consolidate such
activity), but below the root unless you've got many of the
intermediate-level blocks cached you have to access and validate them on
each read (and with on the order of a billion files, my guess is that
needed directory data will quite frequently not be cached).


In this case ZFS would also be doing the raid. If you're using a raid
controller the rules change, as do the features.

~F
#13
April 25th 07, 04:48 AM, posted to comp.arch.storage
Bill Todd

Faeandar wrote:
On Fri, 20 Apr 2007 22:24:31 -0400, Bill Todd wrote:

Faeandar wrote:

...

ZFS is great in concept and I think they are on the right path,
however it's not yet ready for primetime imo.

Though (as I already noted) I don't have any direct experience with it,
my impression is that people are using it in production systems
successfully - so a description of your specific reservations would be
useful.

The integrated integrity checking is extremely cpu intensive.

I suspect that you're mistaken: IIRC it occurs as part of an
already-existing data copy operation at a very low level in the disk
read/write routines, and at close to memory-streaming speeds (i.e.,
mostly using CPU cycles that are being used anyway just to copy the data).


According to Sun, the integrity check and file system self-healing
process is a permanent background process as well as the foreground
checks you mention.


Yes, but there's no reason for that to take up very much in the way of
resources (e.g., the last study I saw in this area indicated that a full
integrity sweep once every couple of months was more than adequate to
cut the incidence of latent errors - unnoticed corruption that jumps up
to bite you after the *good* copy dies - down by at least an order of
magnitude).

In the case of a system that is completely idle
of actual IO, the system hung at around 40% CPU performing these
consistency checks.


That's a ridiculous amount to use as the default (well, at least for
production software - if they're still using pure idle time heavily to
reassure customers due to ZFS's newness that might explain it), and I
would be very surprised if it weren't at least tunable to a much lesser
amount.

When IO is going on it backs off to some extent
but it's still a hog.
There is no disk IO that is close to memory speeds. The consistency
checks and verifications involve checking data on platter.


Of course they do, and I never suggested otherwise. What can move at
close to memory speeds is the *CPU* overhead involved in the checks, and
it can piggyback on a memory-to-memory data move that is happening
anyway (such that few *extra* CPU cycles beyond what would already be
consumed in the move are required).
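
A toy illustration of what I mean (this is not ZFS code, and the
Fletcher-style sum here is just a stand-in): fold the checksum into the same
pass over the data that the copy is already making, and the extra CPU work
per byte is small.

def copy_with_checksum(src, dst):
    # one pass over the data: copy it and accumulate a Fletcher-style sum
    # in the same loop, so the check rides along with the data move
    a = b = 0
    for i, byte in enumerate(src):
        dst[i] = byte
        a = (a + byte) % 65535
        b = (b + a) % 65535
    return a, b

data = bytes(range(256)) * 512          # 128 KB of sample data
out = bytearray(len(data))
print(copy_with_checksum(data, out))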


It does
not cluster yet, at least not as of 2 weeks ago.

It was not clear that this was a requirement in this case - but since
the OP mentioned clustering, I mentioned the soon-to-arrive capability.


Soon-to-arrive means 1.0. It's worth noting points like that. While
ZFS is great in design it is still new.


Everything starts off new. The question is when a product becomes
usable in production, and that's something that's measured far more by
customer experience than by a clock.

My impression is that *some* customers have workloads that have found
ZFS to be very stable already, while others push corner cases that are
still uncovering bugs (I haven't heard of any for a while that involve
actual data corruption, but I haven't been paying close attention, either).


Many file systems
grow dynamically so I would make that a check in ZFS's column.

I'm not sure they grow dynamically quite as painlessly as ZFS does:
usually, you first have to arrange to expand the underlying disk storage
at the volume-manager level, and then have to incorporate the increase
in volume size into the file system.


It depends on the system, but these days those tasks are fairly
simple. ZFS gets this extreme ease of use by not having a RAID
controller between itself and the disks, which means a jbod (not
everyone is keen on that yet).


Their loss, unless they need the raw single-operation low-latency
write-through performance that NVRAM hardware assist can give to a
hardware RAID box.

....

It
still has to map blocks to files, which is the long part of a backup.
I know when NetApp backups occur it takes the snapshot and then tries
to do a dump. If you have millions of files it can be hours before
data is actually transferred, I believe ZFS is no different.

Actually, it is, since it allows block sizes up to 128 KB (vs. 4 KB for
WAFL IIRC, though if WAFL does a good job of defragmenting files the
difference may not be too substantial). With the OP's 100 KB file
sizes, this means that each file can be accessed (backed up) with a
single disk access, yielding a fairly respectable backup bandwidth of
about 6 MB/sec (assuming that such an access takes about 16 ms. for a
7200 rpm drive, including transfer time, and that the associated
directory accesses can be batched during the scan).


It's not the transfer I was referring to but rather the mapping (phases
I and II of a dump). I believe ZFS still has to map the files to
blocks even if it's a one-to-one ratio.


The one-to-one ratio is what makes the difference (at least in this
particular case, and even in general the ratio is considerably better
than a non-extent-based file system that uses a 4 KB block size).
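
Just to put numbers on that ratio (simple arithmetic, nothing ZFS-specific
beyond the 128 KB maximum block size already mentioned):

import math

file_size = 100 * 1024                     # the OP's ~100 KB files
for block in (4 * 1024, 128 * 1024):       # 4 KB (WAFL-style) vs 128 KB (ZFS max)
    print(block // 1024, "KB blocks:", math.ceil(file_size / block), "per file")
# 4 KB blocks: 25 to map per file; 128 KB blocks: 1 per file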

At millions of files this can
be painful.


Not with ZFS in this instance, unless one constructs a pathological case
with a deep directory structure and only one or two files mapped per
deep path traversal: otherwise, the mapping can proceed at less than
one mapping access per 100 KB file (if each leaf directory has multiple
files to be mapped), plus the eventual transfer access itself.
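
A rough model of that scan cost (the files-per-directory figure below is an
assumption purely for illustration, not something from the thread):

files_per_leaf_dir = 50      # assumed; one directory read maps many files
access_ms = 16               # the per-access figure quoted earlier

scan_ms_per_file = access_ms / files_per_leaf_dir
print(scan_ms_per_file)      # ~0.32 ms of mapping per file vs ~16 ms to transfer it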

Once those phases are done the transfer rates are
probably full pipe.
Also, in the 100KB file to 128KB block ratio you lose what, 20% of
your capacity? Big trade off in some environments.


But likely not in this one: it's just not that large a system, nor are
the disks very expensive if they're SATA.


Since the OP's IO pattern is mostly reads the cpu load may not be an
issue but writes suffer a serious penalty if you are not cpu-rich.

I'm not sure why that would be the case even if the integrity-checking
*were* CPU-intensive, since the overhead to check the integrity on a
read should be just about the same as the overhead to generate the
checksum on a write. True, one must generate it all the way back up to
the system superblock for a write (one reason why I prefer a
log-oriented implementation that can defer and consolidate such
activity), but below the root unless you've got many of the
intermediate-level blocks cached you have to access and validate them on
each read (and with on the order of a billion files, my guess is that
needed directory data will quite frequently not be cached).


In this case ZFS would also be doing the raid. If you're using a raid
controller the rules change, as do the features.


I have no idea how your comment is meant to relate to the material it's
responding to above.

- bill
 



